Robotics has a scaling law — just not the one you’re hoping for.
Video gives robots eyes. World models give them imagination. Neither gives them hands. The scaling law is real. It just runs on a fuel you can’t download.
The scaling law showed up for robots. It just brought a different curve.
Everyone agrees on the diagnosis: robots have no internet. There’s no pile of robot experience on the web the way there was a pile of text. So the field has spent two years on the follow-up — if you can’t download the data, can you at least scale it?
The answer is more interesting than yes or no. Robotics does have a scaling law — feed these models more and they get predictably better, and by at least one 327-paper meta-analysis, often faster than language models do. But it’s not the curve you’re hoping for. The axis that’s cheap to scale isn’t the axis that’s scarce. And the scarcest thing doesn’t live in any dataset you can grow.
Pixels. A billion hours of internet video, plus every frame a world model can dream up on demand. Abundant. Growing. Nearly free.
Actions and force. First-person motor commands and contact signals that live on no website and in no video. What the scaling law actually runs on.
Three things I’ve come to believe about how this plays out — and then the bet I’d make.
II · THE CHEAP AXIS
You can’t teleoperate your way to an internet.
The gold-standard way to get robot data is teleoperation: a human in a VR rig or on a controller, puppeting the robot through a task while every joint angle and camera frame gets logged. It works — it’s how most of the good datasets got built. It’s also physically capped. As Jim Fan of NVIDIA likes to point out, teleoperation fundamentally doesn’t scale: there are 24 hours in a robot’s day, and a human has to be on the other end of every one of them.
Put that next to the numbers.
| Training data | Scale | Can you download it? |
|---|---|---|
| Usable robot-manipulation data | ~300,000 hours | No — collected one episode at a time |
| Internet video | ~1,000,000,000 hours | Yes |
| Text (LLM pre-training) | ~300,000,000,000,000 tokens | Yes |
Bessemer puts all the usable robot-manipulation data in the world at around 300,000 hours. The internet holds roughly a billion hours of video. Text has 300 trillion tokens. The gap isn’t incremental, it’s structural: more than three orders of magnitude against video alone, and you’re never going to teleoperate your way across it.
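To make the size of that gap concrete, here is a quick back-of-the-envelope sketch. It uses only the two hour counts quoted above, which are rough estimates, plus one assumption of my own: a single robot teleoperated around the clock, 365 days a year. The function names are just for illustration.

```python
import math

# Figures quoted in the text (rough, order-of-magnitude estimates).
ROBOT_HOURS = 300_000            # usable robot-manipulation data
VIDEO_HOURS = 1_000_000_000      # internet video

def orders_of_magnitude(small: float, large: float) -> float:
    """Base-10 log of the ratio between two quantities."""
    return math.log10(large / small)

def robot_years_to_match(target_hours: float, hours_per_day: float = 24.0) -> float:
    """Years one robot needs to log target_hours if it never stops (assumption)."""
    return target_hours / (hours_per_day * 365)

gap = orders_of_magnitude(ROBOT_HOURS, VIDEO_HOURS)
years = robot_years_to_match(VIDEO_HOURS)

print(f"gap vs. video: {gap:.2f} orders of magnitude")
print(f"robot-years at 24 h/day: {years:,.0f}")
```

Under those assumptions the gap comes out near three and a half orders of magnitude, and matching the video pile would take over a hundred thousand robot-years of nonstop teleoperation, each hour with a human on the other end.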
So the smart money stopped trying to collect the missing internet and started faking one, in two ways. The first mines the human video that already exists — every cooking tutorial is a person doing dexterous physical work. The second generates video outright, from models that learned how the world looks and can now produce more of it on demand.
Both bet that pixels, which you have, can stand in for actions, which you don’t. Over the last year, both started to pay off — concretely enough to put numbers on.
III · EYES
Video is a better teacher than it should be.
The obvious objection to learning from video is that video has no actions in it. You can watch a thousand hours of someone slicing onions and never recover the motor command their arm sent. And their arm isn’t your arm — a human hand won’t map cleanly onto a two-finger gripper. That’s the embodiment gap, and for a while it looked disqualifying.
It mostly isn’t. The trick is to let the model infer the missing action from the footage itself: look at two consecutive frames, guess what changed between them, and treat that guess as a latent action. Train on enough guessed actions and you get a representation that transfers better than it has any business doing.
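Here is a toy sketch of that idea, not LAPA's actual implementation. I'm assuming a fixed random projection as a stand-in for a learned frame encoder and a small random codebook as a stand-in for a learned set of discrete latent actions. Every name and size below is hypothetical, and a real system would train both pieces rather than sample them.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 64      # flattened toy "frame" size (assumption)
EMBED_DIM = 8       # size of the frame embedding (assumption)
NUM_ACTIONS = 4     # number of discrete latent actions (assumption)

# Stand-ins for learned parameters; a real system would train both.
encoder = rng.normal(size=(FRAME_DIM, EMBED_DIM))
codebook = rng.normal(size=(NUM_ACTIONS, EMBED_DIM))

def latent_action(frame_t, frame_next):
    """Return the index of the codebook entry closest to the embedded change."""
    delta = (frame_next - frame_t) @ encoder            # what changed, in embedding space
    distances = np.linalg.norm(codebook - delta, axis=1)
    return int(np.argmin(distances))

def label_video(frames):
    """Assign one guessed latent action to each pair of consecutive frames."""
    return [latent_action(a, b) for a, b in zip(frames[:-1], frames[1:])]

video = rng.normal(size=(5, FRAME_DIM))   # five synthetic frames, no real footage
actions = label_video(video)
print(actions)   # four latent-action indices, one per frame pair
```

The point of the sketch is the shape of the pipeline: no motor command ever enters it. The "action" is whatever label best explains the change between two frames, and those labels are what the policy gets trained on.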
How much better? Here’s the result that got my attention. LAPA trains on nothing but raw video — zero robot commands — and still beats a top vision-language-action model (one that sees, reads an instruction, and acts) that learned from real robot actions.
The model that never saw a real action beat the one that did.
There’s a catch, and the catch is the good part. LAPA won on language and on objects it had never seen — but it still flubbed the actual grasp more often than the action-trained model did. Video didn’t buy the robot its grip. It bought its eyes. Hold onto that; it comes back.
And it isn’t a fluke. Run the idea the other way around and you land in the same place: V-JEPA 2 watches on the order of a million hours of web video, then gets all of 62 hours on a real robot — and after that it picks up and sets down things it’s never seen, nobody having trained it on the task, 65 to 80 percent of the time. Sixty-two hours is nothing. The team behind EgoMimic says it flat out: to them, an hour of human-hand video beats an hour of their own robot’s.
This is what I mean by eyes. Video doesn’t teach a robot to act. It teaches it to see — to carry priors, the built-in expectations a model brings to a new scene, about objects and motion and what a plausible physical sequence looks like. That’s most of perception, handed over almost for free.
IV · IMAGINATION
World models run on compute, not human time.
The second factory is stranger, and I think more important. A world model learns to generate the future: give it the current frame and an action, and it renders what happens next, like a learned video game of reality. Once you have one, you stop collecting attempts. You imagine them.
The cleanest version I’ve seen goes like this. You hand a humanoid one teleoperated demo — a single pick-and-place, once — then let a video world model dream up the rest: new objects, new layouts, rooms the robot has never been in. You train on the dreams. It comes back having learned 22 behaviors nobody showed it. From one demo. The write-up that nails this, DreamGen, calls those dreamed episodes “neural trajectories,” and the multiplication ran north of 300×.
The curve is the part worth dwelling on. The number of imagined runs and the downstream success rate rise together, log-linearly: put the number of runs on a log axis and success climbs in a straight line, the same shape the language-model scaling curves have. That’s a new axis to scale, and it’s the first one in robotics that human time doesn’t bottleneck. The old axis cost a teleoperator-hour per unit of data. This one costs compute: you pay in GPU time instead of human hours, at least for the kinds of data the generator already understands.
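For readers who want to see what "log-linear" means in practice, here is a minimal sketch. The numbers are made up for illustration and are not results from DreamGen or any other paper. The fit is an ordinary least-squares line through success rate versus the base-10 log of the number of imagined runs.

```python
import numpy as np

# Synthetic, made-up numbers for illustration only; not published results.
num_trajectories = np.array([100, 1_000, 10_000, 100_000])
success_rate = np.array([0.20, 0.32, 0.44, 0.56])

# Log-linear model: success ≈ slope * log10(N) + intercept
slope, intercept = np.polyfit(np.log10(num_trajectories), success_rate, deg=1)

def predict(n):
    """Success rate the fitted line implies for n imagined runs."""
    return slope * np.log10(n) + intercept

print(f"gain per 10x more imagined runs: {slope:.2f}")
print(f"extrapolated success at 1M runs: {predict(1_000_000):.2f}")
```

In this toy data every tenfold increase in imagined runs buys the same fixed bump in success. The extrapolation is just the line continued: a success rate can't exceed 1, so any real curve has to flatten eventually, and nothing here says where.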
If the scaling story holds, this is the axis that bends it — not bigger teleop farms, not more humans in headsets. Generated experience, scaled with the one input that compounds.
V · THE MISSING HANDS
Both factories deal in pixels. A pixel isn’t a hand.
That’s the optimistic case, and it’s strong: perception nearly for free, imagination scaled with compute. Which is the moment to ask what’s still missing. It’s hands.
A pixel is what something looks like, not what it feels like or how it behaves the instant you touch it. A video model can render a hand closing on a mug with every photon right and the physics wrong — the grip that would actually slip, the force the lift actually takes. Watch a generated clip of a robot folding a shirt and the cloth settles into creases no real fabric makes. World models are weakest right where manipulation is hardest: contact, deformation, the long horizon where small errors compound. The prettiest generated clip in the world will still let two objects pass through each other unless something stops it.
VI · THE BET
The one axis whose shape is physics.
So which axis actually pays off? My bet is the one nobody can opt out of: time. The world models that capture physical consistency won’t be the ones that paint the prettiest clip. They’ll be the ones built to respect the arrow of time — to generate the next state from the last, conditioned on an action, the way a language model writes the next word. Cause before effect. Build it that way and you’re not really generating video; you’re learning a dynamics model, a model of how the world moves, which is the whole point of a world model. Temporal consistency doesn’t get bolted on afterward. It falls out of the causal structure.
Generate the whole clip in one shot, every frame leaning on every other. Footage that looks right and moves wrong: objects pass through each other, contacts never resolve. Better demos today.
Predict the next state from the last, conditioned on an action. Not footage but a dynamics model, whatever generator rides on top. Time runs one way, and that keeps it honest.
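The causal side of that contrast can be sketched in a few lines. This is a toy under loud assumptions: the "world" is a point mass pushed by a force, and the hand-written `step` function stands in for a learned dynamics model. The names and the time step are hypothetical. What the sketch is meant to show is the structure: each state is produced from the previous state and the current action, nothing else.

```python
import numpy as np

def step(state, action, dt=0.1):
    """Toy action-conditioned dynamics: a point mass pushed by a force.

    state = (position, velocity); action = force. A learned world model
    would replace this hand-written function (assumption for illustration).
    """
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return np.array([position, velocity])

def rollout(initial_state, actions):
    """Generate states one at a time; each depends only on the past."""
    states = [np.asarray(initial_state, dtype=float)]
    for action in actions:
        states.append(step(states[-1], action))
    return np.stack(states)

trajectory = rollout([0.0, 0.0], [1.0, 1.0, 0.0, -1.0])

# Causality check: changing only the last action leaves earlier states untouched.
altered = rollout([0.0, 0.0], [1.0, 1.0, 0.0, 5.0])
print(np.allclose(trajectory[:4], altered[:4]))
```

A whole-clip generator offers no such guarantee by construction, since every frame can depend on every other. Here the future simply has no path back into the past.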
One thing I want to head off: this isn’t “autoregression beats diffusion.” Those are orthogonal. You can make a diffusion model causal — diffusion forcing already does — and you can predict causally in a latent space without rendering a single pixel, which is exactly what V-JEPA does. The bet isn’t on the machinery. It’s on the conditioning: causal, action by action, time running one way.
And no, this doesn’t feel force — no pixel model does, which was the whole of the last section. But getting the dynamics right is the hard part of physical consistency, and it’s the part the prettiest acausal model gets wrong. A world that won’t teleport objects or pass them through each other is a world you can ground with touch later, instead of one that hallucinates the rest away. Causality is the scaffold the force data finally has something to attach to.
I’ll be honest about the evidence, because it isn’t closed. The cleanest datapoint, GR-2, scores 97.7% across the tasks it trained on — but that’s in-distribution success, not proof it generalizes. Nothing today is causal, contact-rich, and generalizing all at once. That gap is exactly the bet.
So here’s the version with a date on it: the first robust contact-rich manipulation — reliably handling a deformable object, reorienting something in the hand — comes from a causal dynamics model, not from scaling a whole-clip video generator. If it’s the other way in two years, I was wrong. Of every axis you could scale in robotics, temporal causality is the one I can find whose shape is physics and not aesthetics. That’s where I’d put the chips.