
Three world-model fights,
one question.

Whether to generate pixels, whether video models grasp physics, whether an LLM is already a world model. They look like three debates. They’re the same question in three costumes.

By Peike Li · June 2026 · After reading too many of these arguments back to back

There are three loud arguments about world models right now, and the field treats them as separate debates. I don’t think they are. They’re the same question wearing three different costumes — and once you spot the costume, the answer to all three gets a lot clearer.

Costume one: should a world model generate pixels, or predict in some abstract space and never bother rendering? Costume two: do video models like Sora actually learn physics, or just learn what physics looks like? Costume three: is a large language model (LLM) already a world model, just from predicting the next word?

Strip the costumes off and each one asks the same thing: does getting the output right force you to get the world right? Predict the next pixel, the next frame, the next token well enough — are you forced to understand what made it? My read of the evidence: not automatically. Not yet. And the place it breaks is the same in all three.

I · PIXELS

Should a world model even draw the world?

Yann LeCun has been the loudest “no.” When OpenAI shipped Sora as a “world simulator” in early 2024, he called the whole idea of modeling the world by generating pixels “as wasteful and doomed to failure” as the largely abandoned idea of “analysis by synthesis.” His alternative is JEPA (Joint Embedding Predictive Architecture): predict in an abstract latent space (a compressed internal representation, not pixels) and throw away the detail you can’t predict anyway. He now has real money behind it: $1.03B, the seed round for AMI Labs, the company he left Meta to start, whose pitch is a world model that doesn’t render.

Here’s the part the headlines miss: the two sides agree more than they let on. LeCun himself says there’s “no question that world models should perform prediction in latent space.” Everyone’s predicting in latent space. The actual fight is narrower: do you also need a decoder that turns that latent state back into something you can watch? OpenAI’s Sora camp and Eric Xing’s GLP (Generative Latent Prediction) camp say yes: you render in order to ground and check the prediction. LeCun says the decoder is a tax you pay to model details that don’t matter.

So costume one, undressed: is reconstructing the observation a necessary part of understanding the world, or an expensive distraction? Hold that thought.

II · PHYSICS

Looking right is not being right.

This is where it gets measured, and the cleanest test I’ve seen is almost unfairly simple. Train a video model on a tiny, tidy physics world (balls moving and colliding), then check whether it learned the laws or just memorized the clips. On the physics it trained on, it’s basically perfect: 0.012 error. Step outside that distribution and the error jumps to 0.427, roughly 35 times worse, and throwing more data and parameters at it barely helps. What the model is really doing is reaching for the nearest clip it saw, not running the rule. (Small model, toy 2D world, not Sora-scale. But the pattern keeps showing up.) That’s the phyworld result, ICML 2025.
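To make “nearest clip versus rule” concrete, here is a minimal sketch. It is not the phyworld setup: it assumes a 1D ball in uniform motion, a memorizer that copies the displacement of the closest training velocity, and a second model that simply applies the rule. The class names, velocity ranges, and time step are all hypothetical choices made for illustration.

```python
import random

DT = 0.1  # time step in arbitrary units (assumed for this toy)


def true_step(x, v):
    """Ground-truth rule: uniform motion for one time step."""
    return x + v * DT


def make_clips(n, v_lo, v_hi, rng):
    """Sample n (position, velocity, next_position) triples."""
    clips = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        v = rng.uniform(v_lo, v_hi)
        clips.append((x, v, true_step(x, v)))
    return clips


class NearestClipModel:
    """Memorizer: reuses the displacement of the closest training velocity."""

    def __init__(self, clips):
        self.memory = [(v, x_next - x) for x, v, x_next in clips]

    def predict(self, x, v):
        _, displacement = min(self.memory, key=lambda m: abs(m[0] - v))
        return x + displacement


class RuleModel:
    """Simulator: applies the law of motion directly."""

    def predict(self, x, v):
        return x + v * DT


def mean_abs_error(model, clips):
    total = sum(abs(model.predict(x, v) - x_next) for x, v, x_next in clips)
    return total / len(clips)


rng = random.Random(0)
train = make_clips(500, 1.0, 2.0, rng)     # velocities seen in training
test_id = make_clips(200, 1.0, 2.0, rng)   # same velocity range
test_ood = make_clips(200, 4.0, 5.0, rng)  # faster than anything in training

memorizer = NearestClipModel(train)
rule = RuleModel()

id_err = mean_abs_error(memorizer, test_id)
ood_err = mean_abs_error(memorizer, test_ood)
rule_ood_err = mean_abs_error(rule, test_ood)

print(f"memorizer, in-distribution: {id_err:.4f}")
print(f"memorizer, out-of-distribution: {ood_err:.4f}")
print(f"rule model, out-of-distribution: {rule_ood_err:.4f}")
```

In this toy the memorizer looks nearly perfect on velocities it has seen and degrades sharply on faster balls, while the rule model is unaffected. That mirrors the shape of the result, not its numbers.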

[Sora] is at best a semi-reliable guide to how the world looks. These two—how the world works, and how the world looks—are fundamentally different.— Gary Marcus, ‘No, Sora has not learned physics’

Even OpenAI’s own Sora report admits it: the model can make a man eat a burger and leave bite marks, but glass shatters wrong and objects pop into existence. It nails what the world looks like and fumbles how it works. That’s costume two. It’s costume one with a number attached: getting the pixels right (looking right) isn’t getting the state right (being right). You only find out at the edges.

III · TOKENS

Is the next word enough?

Run the same test on language. Ilya Sutskever’s famous claim is that “predicting the next token well means that you understand the underlying reality that led to the creation of that token.” And there’s real evidence for a soft version of this. Train a small model only on Othello game moves — no board, no rules — and you can read the actual board state back out of its internals (a “probe” is just a little classifier that checks what the model secretly represents). That probe’s error drops from 26.2% on an untrained model to 1.7% on a trained one, and if you edit that internal board, the model’s predictions change with it. There’s a little world in there, and it’s causal.
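For readers who have not seen a probe, here is a minimal sketch under heavy assumptions. The “hidden states” are synthetic vectors, not activations from a real model, and the probe is a plain perceptron, which may differ from the probe used in the Othello work. The function names, the dimension, and the noise level are hypothetical. The only point is the contrast: a probe fit on states that encode a label reads it back out, and a probe fit on states that do not encode it sits near chance.

```python
import random

DIM = 8  # size of the made-up hidden state (assumption)


def make_states(n, encodes_label, rng):
    """Return (hidden_state, label) pairs; label is +1 or -1 for one square."""
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(DIM)]
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        noise = [rng.gauss(0.0, 0.5) for _ in range(DIM)]
        if encodes_label:
            # "Trained" stand-in: the label is written along one direction.
            state = [label * d + e for d, e in zip(direction, noise)]
        else:
            # "Untrained" stand-in: the state carries nothing about the label.
            state = noise
        data.append((state, label))
    return data


def score(w, b, state):
    return sum(wi * si for wi, si in zip(w, state)) + b


def train_probe(data, epochs=20):
    """Perceptron: a linear classifier fit on frozen hidden states."""
    w = [0.0] * DIM
    b = 0.0
    for _ in range(epochs):
        for state, label in data:
            if label * score(w, b, state) <= 0:  # misclassified: nudge boundary
                w = [wi + label * si for wi, si in zip(w, state)]
                b += label
    return w, b


def probe_error(probe, data):
    """Fraction of held-out states where the probe reads the wrong label."""
    w, b = probe
    wrong = sum(1 for state, label in data if label * score(w, b, state) <= 0)
    return wrong / len(data)


def run(encodes_label, seed):
    rng = random.Random(seed)
    train = make_states(400, encodes_label, rng)
    test = make_states(400, encodes_label, rng)
    return probe_error(train_probe(train), test)


trained_err = run(encodes_label=True, seed=0)
untrained_err = run(encodes_label=False, seed=0)
print(f"probe error, states that encode the square: {trained_err:.3f}")
print(f"probe error, states that do not: {untrained_err:.3f}")
```

A low probe error only shows the information is readable. The causal half of the claim needs the intervention step described above (edit the internal state, watch the predictions change), which this sketch does not attempt.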

That’s real evidence an LLM can grow a little world inside itself. But Othello is a tidy, rule-clean game, and “there’s a board in there” is a long way from “it models the real world.” The skeptical case is simple, and I think it’s mostly right: predicting what a person would say isn’t predicting what will happen. Richard Sutton, who basically founded RL, won’t even grant the premise — mimicking people, he says, “is not really to build a model of the world at all.” And poke at the internals and the thing looks less like one clean model than a pile of local rules — a “bag of heuristics” (Melanie Mitchell’s phrase) that, as Gary Marcus likes to note, can drop a city in the middle of the Atlantic and ace the benchmark anyway. Strong evidence of a real, general world model in there? Not yet.

Same shape. Third time. The next token is right; whether the world behind it is right is exactly what’s in dispute.

IV · THE TELL

Render vs simulate.

World Labs has a clean way to cut this that I think generalizes well past their own work: a renderer outputs what a viewer would see; a simulator outputs what is actually there. Their line about video models is the whole debate in one sentence — the buildings in the drone shot look flawless from above, but “try to drive through the city below and they fall apart.”

Pixels
The surface it predicts: The next pixels / frame
In-distribution: Looks like 3D understanding
The OOD tell: Drive through the city and it falls apart

Physics
The surface it predicts: The next frame of motion
In-distribution: 0.012 error, near perfect
The OOD tell: 0.427 error, copies the nearest example

Tokens
The surface it predicts: The next token
In-distribution: A readable, causal board state
The OOD tell: Cities in the Atlantic; a bag of heuristics

Three debates, one structure. The surface prediction is a real signal that there’s structure underneath — but matching the surface is necessary, not sufficient, for having recovered it.

That’s the unifying move. These aren’t identical fights — what counts as “the world” differs in each, which is why the three falsifiers look so different — but the shape is the same. All three are render-vs-simulate, and the tell — the thing that exposes which one you’ve actually got — is the same every time: in-distribution, a renderer is indistinguishable from a simulator; out-of-distribution, only the simulator survives. Sora looks like physics until you ask for a collision it never saw. Call it the OOD tell. (It’s a tendency, not a law — at internet scale some physics partially generalizes, so the line blurs. But the direction holds.)

And that reframes the question worth arguing about. It isn’t “is X a world model” — a word fight nobody wins, because the phrase already means about five different things. It’s “does X get the state right when the surface stops being a reliable guide?” That one you can measure.
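One simple way to put a number on that question is the ratio of out-of-distribution error to in-distribution error. The sketch below is only an illustration: the function name is hypothetical, the 10x cutoff for “order of magnitude” is an assumption, and the inputs are the two phyworld error figures quoted earlier.

```python
def ood_degradation(in_dist_error, ood_error):
    """How many times worse the error gets outside the training distribution."""
    if in_dist_error <= 0:
        raise ValueError("in-distribution error must be positive")
    return ood_error / in_dist_error


ORDER_OF_MAGNITUDE = 10.0  # assumed cutoff for "order-of-magnitude" degradation

ratio = ood_degradation(0.012, 0.427)
print(round(ratio, 1))              # 35.6
print(ratio >= ORDER_OF_MAGNITUDE)  # True
```

A ratio near 1 means the model holds up when the surface stops being familiar. A ratio in the tens means it does not.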

V · THREE BETS

Where I’d put my chips.

First — and here I’ll take the other side of my own evidence: I think autoregressive, causal world models plus scale will close the OOD gap. The phyworld result that spooks everyone ran small diffusion models on a toy 2D world — not the recipe I’d bet on. The recipe I’d bet on is the boring one that already worked for language: predict the next state causally, and scale. Wayve’s GAIA-1 already reframed driving as next-token prediction and saw LLM-style scaling laws kick in. My read is that OOD physics is a data-and-objective problem, not a wall. I’m wrong if a frontier autoregressive world model, scaled another order of magnitude, still shows order-of-magnitude OOD physics degradation that doesn’t shrink with scale by 2028.

Second: the “does it actually understand physics” question gets answered by robots before it gets answered by video. A video model that fumbles physics just looks a little off; a robot that fumbles physics drops the cup. Embodiment punishes being wrong, so that’s where the real out-of-distribution pressure — and the real progress — shows up first. I’m wrong if the decisive evidence on physical understanding through 2027 keeps coming from video benchmarks rather than embodied agents.

Third: “world model” becomes the new “AGI” — a phrase so overloaded that serious labs quietly stop putting it in their titles. It already means about five different things; once the marketing catches up, precise people route around it. I’m wrong if “world model” is still the headline term in major model releases through 2027.

None of this needs the words “world model.” It needs one test, asked three ways. In short: stop arguing about whether the thing understands the world, and watch what it does when the world stops looking familiar. :)
