Thoughts · The Bitter Lesson

Three thoughts on the bitter lesson.

“The large language models are learning from training data. It’s not learning from experience.”
Sutton’s 2025 read of his own 2019 essay. (Richard Sutton, Dwarkesh Podcast, September 2025)

Peike Li · May 2026 · ≈ 7 min read

6 years between Sutton’s two chapters: the 2019 essay and the 2025 Dwarkesh re-read.

3 scaling axes the frontier has pivoted to since pre-training hit the data wall: RL, test-time search, multi-agent.

70 years of AI history that taught Sutton the same lesson over and over: chess, Go, speech, vision.

2032

My bet for the year LLM pre-training joins hand-coded grammars on the list of cautionary tales.

I · TWO CHAPTERS

The bitter lesson has two chapters, and most people only know the first one.

Sutton’s first chapter is the 2019 essay. The thesis is that 70 years of AI research show the same pattern over and over: researchers build human knowledge into systems, the systems work for a few years, then a general method that scales with computation comes along and eats them. Sutton runs through chess, Go, speech, vision. Same story. The prescription is short: the two methods that scale arbitrarily are search (try a lot of moves, score them, pick the good ones) and learning (fit a big model to a lot of data via gradient descent). Stop trying to be clever, point compute at those, and wait.
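To make the two methods concrete, here is a toy sketch of each. It is an illustration only, not anything from Sutton's essay: the function names, the one-dimensional "moves", and the linear model are all assumptions chosen to keep the example small. The point is just that both methods get better by spending more compute (more candidates, more gradient steps) rather than more cleverness.

```python
def search(score, candidates):
    """Search: try a lot of candidates, score each one, keep the best."""
    return max(candidates, key=score)


def learn(xs, ys, steps=2000, lr=0.05):
    """Learning: fit y = w*x + b to data by gradient descent on mean squared error.

    The step size lr is an assumption tuned for small inputs; data on a much
    larger scale would need a smaller lr to stay stable.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical usage: pick the "move" closest to a hidden target value,
# then recover the line y = 2x + 1 from five points.
best_move = search(lambda m: -abs(m - 0.3), [0.0, 0.25, 0.9])
w, b = learn([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(best_move, round(w, 3), round(b, 3))
```

Neither function knows anything about the problem it is solving. That is the property the essay is pointing at.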

The second chapter is more recent. On Dwarkesh in September 2025, Sutton updated his own argument. He doesn’t think LLMs are following the lesson. He put it bluntly: “The large language models are learning from training data. It’s not learning from experience.” The field, in his read, has been quoting the 2019 essay while doing roughly the opposite of what it prescribes.

Chapter 1 · 2019

“The two methods that seem to scale arbitrarily in this way are search and learning.”
The Bitter Lesson, March 2019

Chapter 2 · 2025

“The large language models are learning from training data. It’s not learning from experience.”
Dwarkesh Podcast, September 2025

Here are three thoughts on what’s been written between Sutton’s two chapters.

II · THE NEXT CHAPTER

The LLM era is setting up to be the biggest bitter lesson yet.

The 2019 essay reads like a list of cautionary tales. Each example has the same shape: an era worships its dominant paradigm — chess heuristics, hand-tuned grammars, painstaking phoneme decoders — and a general method comes along that scales with compute and renders the careful work irrelevant. The point isn’t that human knowledge is useless. It’s that the most-celebrated paradigm in any given decade tends to be the next one eaten.

The natural prediction follows. LLMs themselves are setting up to be the next chapter.

Pre-training a model on the entire crawlable internet is the largest-scale exercise in baking human knowledge into a system ever attempted. Compared to hand-tuned chess heuristics, it’s enormous — but it’s the same move, scaled up. The model isn’t learning the world; it’s learning what humans have written about the world. The data is bounded, downstream of human cognition, and the data wall is real. That’s what Sutton means by “not learning from experience.”

If the pattern holds — and it has held every previous time — by around 2032 we’ll talk about training models to mimic internet text the way we now talk about hand-coded grammar rules: as a beautiful local optimum that was obviously the wrong thing.

III · WHERE KNOWLEDGE LIVES

Human knowledge never left AI. It just keeps moving upstream.

Whenever someone announces “scale ate the human knowledge,” the next question should be: which human knowledge?

The history of the field is a slow upward migration of where human work happens.

Layer	What humans did	Era of focus	Status today
Feature engineering	Hand-designed inputs to the model	2000s – early 2010s	Eaten by SGD on raw inputs
Architecture design	Layer ordering, attention, residuals	2010s	Mostly dormant since the Transformer
Objective design	Writing reward functions and evaluation criteria	2020s	Partly automated by RL / RLHF
Evaluation & alignment	Picking what “good” means	2025+	Next on the menu

Hand-crafted features got replaced by representations the network learns end-to-end. That layer is gone. Architecture design isn’t gone — convolutions, attention, and self-play were real human-knowledge wins — but nothing in the last three years has displaced the Transformer at iso-compute. Call it a dormant equilibrium.

Shimon Whiteson called all of this out in 2019 as the “Sweet Lesson”: many of the methods that scale — attention, self-play, even reinforcement learning (RL) itself — are themselves human inventions. So “scale beats human knowledge” uses human knowledge to argue against human knowledge, which is incoherent at the meta level.

The honest answer is that the bitter lesson and the sweet lesson are locating the same fight at different layers of abstraction. Each layer of human input gets absorbed by the layer above; the lesson works locally at every layer even when the meta-claim wobbles. Feature engineering lived at a layer scale could eat. Architecture design lives at a layer scale can’t yet reach. Objective design is being partly automated by RL right now. Evaluation and alignment are next.

The bitter lesson isn’t “no human knowledge.” It’s “human knowledge keeps moving upstream, and the layers below get automated.” The work doesn’t disappear. The work moves.

IV · ALWAYS ANOTHER AXIS

The real lesson is “there’s always another axis to scale.”

If you only read the 2019 essay, the bitter lesson sounds like “compute wins.” That’s not quite right. What actually wins is the meta-pattern: every time progress on one scaling axis stalls, the field finds a new one.

Era	Active axis	Wall hit	Next axis
2020 – 24	Pre-training	Data wall	RL fine-tuning
2024 – 25	RL fine-tuning	Reward design	Test-time search
2025 – 26	Test-time search	Diminishing returns on long chains of thought	Multi-agent
2026+	Multi-agent	???	World models / embodied environments

Pre-training scaling worked beautifully until around 2024, when the field started running into the data wall. It pivoted to RL on chains of thought. That worked until reward design became the bottleneck. The next pivot was test-time search — DeepSeek-R1, o1, the whole reasoning-model wave. That’s now hitting its own ceiling: the empirical literature on long chains of thought (see arXiv 2502.12215) shows that more reasoning steps often flip correct answers to wrong, and sequential scaling tops out fast. So the field is pivoting again, this time to multi-agent compositions.

Each time, someone declares “scaling is dead” and means “this axis is saturated.” The two claims are not the same.

To make this falsifiable rather than retroactive astrology: if, by the end of 2027, multi-agent systems still sit at the top of the major capability evals (METR Long Horizon, SWE-bench, GPQA), and no new axis has produced a >10× compute-efficiency gain on any of those evals as tracked by Epoch AI, then the thesis is wrong and the bitter lesson has run out of axes for the first time in its history.
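The falsification condition can be written down as a small predicate, which makes it harder to wriggle out of later. This is a sketch under stated assumptions: the `EvalSnapshot` fields, the axis labels, and the function name are hypothetical, and the numbers in the usage example are made-up placeholders, not real eval results or Epoch AI data.

```python
from dataclasses import dataclass


@dataclass
class EvalSnapshot:
    name: str                  # eval name, e.g. "SWE-bench"
    top_system_axis: str       # scaling axis behind the current top system
    best_new_axis_gain: float  # best compute-efficiency multiple from any new axis (1.0 = none)


def thesis_falsified(snapshots, incumbent_axis="multi-agent", gain_threshold=10.0):
    """Return True if the 'always another axis' thesis fails the stated test.

    Both conditions must hold: the incumbent axis still tops every eval,
    and no new axis exceeds the gain threshold on any eval.
    """
    if not snapshots:
        raise ValueError("need at least one eval snapshot")
    incumbent_on_top = all(s.top_system_axis == incumbent_axis for s in snapshots)
    no_breakout = all(s.best_new_axis_gain <= gain_threshold for s in snapshots)
    return incumbent_on_top and no_breakout


# Placeholder end-2027 readings (invented for illustration only).
readings = [
    EvalSnapshot("METR Long Horizon", "multi-agent", 2.5),
    EvalSnapshot("SWE-bench", "multi-agent", 1.0),
    EvalSnapshot("GPQA", "multi-agent", 4.0),
]
print(thesis_falsified(readings))
```

A single eval where some new axis clears the 10× bar, or where something other than multi-agent sits on top, is enough for the thesis to survive under this reading.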

The real lesson, the more durable one, is that there’s always another axis to scale. Obvious in retrospect, never obvious in the moment. Every time the current axis tops out, the field gets convinced it’s the end of the road. It’s never been the end of the road.

V · THE FOURTH TIME

History happens twice. First as bitter lesson, then as bitter lesson.

If the axis-switching pattern holds, multi-agent will eventually hit its own wall, and what the field reaches for after that is world models — neural systems that learn from interacting with environments instead of from human text. Probably the cleanest test the bitter lesson gets.

The setup is identical to the chess era, the speech era, the vision era. A domain with a long tradition of human-engineered structure (physics engines, hand-tuned simulators, world-state representations) meets a competing approach that throws compute at letting the system learn the structure itself. We’ve watched this exact arc three times. Each time the human-engineered side gets a long head start, the early scaling looks weak, and somewhere in the middle the bitter lesson hits and the curves cross.

The question isn’t whether the curves will cross. The question is whether — given that this is the fourth time we’ll watch the same arc — anyone bets on the bitter lesson early instead of late.

The lesson is the same each time, and somehow it keeps catching us by surprise. Maybe what we should learn from Sutton’s two chapters is that there’s going to be a third chapter and a fourth, and that even the bitter lesson takes longer to arrive than its author expects.