I grabbed my morning coffee and dove into Apple’s new paper, The Illusion of Thinking (okay fine, ChatGPT and I dove into it together). And you know what? None of this really surprised me.
I’ve been building custom “thinking” agents on top of reasoning models. The head-shaped hole in my desk tells you how well it’s going. I found this new research very affirming, and it helped me pivot my approach.
Sure, the results from Apple’s research are dramatic: Large Reasoning Models (LRMs) collapsing on Tower of Hanoi, slacking off on the hard puzzles, overthinking the easy ones. But the main takeaway feels familiar:
Models mimic thought. They don’t actually think.
Let’s break it down.
What Apple Found About LRMs
Apple built a suite of synthetic puzzles: Tower of Hanoi, River Crossing, Blocks World, and more. These puzzles are designed to avoid training contamination and scale difficulty cleanly. When tested on these puzzles, even the most advanced “thinking” models struggled. Here’s what they found:
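To make the setup concrete, here’s a minimal sketch (my own code, not Apple’s) of why these puzzles work so well as benchmarks: instances are generated procedurally, difficulty is a single knob (the disk count), and the optimal answer is known in closed form, so there’s nothing for a model to have memorized from the training set.

```python
# Minimal sketch (my code, not Apple's): procedurally generated Tower of Hanoi
# instances where difficulty is one knob (disk count) and the ground-truth optimal
# solution length is known in closed form, so grading needs no human labels.

def hanoi_instance(n_disks: int) -> dict:
    """Build a puzzle instance: all disks start on peg A, goal is to move them to C."""
    return {
        "n_disks": n_disks,
        "start": {"A": list(range(n_disks, 0, -1)), "B": [], "C": []},
        "goal_peg": "C",
        "optimal_moves": 2 ** n_disks - 1,  # classic closed-form minimum
    }

def solve(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Reference solver: returns the optimal move sequence as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)

if __name__ == "__main__":
    for n in (3, 7, 10):
        inst = hanoi_instance(n)
        assert len(solve(n)) == inst["optimal_moves"]
        print(f"{n} disks -> {inst['optimal_moves']} optimal moves")
```

Because every instance comes with a checkable ground truth, a model can be graded move by move instead of only on its final answer.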
Sharp Accuracy Collapse
Models do great on easy puzzles, then completely fall apart as complexity grows.
“Claude 3.7 Thinking accuracy drops from ~85% (3 disks) to 0% (10 disks)”
(Apple, 2024, Fig. 6, p. 8)
This wasn’t a small dip. It was a cliff.
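One back-of-the-envelope way to see why a cliff (rather than a gentle slope) is at least plausible, and this is my own intuition, not an analysis from the paper: the minimum number of moves grows exponentially with the disk count, so even a high per-move success rate compounds away to almost nothing.

```python
# Back-of-the-envelope intuition (mine, not Apple's analysis): if each move is
# executed correctly with independent probability p, the chance of producing a
# fully correct solution collapses as the required 2^n - 1 moves grow.

per_move_accuracy = 0.99  # assumed value, purely illustrative

for n_disks in (3, 5, 7, 10):
    moves = 2 ** n_disks - 1
    p_perfect = per_move_accuracy ** moves
    print(f"{n_disks:2d} disks: {moves:4d} moves -> P(all correct) ~ {p_perfect:.4%}")
```

Real failure modes are messier than independent per-move errors, but the exponential blow-up in solution length is the backdrop the collapse happens against.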
Inverted Effort Scaling
As problems get harder, models initially reason more… but then give up.
“Upon approaching a critical threshold… models begin reducing their reasoning effort despite increasing problem difficulty.”
(Apple, 2024, p. 8)
Even when they’ve got plenty of token budget left, they stop trying:
“Despite operating well below their generation length limits… models fail to take advantage of additional inference compute.”
(Apple, 2024, p. 8)
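This is easy to check in your own experiments if your provider exposes the reasoning trace: count how long the thinking portion is at each difficulty level and watch for the drop near the collapse point. A rough sketch of that bookkeeping, with made-up records and whitespace splitting as a crude stand-in for a real tokenizer:

```python
# Rough sketch of the effort-vs-difficulty check. The records are made up for
# illustration; in practice they'd hold reasoning traces collected however your
# model provider exposes them.

records = [
    {"n_disks": 3, "reasoning_trace": "move disk 1 to peg C " * 60},
    {"n_disks": 7, "reasoning_trace": "move disk 1 to peg C " * 900},
    {"n_disks": 10, "reasoning_trace": "move disk 1 to peg C " * 200},  # effort drops here
]

for r in sorted(records, key=lambda r: r["n_disks"]):
    effort = len(r["reasoning_trace"].split())  # crude proxy for reasoning tokens
    print(f"{r['n_disks']:2d} disks -> ~{effort} reasoning tokens")
```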
Overthinking Simple Tasks
On easy puzzles, models often ramble, generating unnecessary steps that sometimes hurt their own performance.
“In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives—an ‘overthinking’ phenomenon.”
(Apple, 2024, p. 9)
“Solution accuracy tends to decrease or oscillate as thinking progresses.”
(Apple, 2024, Fig. 7b)
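That finding comes straight out of trace analysis: locate the first correct solution inside the thinking text, then measure how much (often wrong) exploration the model piles on afterwards. A toy version of that check, run on a fabricated trace:

```python
# Toy sketch of the overthinking check: find where a known-correct solution first
# appears in the reasoning trace, then measure how much text follows it. The trace
# below is fabricated for illustration.

correct_solution = "A->C, A->B, C->B, A->C, B->A, B->C, A->C"  # optimal 3-disk sequence

trace = (
    "Let me try: A->C, A->B, C->B, A->C, B->A, B->C, A->C. "
    "Hmm, wait, maybe I should reconsider... what if the first move were A->B instead? "
    "Then A->B, A->C, B->C... no, that strands disk 2. Let me re-derive everything..."
)

idx = trace.find(correct_solution)
if idx == -1:
    print("Correct solution never appears verbatim in the trace.")
else:
    wasted = len(trace) - (idx + len(correct_solution))
    print(f"Correct solution found at character {idx}; "
          f"{wasted} more characters of exploration after it.")
```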
No True Generalization
Even when you hand the model the exact algorithm, it still struggles to execute it.
“Even when we provide the algorithm in the prompt… the observed collapse still occurs at roughly the same point.”
(Apple, 2024, p. 10)
“They fail to learn reusable subroutines or logical abstractions.”
(Apple, 2024, p. 3)
In one case, Claude 3.7 nailed 100+ moves in Tower of Hanoi… but failed after just 4 moves in a simpler River Crossing puzzle. That’s not logic; that’s memorized patterns.
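It’s worth noting how a result like “nailed 100+ moves, then failed after 4” gets measured at all: you run the model’s proposed moves through a rules simulator and record the first illegal one. A rough sketch of that kind of checker (mine, not Apple’s harness):

```python
# Rough sketch (not Apple's harness) of a rules simulator: apply a model's proposed
# Tower of Hanoi moves one at a time and report the index of the first illegal move.

def first_illegal_move(n_disks: int, moves: list[tuple[str, str]]) -> int | None:
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return i  # tried to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return i  # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return None  # every proposed move was legal

if __name__ == "__main__":
    proposed = [("A", "C"), ("A", "B"), ("B", "C")]  # third move puts disk 2 on disk 1
    print(first_illegal_move(3, proposed))  # -> 2
```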
Why This Isn’t Shocking
Benchmarks Mask Reality
Most math/coding benchmarks bleed into training data. That inflates scores and hides fragility. Apple’s puzzles dodge that with clean, procedurally generated tasks.
“Chain-of-Thought” is not Chain-of-Belief
CoT outputs look thoughtful, but they’re just likely text sequences. There’s no inner critic or real logical grounding behind them.
Bigger is not Better
Scaling helps fluency, not reasoning. Even the best models hit a wall.
“We must stop confusing thinking-shaped text with actual thought.”
(Apple, 2024, p. 3)
Why Research Like This Matters
For starters, it validated what I’d been feeling and experiencing firsthand. More importantly, we need independent teams like Apple’s to call these things out, because product marketing sure won’t.
This paper shows:
- The value of controlled evaluation (procedural puzzles, infinite clean data)
- How reasoning trace analysis reveals failure modes long before final answers do
- And that, without this transparency, we’d still be equating “longer prompts” with “deeper thought”.
So… What Now?
Let’s not throw the LRMs out with the hallucination water. They’re still incredibly awesome. But we need to adjust how we work with them:
- Be skeptical of outputs that merely look thoughtful. Dig into the reasoning traces to understand how the model arrived at its answer, and bring in subject-matter experts to review and sign off that the result is, in fact, correct and helpful.
- Don’t rely on final accuracy alone. Prioritize evaluation environments that expose how well models handle new rules and growing complexity.
- Explore hybrid approaches that go beyond language prediction, like adding memory, symbolic reasoning, or external tools to guide real problem-solving (a toy sketch of that pattern follows below).
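On that last point, the shape I keep coming back to looks roughly like this: the model handles the language-y part (interpreting the task and choosing a tool), deterministic code does the execution, and a verifier checks the result before anything gets returned. A toy sketch, with the model call stubbed out since the specific API doesn’t matter here:

```python
# Toy sketch of one hybrid pattern: the language model only decides which tool to
# call and with what arguments; a symbolic solver does the execution; a verifier
# checks the output. The model call is a stub because the exact API is beside the point.

def fake_model_plan(task: str) -> dict:
    """Stand-in for a real LLM call that maps a task description to a tool request."""
    return {"tool": "hanoi_solver", "args": {"n_disks": 10}}

def hanoi_solver(n_disks: int) -> list[tuple[str, str]]:
    """Deterministic execution: no token-by-token 'reasoning' required."""
    def solve(n, src, aux, dst):
        if n == 0:
            return []
        return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)
    return solve(n_disks, "A", "B", "C")

def run(task: str) -> list[tuple[str, str]]:
    plan = fake_model_plan(task)
    if plan["tool"] != "hanoi_solver":
        raise ValueError("model requested an unknown tool")
    moves = hanoi_solver(**plan["args"])
    assert len(moves) == 2 ** plan["args"]["n_disks"] - 1  # cheap external verification
    return moves

print(len(run("Solve Tower of Hanoi with 10 disks")), "moves, externally verified")
```

The point isn’t this particular toy; it’s that the pieces doing the actual “reasoning” are checkable code, and the model’s job shrinks to deciding what to ask for.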
LRMs are master mimics. But real reasoning? Still out of reach. There’s no silver bullet, just a long road ahead, paved with better models and smarter tests.
Until then, we’re all just watching text that looks like thinking.
