I grabbed my morning coffee and dove into Apple’s new paper, The Illusion of Thinking (okay fine, ChatGPT and I dove into it together). And you know what? None of this really surprised me.
I’ve been building custom “thinking” agents on top of reasoning models. The head-shaped hole in my desk tells you how well it’s going. I found this new research very affirming, and it helped me pivot my approach.
Sure, the results from Apple’s research are dramatic: Large Reasoning Models (LRMs) collapsing on Tower of Hanoi, slacking off on the hard puzzles, overthinking the easy ones. But the main takeaway feels familiar:
Models mimic thought. They don’t actually think.
Let’s break it down.
What Apple Found About LRMs
Apple built a suite of synthetic puzzles: Tower of Hanoi, River Crossing, Blocks World, and more. These puzzles are designed to avoid training contamination and scale difficulty cleanly. When tested on these puzzles, even the most advanced “thinking” models struggled. Here’s what they found:
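To make the setup concrete, here’s a minimal sketch (my own code, not Apple’s) of why these puzzles work so well as benchmarks: instances are generated procedurally, difficulty is a single knob (the disk count), and the optimal answer is known in closed form, so there’s nothing for a model to have memorized from the training set.

```python
# Minimal sketch (my code, not Apple's): procedurally generated Tower of Hanoi
# instances where difficulty is one knob (disk count) and the ground-truth optimal
# solution length is known in closed form, so grading needs no human labels.

def hanoi_instance(n_disks: int) -> dict:
    """Build a puzzle instance: all disks start on peg A, goal is to move them to C."""
    return {
        "n_disks": n_disks,
        "start": {"A": list(range(n_disks, 0, -1)), "B": [], "C": []},
        "goal_peg": "C",
        "optimal_moves": 2 ** n_disks - 1,  # classic closed-form minimum
    }

def solve(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Reference solver: returns the optimal move sequence as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)

if __name__ == "__main__":
    for n in (3, 7, 10):
        inst = hanoi_instance(n)
        assert len(solve(n)) == inst["optimal_moves"]
        print(f"{n} disks -> {inst['optimal_moves']} optimal moves")
```

Because every instance comes with a checkable ground truth, a model can be graded move by move instead of only on its final answer.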
Sharp Accuracy Collapse
Models do great on easy puzzles, then completely fall apart as complexity grows.
“Claude 3.7 Thinking accuracy drops from ~85% (3 disks) to 0% (10 disks)”
(Apple, 2024, Fig. 6, p. 8)
This wasn’t a small dip. It was a cliff.
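One back-of-the-envelope way to see why a cliff (rather than a gentle slope) is at least plausible, and this is my own intuition, not an analysis from the paper: the minimum number of moves grows exponentially with the disk count, so even a high per-move success rate compounds away to almost nothing.

```python
# Back-of-the-envelope intuition (mine, not Apple's analysis): if each move is
# executed correctly with independent probability p, the chance of producing a
# fully correct solution collapses as the required 2^n - 1 moves grow.

per_move_accuracy = 0.99  # assumed value, purely illustrative

for n_disks in (3, 5, 7, 10):
    moves = 2 ** n_disks - 1
    p_perfect = per_move_accuracy ** moves
    print(f"{n_disks:2d} disks: {moves:4d} moves -> P(all correct) ~ {p_perfect:.4%}")
```

Real failure modes are messier than independent per-move errors, but the exponential blow-up in solution length is the backdrop the collapse happens against.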
Inverted Effort Scaling
As problems get harder, models initially reason more… but then give up.
“Upon approaching a critical threshold… models begin reducing their reasoning effort despite increasing problem difficulty.”
(Apple, 2024, p. 8)
Even when they’ve got plenty of token budget left, they stop trying:
“Despite operating well below their generation length limits… models fail to take advantage of additional inference compute.”
(Apple, 2024, p. 8)
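This is easy to check in your own experiments if your provider exposes the reasoning trace: count how long the thinking portion is at each difficulty level and watch for the drop near the collapse point. A rough sketch of that bookkeeping, with made-up records and whitespace splitting as a crude stand-in for a real tokenizer:

```python
# Rough sketch of the effort-vs-difficulty check. The records are made up for
# illustration; in practice they'd hold reasoning traces collected however your
# model provider exposes them.

records = [
    {"n_disks": 3, "reasoning_trace": "move disk 1 to peg C " * 60},
    {"n_disks": 7, "reasoning_trace": "move disk 1 to peg C " * 900},
    {"n_disks": 10, "reasoning_trace": "move disk 1 to peg C " * 200},  # effort drops here
]

for r in sorted(records, key=lambda r: r["n_disks"]):
    effort = len(r["reasoning_trace"].split())  # crude proxy for reasoning tokens
    print(f"{r['n_disks']:2d} disks -> ~{effort} reasoning tokens")
```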
Overthinking Simple Tasks
On easy puzzles, models often ramble, generating unnecessary steps that sometimes hurt their own performance.
“In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives—an ‘overthinking’ phenomenon.”
(Apple, 2024, p. 9)
“Solution accuracy tends to decrease or oscillate as thinking progresses.”
(Apple, 2024, Fig. 7b)
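That finding comes straight out of trace analysis: locate the first correct solution inside the thinking text, then measure how much (often wrong) exploration the model piles on afterwards. A toy version of that check, run on a fabricated trace:

```python
# Toy sketch of the overthinking check: find where a known-correct solution first
# appears in the reasoning trace, then measure how much text follows it. The trace
# below is fabricated for illustration.

correct_solution = "A->C, A->B, C->B, A->C, B->A, B->C, A->C"  # optimal 3-disk sequence

trace = (
    "Let me try: A->C, A->B, C->B, A->C, B->A, B->C, A->C. "
    "Hmm, wait, maybe I should reconsider... what if the first move were A->B instead? "
    "Then A->B, A->C, B->C... no, that strands disk 2. Let me re-derive everything..."
)

idx = trace.find(correct_solution)
if idx == -1:
    print("Correct solution never appears verbatim in the trace.")
else:
    wasted = len(trace) - (idx + len(correct_solution))
    print(f"Correct solution found at character {idx}; "
          f"{wasted} more characters of exploration after it.")
```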
No True Generalization
Even when you hand the model the exact algorithm, it still struggles to execute it.
“Even when we provide the algorithm in the prompt… the observed collapse still occurs at roughly the same point.”
(Apple, 2024, p. 10)
“They fail to learn reusable subroutines or logical abstractions.”
(Apple, 2024, p. 3)
In one case, Claude 3.7 nailed 100+ moves in Tower of Hanoi… but failed after just 4 moves in a simpler River Crossing puzzle. That’s not logic; that’s memorized patterns.
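It’s worth noting how a result like “nailed 100+ moves, then failed after 4” gets measured at all: you run the model’s proposed moves through a rules simulator and record the first illegal one. A rough sketch of that kind of checker (mine, not Apple’s harness):

```python
# Rough sketch (not Apple's harness) of a rules simulator: apply a model's proposed
# Tower of Hanoi moves one at a time and report the index of the first illegal move.

def first_illegal_move(n_disks: int, moves: list[tuple[str, str]]) -> int | None:
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return i  # tried to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return i  # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return None  # every proposed move was legal

if __name__ == "__main__":
    proposed = [("A", "C"), ("A", "B"), ("B", "C")]  # third move puts disk 2 on disk 1
    print(first_illegal_move(3, proposed))  # -> 2
```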
Why This Isn’t Shocking
Benchmarks Mask Reality
Most math/coding benchmarks bleed into training data. That inflates scores and hides fragility. Apple’s puzzles dodge that with clean, procedurally generated tasks.
“Chain-of-Thought” is not Chain-of-Belief
CoT outputs look thoughtful, but they’re just likely text sequences. There’s no inner critic or real logical grounding behind them.
Bigger is not Better
Scaling helps fluency, not reasoning. Even the best models hit a wall.
“We must stop confusing thinking-shaped text with actual thought.”
(Apple, 2024, p. 3)
Why Research Like This Matters
For starters, it validated what I’d been feeling and experiencing firsthand. More importantly, we need independent teams like Apple’s to call these things out, because product marketing sure won’t.
This paper shows:
- The value of controlled evaluation (procedural puzzles, infinite clean data)
- How reasoning trace analysis reveals failure modes long before final answers do
- And that, without this transparency, we’d still be equating “longer prompts” with “deeper thought”.
So… What Now?
Let’s not throw the LRMs out with the hallucination water. They’re still incredibly awesome. But we need to adjust how we work with them:
- Be skeptical of outputs that merely look thoughtful. Dig into the reasoning traces to understand how the model arrived at its answer, and bring in subject-matter experts to review and sign off that the result is, in fact, correct and helpful.
- Don’t rely on final accuracy alone. Prioritize evaluation environments that expose how well models handle new rules and growing complexity.
- Explore hybrid approaches that go beyond language prediction, like adding memory, symbolic reasoning, or external tools to guide real problem-solving (a toy sketch of that pattern follows below).
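On that last point, the shape I keep coming back to looks roughly like this: the model handles the language-y part (interpreting the task and choosing a tool), deterministic code does the execution, and a verifier checks the result before anything gets returned. A toy sketch, with the model call stubbed out since the specific API doesn’t matter here:

```python
# Toy sketch of one hybrid pattern: the language model only decides which tool to
# call and with what arguments; a symbolic solver does the execution; a verifier
# checks the output. The model call is a stub because the exact API is beside the point.

def fake_model_plan(task: str) -> dict:
    """Stand-in for a real LLM call that maps a task description to a tool request."""
    return {"tool": "hanoi_solver", "args": {"n_disks": 10}}

def hanoi_solver(n_disks: int) -> list[tuple[str, str]]:
    """Deterministic execution: no token-by-token 'reasoning' required."""
    def solve(n, src, aux, dst):
        if n == 0:
            return []
        return solve(n - 1, src, dst, aux) + [(src, dst)] + solve(n - 1, aux, src, dst)
    return solve(n_disks, "A", "B", "C")

def run(task: str) -> list[tuple[str, str]]:
    plan = fake_model_plan(task)
    if plan["tool"] != "hanoi_solver":
        raise ValueError("model requested an unknown tool")
    moves = hanoi_solver(**plan["args"])
    assert len(moves) == 2 ** plan["args"]["n_disks"] - 1  # cheap external verification
    return moves

print(len(run("Solve Tower of Hanoi with 10 disks")), "moves, externally verified")
```

The point isn’t this particular toy; it’s that the pieces doing the actual “reasoning” are checkable code, and the model’s job shrinks to deciding what to ask for.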
LRMs are master mimics. But real reasoning? Still out of reach. There’s no silver bullet, just a long road ahead, paved with better models and smarter tests.
Until then, we’re all just watching text that looks like thinking.
