The Illusion of Thinking
Gonna toot my own horn here: Apple just dropped this paper today, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity", and it hits on a LOT of points I just made in my post yesterday. Good timing, I guess.
A tl;dr of the paper: Apple studied large reasoning models (LRMs) in controllable puzzle environments like Tower of Hanoi, river crossing, and checker jumping. They found that LRMs fail to show generalizable reasoning in higher-complexity problem spaces and overthink low-complexity problems. They also did some unique work examining the reasoning traces themselves (not just the final output) to see whether the models can follow instructions and move step-by-step through these problems, which revealed surprisingly poor instruction-following.
Similar to what I said yesterday, they confirmed:
- Overthinking/reasoning collapse - accuracy progressively declines as problems become more complex before eventually hitting 0%. And it's not that the problem itself gets conceptually harder... tower_of_hanoi(100) just requires many more of the same steps than tower_of_hanoi(5), despite it being literally a "can you follow the algorithm" problem (see the sketch after this list). Even when the researchers handed the models the algorithm, performance didn't improve!
- Benchmark saturation - We are really focused on mathematical and coding benchmarks. But these suffer from:
  - Data contamination - they found performance degradation from AIME24→25. Are we still just pattern matching?
  - We cannot "see" into the reasoning traces, i.e. the steps taken to reach an answer; we only grade the final output, which happens to be easy to verify for math/coding.
  - We cannot "tune" the difficulty of these problems/benchmarks to see how LRMs handle increasingly complex problems.
- We're missing the forest (general reasoning) for the trees (benchmarks)
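To make the Tower of Hanoi point concrete, here's a minimal sketch (mine, not the paper's evaluation harness) of the textbook recursive solver. The disk count n is the single knob that tunes difficulty: the algorithm is identical at every size, the solution just gets exponentially longer (2^n - 1 moves), which is why the collapse looks like a failure to keep following the same procedure rather than a conceptually harder problem.

```python
# Textbook recursive Tower of Hanoi solver (my sketch, not the paper's code).
# n is the only difficulty knob: same algorithm at every size, just 2^n - 1 moves.

def tower_of_hanoi(n, source="A", target="C", spare="B", moves=None):
    """Append the full move sequence for n disks to `moves` and return it."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    tower_of_hanoi(n - 1, source, spare, target, moves)  # park n-1 disks on the spare peg
    moves.append((source, target))                       # move the largest disk
    tower_of_hanoi(n - 1, spare, target, source, moves)  # bring the n-1 disks back on top
    return moves

print(len(tower_of_hanoi(5)))   # 31 moves   (2^5 - 1)
print(len(tower_of_hanoi(10)))  # 1023 moves (2^10 - 1)
```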
Caveats:
- These are very contrived problem spaces.
- In the appendix you can nitpick their prompting and their evaluation of the reasoning traces (maybe they just suck at prompting).
- Isn't this kind of obvious?
Future considerations:
- How can we evaluate LRMs more robustly? We cannot assume that because a model gave a correct answer, it was reasoning correctly or that its reasoning was useful (a toy sketch of trace-level checking follows).
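One direction (a toy sketch of my own, not something proposed in the paper): grade the trace, not just the answer. For a puzzle like Tower of Hanoi you can replay every move the model claims to make and check that each one is legal, so a lucky or memorized final answer attached to an invalid trace still gets flagged.

```python
# Toy trace-level checker for Tower of Hanoi (my sketch): replay each claimed
# move and verify it is legal, instead of only checking the final answer.

def valid_hanoi_trace(n, moves):
    """Return True iff `moves` legally transfers all n disks from peg A to peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # top of each stack is the end of its list
    for src, dst in moves:
        if not pegs[src]:
            return False                       # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks ended up on the target peg, in order

print(valid_hanoi_trace(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True: legal 2-disk solution
print(valid_hanoi_trace(2, [("A", "C"), ("A", "C")]))              # False: puts disk 2 on disk 1
```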
Some relevant papers that are also skeptical of reasoning models n stuff:
- Models can drop in performance by up to 65% if you literally just change the variables in math problems (cite); see the toy example after this list.
- You can insert gibberish that is completely unrelated to the problem into reasoning models and it doesn't degrade performance at all. Moreover, what a model says it is reasoning about in its trace often doesn't represent what it is actually doing internally (cite).
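To illustrate the kind of perturbation that first bullet points at (a toy example of my own; the cited work builds real benchmark variants far more carefully): template a grade-school word problem and regenerate it with different names and numbers. The arithmetic is identical across variants, so any accuracy spread across them suggests surface-level pattern matching rather than reasoning.

```python
# Toy illustration (mine, not the cited benchmark): generate surface-level
# variants of one word problem. The arithmetic is identical in every variant,
# so a robust reasoner should score the same on all of them.
import random

TEMPLATE = ("{name} has {a} apples and buys {b} more bags with {c} apples each. "
            "How many apples does {name} have now?")

def make_variant(rng):
    name = rng.choice(["Sofia", "Liam", "Priya", "Chen"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # ground truth is computed, so every variant is auto-gradable
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, ans = make_variant(rng)
    print(q, "->", ans)
```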