Future of reasoning models

Some thoughts on what the future of reasoning models looks like. Where are we now, what problems do we need to overcome, and what should these models do next?

Calibration

Current RLVR models often "overthink" on simple problems because of inference-time scaling. We should calibrate how deeply a model thinks to the difficulty of the problem (assuming the model grasps the crux of the user's question). Qwen3 demonstrated that RL-trained models can incorporate this into their loss function.
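
To make this concrete, here's a minimal sketch of what a difficulty-calibrated reward could look like, assuming you already have a correctness verifier and some estimate of problem difficulty. The function name, budget values, and penalty shape are all illustrative assumptions, not how Qwen3 actually implements it.

```python
# A minimal sketch of difficulty-calibrated reward shaping for an RLVR setup.
# All names and budget numbers are assumptions for illustration only.

def length_calibrated_reward(
    answer_correct: bool,
    num_thinking_tokens: int,
    difficulty: float,        # assumed estimate in [0, 1]: 0 = trivial, 1 = very hard
    budget_easy: int = 256,   # thinking-token budget for trivial problems (made up)
    budget_hard: int = 8192,  # thinking-token budget for hard problems (made up)
    penalty_scale: float = 0.5,
) -> float:
    """Correctness reward minus a penalty for thinking far past a difficulty-scaled budget."""
    budget = budget_easy + difficulty * (budget_hard - budget_easy)
    overshoot = max(0.0, num_thinking_tokens - budget) / budget
    base = 1.0 if answer_correct else 0.0
    return base - penalty_scale * min(overshoot, 1.0)

# A correct answer that burned 4x the budget on a trivial problem still gets docked.
print(length_calibrated_reward(True, 1024, difficulty=0.0))  # -> 0.5
```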

Planning

We currently use special <thinking> tokens to separate "reasoning" from the actual answer. The problem with RL, and by extension RLVR, is that once the model takes a step too far off-policy, it can't "backtrack." PPO/GRPO attempt to mitigate this, but for autoregressive models, if the model starts thinking in the "wrong direction," it's doomed from the start. We should explore paradigms where the structure isn't just <thinking> then <answer>, but <plan> then <evaluate> then <thinking> then <answer> (or something of the sort), so the model avoids starting down the wrong path in the first place. Claude Code approximates this through prompting, by telling the model to make a checklist of items to follow, but that doesn't scale particularly well.
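
As a rough illustration, here's what staged decoding over those tags could look like. The `generate` function is a placeholder for whatever sampling backend you use (an assumption, not a real API); the point is that each phase is bounded by a stop tag and can be checked, or re-sampled, before the model commits to the next one.

```python
# Sketch of staged decoding with separate <plan>/<evaluate>/<thinking>/<answer> phases.

STAGES = ["plan", "evaluate", "thinking", "answer"]

def generate(prompt: str, stop: str) -> str:
    """Placeholder: call your model with `stop` as a stop string and return the text."""
    raise NotImplementedError

def staged_decode(question: str) -> str:
    transcript = question
    for stage in STAGES:
        open_tag, close_tag = f"<{stage}>", f"</{stage}>"
        transcript += f"\n{open_tag}\n"
        segment = generate(transcript, stop=close_tag)
        # Here you could score the plan / evaluation and re-sample before continuing,
        # instead of letting a bad opening direction doom the whole rollout.
        transcript += segment + f"\n{close_tag}"
    return transcript
```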

Benchmark Saturation

It's no surprise that we are saturating even the hardest exams we have. I foresee a problem where we over-index on benchmarks: we saturate difficult benchmark k, then move to a more difficult benchmark k+1, then k+2, and so on. But in this process of chasing benchmarks, we "miss the forest for the trees" → we trade general reasoning for really good benchmark scores.

Swarms

In human orgs, adding people doesn't necessarily scale output linearly; bureaucracy, process, and other coordination structures have to exist for that to happen. But is the same true for swarms of agents? Can we orchestrate swarms of Claude Code instances to organize themselves and plan better than an individual agent could? And how do you train collaboration between models into RLVR (similar to how we train models for tool calls)?
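
As a toy illustration of that last question, here's one naive way a shared, verifiable outcome could be credited across every agent in a swarm rollout, analogous to how a whole tool-call trace gets credited. The data shapes are invented for this sketch, not an existing framework.

```python
# Naive sketch: broadcast one shared verifiable reward to all agents in a swarm episode.

from dataclasses import dataclass, field

@dataclass
class AgentTrajectory:
    agent_id: str
    actions: list = field(default_factory=list)  # the agent's messages / edits / tool calls
    reward: float = 0.0

def credit_swarm(trajectories: list[AgentTrajectory], task_passed: bool) -> None:
    """Assign one shared outcome (e.g. the test suite passing) to every agent."""
    outcome = 1.0 if task_passed else 0.0
    for traj in trajectories:
        # Equal credit is the simplest baseline; per-agent credit assignment
        # (who actually contributed?) is exactly the open question.
        traj.reward = outcome
```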