Recursive Reasoning Models: Understanding the Intuitions from a Code Implementation POV

Solving Complex Reasoning Tasks By Thinking In Loops

What if a tiny neural network could solve Sudoku, navigate mazes, and crack abstract reasoning puzzles? Not by being massive, but by recursively thinking again and again until it gets the right answer. This post explores the core ideas behind recursive reasoning models from a code implementation point of view, because I believe it’s much easier to learn by seeing.

The Idea: Recursion Over Scale

The dominant paradigm in AI today is scaling: bigger models, more data, more compute. And for good reason. Models like Claude Opus 4.7 (speculatively ~4T parameters) and GPT-5.5 (speculatively ~2T+ parameters) are state-of-the-art precisely because general-purpose reasoning, coding, writing, and multi-domain understanding genuinely benefit from scale. You need that breadth of knowledge baked into the weights.

But there’s a class of problems where scale alone doesn’t help. Hard constraint-satisfaction puzzles like Sudoku, maze pathfinding, and abstract reasoning (ARC-AGI) require deep iterative reasoning over a specific problem instance. A 671B-parameter LLM like DeepSeek R1 scores 0% on Sudoku-Extreme. A 7M-parameter recursive model scores 87%. These problems don’t need more world knowledge. They need more thinking time on the problem at hand.

At least that’s my intuition, and I could be wrong, but these look like two different paradigms solving two different kinds of problems. Large foundation models excel at breadth and generalization across tasks. Recursive reasoning models excel at depth on specific tasks where the bottleneck is test-time computation, not parameter count. The trade-off is that they require per-task fine-tuning, because each problem (Sudoku, mazes, ARC-AGI benchmarks, and so on) is so niche.

The core idea behind recursive reasoning models is simple. Instead of building a huge network that solves the problem in one forward pass, we build a tiny network that learns to incrementally improve its answer. Given a question, an initial (bad) answer, and some latent “scratchpad” state, the model learns a single operation: make the answer slightly better. Then we apply that operation over and over.

The fundamental algorithm:

# x: the embedded input question (shape [B, 97, 512] for Sudoku)
# y: the model's current answer (same shape, randomly initialized)
# z: latent reasoning scratchpad (same shape, internal working memory)
# net: a single tiny network, reused for every update

# N_sup (halt_max_steps=16): supervision steps, each gets its own loss + grad update.
# T (H_cycles=3): deep recursion cycles per step. Only the last one has gradients.
# n (L_cycles=6): latent refinement steps per cycle.

for step in range(N_sup):                   # up to 16 chances to improve the answer
    x = embed(inputs, puzzle_id)            # compute input embedding once per step

    with torch.no_grad():                   # first T-1 cycles: warm-up, no gradients
        for cycle in range(T - 1):
            for i in range(n):
                z = net(z, y + x)           # refine scratchpad using answer + question
            y = net(y, z)                   # update answer using scratchpad

    for i in range(n):                      # last cycle: gradients flow here
        z = net(z, y + x)
    y = net(y, z)

    loss = cross_entropy(decode(y), ground_truth)
    loss.backward()                         # gradients only from the last cycle
    optimizer.step()                        # one weight update per supervision step
    optimizer.zero_grad()
    y, z = y.detach(), z.detach()           # carry state forward, cut gradient graph
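
To make the pseudocode above concrete, here is a minimal runnable sketch of the same loop on toy data. Everything in it beyond the loop structure is my assumption for illustration: TinyNet is a hypothetical one-block MLP stand-in for the real network, the sizes are shrunk (N_SUP=4 instead of 16, tiny tensors instead of [B, 97, 512]), the puzzle_id argument is dropped, and the inputs and targets are random tokens, so the loss values mean nothing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes chosen for speed (assumption); the post's Sudoku setup is much larger.
B, SEQ, D, VOCAB = 4, 9, 32, 10
N_SUP, T, N = 4, 3, 6          # supervision steps, cycles per step, latent steps per cycle

class TinyNet(nn.Module):
    """Hypothetical stand-in for the shared network (not the real architecture)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, state, injected):
        h = state + injected                 # inject the conditioning by addition
        return self.norm(h + self.mlp(h))    # residual update, normalized to stay bounded

embed = nn.Embedding(VOCAB, D)               # puzzle_id embedding omitted in this sketch
decode = nn.Linear(D, VOCAB)
net = TinyNet(D)
params = [*embed.parameters(), *decode.parameters(), *net.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-3)

tokens = torch.randint(0, VOCAB, (B, SEQ))   # fake puzzle
target = torch.randint(0, VOCAB, (B, SEQ))   # fake solution
y = torch.randn(B, SEQ, D)                   # randomly initialized answer state
z = torch.randn(B, SEQ, D)                   # latent scratchpad

def run_cycle(x, y, z):
    for _ in range(N):
        z = net(z, y + x)                    # refine scratchpad from answer + question
    return net(y, z), z                      # then update the answer from the scratchpad

losses = []
for step in range(N_SUP):
    x = embed(tokens)
    with torch.no_grad():                    # first T-1 cycles: no graph is recorded
        for _ in range(T - 1):
            y, z = run_cycle(x, y, z)
    y, z = run_cycle(x, y, z)                # last cycle: gradients flow here
    loss = F.cross_entropy(decode(y).reshape(-1, VOCAB), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    y, z = y.detach(), z.detach()            # carry state forward, cut the graph
    losses.append(loss.item())

print([round(v, 3) for v in losses])
```

The only point of the sketch is the control flow: one optimizer update per supervision step, and a carried (y, z) that never holds on to an old graph.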

The three levels of recursion

The pseudocode above has three nested loops. Here’s what each one does:

Level 1: Deep Supervision (the outer for step loop, N_sup=16 steps)

Each step runs a full forward pass, computes a loss, and does a backward pass. The model gets up to 16 chances to improve its answer. After each step, y and z are detached from the computation graph but carried forward to the next step. So step 10 benefits from all the reasoning accumulated in steps 1-9, without needing to backpropagate through them. This is the key trick for emulating extreme depth without extreme memory: you only store activations for one step at a time, but the latent state accumulates knowledge across all steps.
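
The "detach but carry forward" behaviour is easy to check on a scalar toy problem. In this sketch (my own illustration, not code from any implementation), a single weight w plays the role of the network and a scalar state plays the role of (y, z). After each step the gradient on w matches the hand-computed one-step derivative, which is what you would expect if nothing from earlier steps is backpropagated.

```python
import torch

w = torch.nn.Parameter(torch.tensor(0.5))    # single shared "weight"
state = torch.tensor(1.0)                    # carried state, stands in for (y, z)

records = []
for step in range(3):
    prev = state                             # value carried over, no graph attached
    state = w * prev + 1.0                   # one "supervision step"
    loss = (state - 3.0) ** 2
    w.grad = None
    loss.backward()
    # Hand-derived gradient if only this step is differentiated:
    # d/dw (w * prev + 1 - 3)^2 = 2 * (state - 3) * prev
    one_step_grad = 2 * (state.item() - 3.0) * prev.item()
    records.append((w.grad.item(), one_step_grad))
    state = state.detach()                   # keep the value, drop the history

for autograd_grad, manual_grad in records:
    print(autograd_grad, manual_grad)
```

The state still moves from 1.0 to 1.5 to 1.75 to 1.875 across steps, so later steps do start from the accumulated result. They just never differentiate through how it got there.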

Level 2: Deep Recursion (T=3 cycles per step)

Within each supervision step, the model runs T=3 full cycles of latent recursion. The first T-1=2 cycles run under torch.no_grad() (free warm-up), and only the last cycle carries gradients for backpropagation. This works because deep supervision trains the model so that each cycle genuinely improves (y, z), regardless of where (y, z) started. The no-grad cycles are “free thinking” that brings (y, z) closer to the correct answer before the learning signal kicks in. If you’re familiar with multi-turn LLMs that use tool calls during chain-of-thought, the idea is similar: tool-call results inform the model’s subsequent reasoning but aren’t themselves part of the loss. Here, cycles 1-2 refine (y, z) and that improved state feeds into cycle 3, but no gradients flow through cycles 1-2 themselves.
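
Here is a small sketch of what the no-grad warm-up buys you, under the assumption that a single nn.Linear followed by tanh stands in for one full cycle (a hypothetical simplification). It runs T=3 applications twice: once with the first T-1 under torch.no_grad(), and once with full backpropagation through all three. The forward result is the same in both cases. What changes is how much graph is stored and which gradient you get.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T = 3
net = nn.Linear(8, 8)                 # hypothetical stand-in for one full cycle
start = torch.randn(2, 8)

def refine(state):
    return torch.tanh(net(state))

# Variant A: first T-1 cycles without gradients, last cycle with gradients.
with torch.no_grad():
    warm = start
    for _ in range(T - 1):
        warm = refine(warm)
out = refine(warm)
out.sum().backward()
grad_truncated = net.weight.grad.clone()

# Variant B: backpropagate through all T cycles (stores every activation).
net.zero_grad()
full = start
for _ in range(T):
    full = refine(full)
full.sum().backward()
grad_full = net.weight.grad.clone()

print(warm.requires_grad, out.requires_grad)     # False True
print(torch.allclose(out, full))                 # True: identical forward values
print((grad_truncated - grad_full).abs().max())  # gap between the two gradients
```

So the truncated gradient is an approximation of the full one, traded for only having to keep the last cycle's activations in memory.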

Level 3: Latent Recursion (n=6 iterations per cycle, then 1 answer update)

Each cycle runs the network 7 times: 6 times to refine the scratchpad (z = net(z, y + x)) and once to update the answer (y = net(y, z)). The net here is a 2-layer Transformer block stack with ~5M parameters. The same weights are used for both the z-update and the y-update. The two are told apart only by what gets injected as input: the z-update sees the answer plus the question (y + x), while the y-update sees only the scratchpad (z). What this network actually does (token mixing via mlp_t or attention, channel mixing via SwiGLU, RMSNorm stabilization) is the subject of the Architecture section below.
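
To show what "one network, two roles" can look like in code, here is a rough sketch of a 2-block shared network. The component names come from the paragraph above (token mixing with an MLP over the sequence axis, channel mixing with SwiGLU, RMSNorm), but the class names, the exact wiring (post-norm residuals, SwiGLU for token mixing too), the hidden sizes, and the small D=64 are all my assumptions, not the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the inverse root-mean-square over the channel axis.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate_up = nn.Linear(dim, 2 * hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.silu(gate) * up)

class Block(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        self.token_mix = SwiGLU(seq_len, seq_len)   # mixes across positions
        self.channel_mix = SwiGLU(dim, 2 * dim)     # mixes across features
        self.norm_t = RMSNorm(dim)
        self.norm_c = RMSNorm(dim)

    def forward(self, h):                           # h: [B, seq_len, dim]
        mixed = self.token_mix(h.transpose(1, 2)).transpose(1, 2)
        h = self.norm_t(h + mixed)
        return self.norm_c(h + self.channel_mix(h))

class SharedNet(nn.Module):
    def __init__(self, seq_len, dim, n_layers=2):
        super().__init__()
        self.blocks = nn.ModuleList([Block(seq_len, dim) for _ in range(n_layers)])

    def forward(self, state, injected):
        h = state + injected                        # the only thing that differs per role
        for block in self.blocks:
            h = block(h)
        return h

B, SEQ, D = 2, 97, 64                               # D shrunk for the sketch
net = SharedNet(SEQ, D)
x, y, z = (torch.randn(B, SEQ, D) for _ in range(3))
for _ in range(6):
    z = net(z, y + x)                               # z-update: sees answer + question
y = net(y, z)                                       # y-update: sees scratchpad only
print(y.shape, sum(p.numel() for p in net.parameters()))
```

Note that nothing in SharedNet knows which role it is playing. The caller decides by choosing what to pass as injected.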

Per step, the model runs 3 cycles * 7 net calls * 2 transformer blocks = 42 effective layers. Over 16 steps, that’s 16 * 42 = 672 effective layers of reasoning from a 5M-parameter network.
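
The depth arithmetic is worth having as a tiny helper so you can plug in other settings. The function name and its return layout are mine. The defaults are the values used in this post. The third number is how many of the per-step layers sit in the last cycle, which is the only part that carries gradients.

```python
def effective_depth(n_sup=16, T=3, n=6, layers=2):
    """Return (layers per step, layers with gradients per step, total layers)."""
    calls_per_cycle = n + 1                   # n scratchpad updates + 1 answer update
    per_step = T * calls_per_cycle * layers   # all cycles, with or without gradients
    with_grad = calls_per_cycle * layers      # only the last cycle is differentiated
    return per_step, with_grad, n_sup * per_step

print(effective_depth())  # (42, 14, 672)
```

So with these settings each backward pass only covers 14 of the 42 layers run in a step.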