Emergent Slow Thinking in LLMs as Inverse Tree Freezing

· 2025 · cs.AI · arXiv 2509.23629

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Reinforcement learning with verifiable rewards (RLVR) enables large language models to acquire slow, multi-step reasoning from sparse final-answer signals. We provide a statistical-physics picture of this emergence. We show that an autoregressive model's finite capacity forces it to compress its exponentially large prefix space into a Markov network of predictive states, on which slow thinking unfolds as a random walk -- the Concept Network (CoNet) picture. Within CoNet, RLVR dynamics are governed by two mechanisms: merging of compatible paths and frustrated competition among incompatible ones. Together they drive the network through nucleation, growth, and freezing into multi-input, single-output directed inverse trees. The picture reproduces the training dynamics of a 1.5-billion-parameter LLM and yields three predictions: reasoning chains lengthen as a geometric necessity of sparse topology; SFT induces catastrophic forgetting through bridge-node rupture; and frustration drives policy collapse. Building on the structural timing inherent in inverse-tree freezing, we propose Annealed-RLVR -- a brief SFT intervention at the moment of maximum frustration. It outperforms standard RLVR on both in- and out-of-distribution benchmarks, with the largest gains at high sampling budgets where standard RLVR collapses. The same SFT applied after the trees freeze instead triggers catastrophic forgetting, isolating timing as the active ingredient.

representative citing papers

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.

On Advantage Estimates for Max@K Policy Gradients

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

citing papers explorer

Showing 3 of 3 citing papers.

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation cs.LG · 2026-06-04 · unverdicted · none · ref 41 · internal anchor
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
On Advantage Estimates for Max@K Policy Gradients cs.LG · 2026-06-04 · unverdicted · none · ref 23 · internal anchor
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR cs.LG · 2026-06-02 · unverdicted · none · ref 108 · internal anchor
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

fields

years

verdicts

representative citing papers

citing papers explorer