Real-Time Progress Prediction in Reasoning Language Models

Anders Søgaard; Constanza Fierro; Hans Peter Lyngsøe Raaschou-Jensen

arxiv: 2506.23274 · v4 · pith:CVFYCLPVnew · submitted 2025-06-29 · 💻 cs.LG · cs.AI

keywords: progress, models, reasoning, real-time, ambiguity, labels, language, long

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0 to 100% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.
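
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that progress at a token is the fraction of the finished trace generated so far, that a position baseline guesses progress from token position and a fixed expected length, and that label ambiguity is the standard deviation of the progress values implied by several sampled continuations of the same prefix. All function names and the `expected_length` parameter are hypothetical.

```python
from statistics import mean, pstdev


def progress_labels(num_tokens):
    """Assumed label: fraction of the finished trace completed at each token."""
    return [(t + 1) / num_tokens for t in range(num_tokens)]


def position_baseline(num_tokens, expected_length):
    """Hypothetical baseline: progress from position alone, capped at 1.0."""
    return [min((t + 1) / expected_length, 1.0) for t in range(num_tokens)]


def mean_absolute_error(predicted, target):
    """Mean absolute error between two equal-length progress sequences."""
    return mean(abs(p - y) for p, y in zip(predicted, target))


def label_dispersion(prefix_length, rollout_lengths):
    """Std. dev. of the progress value each sampled continuation implies.

    Each rollout shares the same prefix but may finish at a different total
    length, so the same prefix maps to a different progress label.
    """
    implied = [prefix_length / total for total in rollout_lengths]
    return pstdev(implied)


if __name__ == "__main__":
    labels = progress_labels(4)
    baseline = position_baseline(4, expected_length=8)
    print(mean_absolute_error(baseline, labels))
    print(label_dispersion(50, [100, 200]))
```

Under these assumptions, a prefix whose continuations all end at the same length has zero dispersion, which matches the abstract's link between stable remaining solution length and stable progress labels.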

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hypothesis generation and updating in large language models
cs.LG 2026-05 unverdicted novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.