LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals
7 Pith papers cite this work. Polarity classification is still indexing.
abstract
This work characterizes large language models' chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth. This structure already exists in base models, while reasoning training primarily accelerates convergence toward termination-related subspaces rather than introducing new representational organization. While early reasoning steps follow similar trajectories, correct and incorrect solutions diverge systematically at late stages. This late-stage divergence enables mid-reasoning prediction of final-answer correctness with ROC-AUC up to 0.87. Furthermore, we introduce trajectory-based steering, an inference-time intervention framework that enables reasoning correction and length control based on derived ideal trajectories. Together, these results establish reasoning trajectories as a geometric lens for interpreting, predicting, and controlling LLM reasoning behavior.
years
2026: 7
verdicts
UNVERDICTED: 7
representative citing papers
citing papers explorer
- SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning
  SliceGraph maps process isomers in multi-run CoT reasoning, finding that 85.5% of 954 problem-model cells show correct trajectories splitting into multiple process families with 76.6% of run pairs cross-family on average.
- Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces
  Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
- Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling
  Applying STP at consecutive semantic reasoning steps achieves 168x more accurate multi-step latent prediction on ProcessBench than frozen baselines, with trajectories forming smooth curves best captured by non-linear predictors.
- LoRi: Low-Rank Distillation for Implicit Reasoning
  LoRi distills implicit chain-of-thought by matching low-rank structures in hidden states, raising math-reasoning accuracy toward explicit CoT levels on LLaMA and Qwen models.
- Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning
  Hidden-Align adds an auxiliary loss to align hidden states of correct reasoning paths at the pre-answer token in RLVR, improving pass@1 by 3.8-6.2 points over DAPO on eight math benchmarks for Qwen3 models of 1.7B-14B scale.
- Hypothesis generation and updating in large language models
  LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
- Geometric Signatures of Reasoning: A Spectral Perspective on Task Hardness
  Introduces effective dimension d_ρ from spectral analysis of reasoning trajectories to distinguish task hardness (0.93 AUC on MATH500) and uses kinematic features for early correctness prediction from partial generations.