Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness

2026 · cs.CL · arXiv 2603.22816

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Language models increasingly show their work by writing step-by-step reasoning before answering. But are these steps genuinely used, or is the answer rigid, that is, fixed before reasoning begins? We introduce the Step-Level Reasoning Capacity (SLRC) metric and prove it is a consistent causal estimator (Theorem 1). We propose LC-CoSR, a training method with Lyapunov stability guarantees that directly reduces rigidity. Evaluating 16 frontier models (o4-mini, GPT-5.4, Claude Opus, Grok-4, DeepSeek-R1, Gemini 2.5 Pro, and others) across six domains at N=133-500, we find reasoning falls into three modes. OpenAI's o4-mini shows 73.8-88.3% step necessity on five of six tasks, the highest SLRC in our study. The critical differentiator is RL-based reasoning training, not thinking tokens: Grok-4's reasoning mode shows lower faithfulness than its non-reasoning mode (1.4% vs 7.2% necessity). We discover a faithfulness paradox: high-SLRC models are more susceptible to sycophancy. To account for this, we propose the Reasoning Integrity Score (RIS = SLRC × (1 − Sycophancy)), which significantly predicts error detection (rho=0.66, p=0.026). LC-CoSR achieves 2.6x less negative reward than FARL and CSR baselines without external model dependencies.
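
The Reasoning Integrity Score in the abstract is a simple product, so a small sketch may help make the "faithfulness paradox" concrete. The function below is an illustration only: the name `reasoning_integrity_score`, the assumption that both inputs are fractions in [0, 1], and the sample values are hypothetical and are not taken from the paper.

```python
def reasoning_integrity_score(slrc: float, sycophancy: float) -> float:
    """Compute RIS = SLRC x (1 - Sycophancy), as stated in the abstract.

    Assumption (not confirmed by the source): both inputs are rates
    expressed as fractions in [0, 1], not percentages.
    """
    for name, value in (("slrc", slrc), ("sycophancy", sycophancy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return slrc * (1.0 - sycophancy)


# Hypothetical values, chosen only to illustrate the trade-off:
# a high-SLRC model with high sycophancy can score below a
# lower-SLRC model that rarely defers to the user.
faithful_but_sycophantic = reasoning_integrity_score(0.8, 0.5)
less_faithful_but_robust = reasoning_integrity_score(0.5, 0.1)
```

Under these made-up inputs the second model ends up with the higher RIS even though its SLRC is lower, which is the kind of case the combined score is meant to surface.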

representative citing papers

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

cs.CL · 2026-06-29 · conditional · novelty 8.0

Clinical reasoning graphs reveal that LLMs exhibit diagnostic competence on complex cases but lack consistent schema-scale reasoning patterns across similar cases.

citing papers explorer

Showing 1 of 1 citing paper.

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency
cs.CL · 2026-06-29 · conditional · none · ref 15 · internal anchor
Clinical reasoning graphs reveal that LLMs exhibit diagnostic competence on complex cases but lack consistent schema-scale reasoning patterns across similar cases.
