Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness
1 Pith paper cites this work.
abstract
Language models increasingly show their work by writing step-by-step reasoning before answering. But are these steps genuinely used, or is the answer rigid, that is, fixed before reasoning begins? We introduce the Step-Level Reasoning Capacity (SLRC) metric and prove it is a consistent causal estimator (Theorem 1). We propose LC-CoSR, a training method with Lyapunov stability guarantees that directly reduces rigidity. Evaluating 16 frontier models (o4-mini, GPT-5.4, Claude Opus, Grok-4, DeepSeek-R1, Gemini 2.5 Pro, and others) across six domains at N=133-500, we find reasoning falls into three modes. OpenAI's o4-mini shows 73.8-88.3% step necessity on five of six tasks, the highest SLRC in our study. The critical differentiator is RL-based reasoning training, not thinking tokens: Grok-4's reasoning mode shows lower faithfulness than its non-reasoning mode (1.4% vs 7.2% necessity). We discover a faithfulness paradox (high-SLRC models are more susceptible to sycophancy) and propose the Reasoning Integrity Score (RIS = SLRC × (1 − Sycophancy)), which significantly predicts error detection (rho=0.66, p=0.026). LC-CoSR achieves 2.6x less negative reward than FARL and CSR baselines without external model dependencies.
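The Reasoning Integrity Score formula in the abstract can be illustrated with a minimal sketch. It assumes that SLRC and sycophancy are both expressed as fractions in [0, 1]; the abstract does not state the scale, so that is an assumption. The function name `reasoning_integrity_score` is hypothetical, and the input values are made up for illustration rather than taken from the paper.

```python
def reasoning_integrity_score(slrc: float, sycophancy: float) -> float:
    """Compute RIS = SLRC x (1 - Sycophancy).

    Assumption: both inputs are fractions in [0, 1] (not percentages).
    """
    for name, value in (("slrc", slrc), ("sycophancy", sycophancy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return slrc * (1.0 - sycophancy)


# Hypothetical inputs for illustration only; these are not results from the paper.
faithful_but_sycophantic = reasoning_integrity_score(slrc=0.80, sycophancy=0.50)
less_faithful_but_robust = reasoning_integrity_score(slrc=0.50, sycophancy=0.10)

print(f"high SLRC, high sycophancy: {faithful_but_sycophantic:.2f}")
print(f"lower SLRC, low sycophancy: {less_faithful_but_robust:.2f}")
```

Under these made-up inputs, the model with higher SLRC ends up with the lower RIS, which is the kind of trade-off the faithfulness paradox describes: the product penalizes step-level faithfulness that comes with susceptibility to sycophancy.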
fields
cs.CL 1
years
2026 1
verdicts
CONDITIONAL 1
representative citing papers
Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency
Clinical reasoning graphs reveal that LLMs exhibit diagnostic competence on complex cases but lack consistent schema-scale reasoning patterns across similar cases.