The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
1 Pith paper cites this work.
abstract
Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emph{answer placement} rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about $19\times$ for Qwen~2.5-3B ($N{=}300$, $p{=}0.022$). Conflicting-answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near-zero at 7B across five open-weight model families; wrong-answer following is strong at 3B--7B and attenuates sharply at larger scales. Replications on MATH, within-stable comparisons at 7B, and suffix-free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation-time probes indicate that final answers are rarely early-determined during generation (${<}5\%$ early commitment), yet consumption-time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three-prerequisite protocol (question-only control, format characterization, and an all-position sweep) as a practical minimum for future corruption-based faithfulness studies.
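The two prompt manipulations named in the abstract (removing only the terminal answer line, and pairing correct reasoning with a wrong explicit answer) can be illustrated with a minimal sketch. It assumes the GSM8K convention of a final `#### <answer>` line; the function names `split_chain`, `strip_answer`, and `conflicting_answer` are hypothetical and are not taken from the paper, whose actual preprocessing may differ.

```python
import re

# Assumption: the chain ends with a GSM8K-style terminal line such as "#### 72".
ANSWER_LINE = re.compile(r"^####\s*(.+?)\s*$")


def split_chain(chain: str):
    """Split a chain into (reasoning_lines, final_answer or None)."""
    lines = chain.strip().splitlines()
    if lines:
        match = ANSWER_LINE.match(lines[-1])
        if match:
            return lines[:-1], match.group(1)
    return lines, None


def strip_answer(chain: str) -> str:
    """Drop only the terminal answer line; keep every reasoning line."""
    reasoning, _ = split_chain(chain)
    return "\n".join(reasoning)


def conflicting_answer(chain: str, wrong: str) -> str:
    """Keep the reasoning unchanged but end with a wrong explicit answer."""
    reasoning, _ = split_chain(chain)
    return "\n".join(reasoning + [f"#### {wrong}"])


chain = (
    "Natalia sold 48 / 2 = 24 clips in May.\n"
    "She sold 48 + 24 = 72 clips in total.\n"
    "#### 72"
)
no_answer = strip_answer(chain)
conflict = conflicting_answer(chain, "73")
```

Comparing model accuracy on `chain`, `no_answer`, and `conflict` is one way to separate sensitivity to the explicit answer text from sensitivity to the reasoning itself.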
fields: cs.AI (1)
years: 2026 (1)
verdicts: ACCEPT (1)
representative citing papers
Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters
In-distribution sampling across 25 models and controlled interventions with DAG-verified content show that semantic reasoning and validation content, not token count, drive CoT gains.