The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

2026 · cs.LG · arXiv 2605.10799

1 Pith paper cites this work.

abstract

Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are "computationally important" from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure answer placement rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about $19\times$ for Qwen 2.5-3B ($N{=}300$, $p{=}0.022$). Conflicting-answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near-zero at 7B across five open-weight model families; wrong-answer following is strong at 3B--7B and attenuates sharply at larger scales. Replications on MATH, within-stable comparisons at 7B, and suffix-free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation-time probes indicate that final answers are rarely early-determined during generation (${<}5\%$ early commitment), yet consumption-time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three-prerequisite protocol (question-only control, format characterization, and an all-position sweep) as a practical minimum for future corruption-based faithfulness studies.
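
The prompt manipulations named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the chain is assumed to end with a GSM8K-style `#### <answer>` line, and replacing a line with a placeholder is only one possible corruption operator (the abstract does not specify which one was used).

```python
from __future__ import annotations

import re

# Assumption: the chain ends with a GSM8K-style terminal line "#### <answer>".
ANSWER_LINE = re.compile(r"^####\s*(.+?)\s*$")


def split_chain(chain: str) -> tuple[list[str], str | None]:
    """Split a chain into reasoning lines and the terminal answer, if present."""
    lines = [ln for ln in chain.strip().splitlines() if ln.strip()]
    if lines:
        match = ANSWER_LINE.match(lines[-1])
        if match:
            return lines[:-1], match.group(1)
    return lines, None


def strip_final_answer(chain: str) -> str:
    """Keep all reasoning and drop only the explicit final answer line."""
    steps, _ = split_chain(chain)
    return "\n".join(steps)


def conflicting_answer(chain: str, wrong: str) -> str:
    """Keep the correct reasoning but end with a wrong explicit answer."""
    steps, _ = split_chain(chain)
    return "\n".join(steps + [f"#### {wrong}"])


def all_position_sweep(chain: str, placeholder: str = "[CORRUPTED]") -> list[str]:
    """Return one variant per line, each with exactly that line replaced.

    The placeholder replacement is an illustrative corruption operator only.
    """
    lines = [ln for ln in chain.strip().splitlines() if ln.strip()]
    return [
        "\n".join(lines[:i] + [placeholder] + lines[i + 1:])
        for i in range(len(lines))
    ]


example = "Tom has 3 apples and buys 4 more.\n3 + 4 = 7\n#### 7"
no_answer = strip_final_answer(example)
conflict = conflicting_answer(example, "9")
sweep = all_position_sweep(example)
```

Comparing model accuracy on `no_answer`, `conflict`, and each entry of `sweep` against a question-only baseline is the kind of check the proposed protocol calls for; how the model is queried and scored is left out here.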

representative citing papers

Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters

cs.AI · 2026-06-29 · accept · novelty 7.0

In-distribution sampling across 25 models and controlled interventions with DAG-verified content show that semantic reasoning and validation content, not token count, drive CoT gains.
