The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Pith reviewed 2026-05-19 17:21 UTC · model grok-4.3
The pith
Standard chain-of-thought corruption tests measure the placement of the final answer rather than the importance of reasoning steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When benchmark chains end with an explicit terminal answer line, corruption tests measure answer placement rather than the computational importance of steps. On matched GSM8K examples, removing only the final answer statement reduces suffix sensitivity by a factor of about 19 for Qwen 2.5-3B. Conflicting-answer prompts, which pair correct reasoning with a wrong final answer, drive accuracy to zero or near zero at 7B across five open-weight model families. Generation probes find less than 5 percent early commitment, indicating that the effect is consumption-time following of explicit answer text.
What carries the argument
Suffix sensitivity to the location of explicit answer text in the provided chain, which determines consumption-time output following.
If this is right
- Corruption sensitivity tracks the location of explicit answer text rather than fixed computational depth.
- Wrong-answer following is strong at 3B to 7B scales and attenuates sharply at larger scales.
- Final answers are rarely early-determined during generation, with under 5 percent early commitment.
- Replications on MATH and suffix-free chains confirm the pattern of tracking explicit answer location.
- The proposed three-prerequisite protocol of question-only control, format characterization, and all-position sweep is required for valid faithfulness studies.
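The all-position sweep named in the protocol can be illustrated with a toy sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: `corrupt_step`, `position_sweep`, and `last_number_reader` are hypothetical names, the corruption (adding one to every integer) is an arbitrary choice, and `last_number_reader` is a stand-in that echoes the last number in the chain rather than a real model.

```python
import re
from typing import Callable, List

AnswerFn = Callable[[str, List[str]], str]


def corrupt_step(step: str) -> str:
    """Toy corruption (assumption): add one to every integer in the step."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), step)


def position_sweep(question: str, steps: List[str], gold: str,
                   answer_fn: AnswerFn) -> List[float]:
    """For each position, 1.0 if corrupting only that step flips a correct answer."""
    baseline_correct = answer_fn(question, steps) == gold
    flips = []
    for i in range(len(steps)):
        corrupted = steps[:i] + [corrupt_step(steps[i])] + steps[i + 1:]
        still_correct = answer_fn(question, corrupted) == gold
        flips.append(float(baseline_correct and not still_correct))
    return flips


def last_number_reader(question: str, steps: List[str]) -> str:
    """Hypothetical stand-in for a model: returns the last number in the chain."""
    numbers = re.findall(r"\d+", " ".join(steps))
    return numbers[-1] if numbers else ""


steps = [
    "Tom has 3 bags of 4 apples, so 3 * 4 = 12.",
    "He eats 2, so 12 - 2 = 10.",
    "The answer is 10.",
]
question = "How many apples are left?"

# Sweep with the explicit answer line, then with that line removed.
with_answer = position_sweep(question, steps, "10", last_number_reader)
without_answer = position_sweep(question, steps[:-1], "10", last_number_reader)
print(with_answer, without_answer)
```

With this reader, the only position whose corruption changes the output is the one holding the last explicit number, so the sensitive position moves from the answer line to the final reasoning step once the answer line is dropped. A real study would replace `last_number_reader` with model calls and average over many examples.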
Where Pith is reading between the lines
- Existing claims about reasoning faithfulness from corruption studies on standard benchmarks may need re-testing with format-controlled chains.
- Models appear to treat the explicit final answer as a strong signal for what to output, independent of the preceding reasoning.
- Future benchmarks could avoid this by using suffix-free answer formats or by randomizing answer placement.
- Similar format confounds might affect other evaluation methods that rely on structured output in reasoning tasks.
Load-bearing premise
The observed suffix sensitivity and conflicting-answer following arise primarily from consumption-time format following rather than from any early commitment during generation or from the intrinsic computational structure of the reasoning.
What would settle it
A test showing that corruption sensitivity does not decrease after the final answer line is removed from chains, or that accuracy remains high on conflicting-answer prompts even though an explicit wrong answer is present.
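Both tests depend on editing only the terminal answer line while leaving the reasoning untouched. The sketch below shows one way to build the two chain variants, assuming the GSM8K convention that a solution ends with a `#### <answer>` line. The helper names (`split_chain`, `without_answer_line`, `with_conflicting_answer`) and the simplified example chain are hypothetical and are not taken from the paper.

```python
import re
from typing import List, Optional, Tuple

# GSM8K-style terminal answer line, e.g. "#### 72".
ANSWER_LINE = re.compile(r"^####\s*(.+)$")


def split_chain(chain: str) -> Tuple[List[str], Optional[str]]:
    """Split a solution into (reasoning lines, final answer or None)."""
    lines = [ln for ln in chain.strip().splitlines() if ln.strip()]
    if lines:
        match = ANSWER_LINE.match(lines[-1].strip())
        if match:
            return lines[:-1], match.group(1).strip()
    return lines, None


def without_answer_line(chain: str) -> str:
    """Keep all reasoning but drop the explicit terminal answer line."""
    reasoning, _ = split_chain(chain)
    return "\n".join(reasoning)


def with_conflicting_answer(chain: str, wrong: str) -> str:
    """Keep the correct reasoning but end with a wrong explicit answer."""
    reasoning, gold = split_chain(chain)
    if gold is not None and wrong == gold:
        raise ValueError("conflicting answer must differ from the gold answer")
    return "\n".join(reasoning + [f"#### {wrong}"])


# Simplified example chain (assumption: real GSM8K solutions also carry
# calculator annotations, omitted here).
CHAIN = (
    "Natalia sold 48 / 2 = 24 clips in May.\n"
    "She sold 48 + 24 = 72 clips altogether.\n"
    "#### 72"
)

print(without_answer_line(CHAIN))
print(with_conflicting_answer(CHAIN, "73"))
```

Keeping the reasoning lines byte-identical across variants is the point of the design: any change in model accuracy can then be attributed to the answer line alone.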
Original abstract
Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are "computationally important" from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure answer placement rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about 19× for Qwen 2.5-3B (N=300, p=0.022). Conflicting-answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near-zero at 7B across five open-weight model families; wrong-answer following is strong at 3B–7B and attenuates sharply at larger scales. Replications on MATH, within-stable comparisons at 7B, and suffix-free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation-time probes indicate that final answers are rarely early-determined during generation (<5% early commitment), yet consumption-time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three-prerequisite protocol (question-only control, format characterization, and an all-position sweep) as a practical minimum for future corruption-based faithfulness studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard chain-of-thought corruption studies on benchmarks like GSM8K and MATH primarily measure sensitivity to the explicit placement of the terminal answer line rather than the computational importance of intermediate reasoning steps. Using matched GSM8K examples, the authors show that removing only the final answer statement collapses suffix sensitivity by ~19× (N=300, p=0.022 for Qwen 2.5-3B). Conflicting-answer prompts drive accuracy to zero or near-zero at 7B across model families, with replications on MATH, suffix-free variants, within-stable 7B comparisons, and generation-time probes (<5% early commitment) all indicating that the effect is a consumption-time format-following phenomenon. A three-prerequisite protocol (question-only control, format characterization, all-position sweep) is proposed for future work.
Significance. If the results hold, the work is significant because it identifies a systematic confound that could invalidate or reinterpret many prior CoT faithfulness evaluations. Direct experimental interventions (matched removals, conflicting prompts), statistical support, replications across benchmarks and scales, and generation probes provide concrete, falsifiable evidence distinguishing consumption-time readout effects from intrinsic reasoning structure. This strengthens the case for more controlled evaluation protocols and highlights format sensitivity as a key factor in LLM reasoning studies.
minor comments (3)
- [Abstract and §4] The abstract and main text refer to 'within-stable 7B comparisons' without an explicit definition or reference to the relevant section or table; adding a one-sentence clarification would improve readability.
- [Experimental setup and generation probes] The exact prompt templates for the matched GSM8K examples and the precise criterion used to detect early commitment (<5%) in generation-time probes are not fully detailed; including them (or a pointer to supplementary material) would aid reproducibility.
- [Results on conflicting-answer prompts] A table or figure presenting per-model accuracies for the conflicting-answer condition across the five families would make the scale-dependent attenuation claim easier to evaluate at a glance.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, accurate summary of our claims, and recommendation for minor revision. We appreciate the recognition that the work identifies a potential systematic confound in prior CoT faithfulness evaluations and provides concrete experimental distinctions between consumption-time format effects and intrinsic reasoning structure.
Circularity Check
No significant circularity identified
full rationale
The paper's claims rest on direct experimental interventions (matched removals of final answer lines, conflicting-answer prompts, replications on MATH and suffix-free variants, and generation probes) rather than any derivation, equation, or self-referential definition. No load-bearing steps reduce to fitted inputs, self-citations, or ansatzes by construction; results are self-contained measurements against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models preferentially follow explicit terminal answer statements when present in the input chain.
Reference graph
Works this paper leans on
- [1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [2] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
- [4] T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Durmus, D. Hernandez, N. Joseph, Z. Kernion, A. Askell, B. Jones, S. Bowman, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Jacobson, S. Johnson, J. Kernion, S. Kravec, L. Lovitt, S. Ringer, E. Tran-Johnson, and C. Olah. Measuring faithfulness in chain-of-thought reasoning. arXiv, 2023.
- [7] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [8] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In International Conference on Learning Representations, 2024.
- [9] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
- [11] A. Madaan and A. Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022.
- [12] W. Merrill and A. Sabharwal. The expressive power of transformers with chain of thought. In International Conference on Learning Representations, 2024.
- [13] A. Saparov and H. He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. In International Conference on Learning Representations, 2023.
- [17] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and J. Kaplan. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.
- [18] Generation phase. The model generates a complete chain of thought for each problem. We retain only examples where the self-generated chain produces the correct answer (N_correct = 147; generation accuracy = 0.49).
- [19] Consumption phase. For each correctly solved example, we take the model's own generated steps and apply the same three-condition protocol from Section 7.2:
  - SG-SC: self-generated steps + correct answer line,
  - SG-CC: self-generated steps + conflicting wrong answer,
  - QO: question only (no chain).
  If the consumption objection holds, i.e., the model reasons more...
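The three consumption conditions (SG-SC, SG-CC, QO) can be sketched as a prompt builder. This is a hedged illustration only: the excerpt does not give the paper's prompt templates, so the layout below, the "The answer is ..." wording, and the names `build_conditions` and `accuracy_by_condition` are all assumptions.

```python
from typing import Dict, List


def build_conditions(question: str, steps: List[str],
                     gold: str, wrong: str) -> Dict[str, str]:
    """Build one prompt per condition from a model's own generated steps."""
    if wrong == gold:
        raise ValueError("the conflicting answer must differ from the gold answer")
    body = "\n".join(steps)
    return {
        # Self-generated steps followed by the correct answer line.
        "SG-SC": f"{question}\n{body}\nThe answer is {gold}.",
        # Same steps followed by a conflicting wrong answer line.
        "SG-CC": f"{question}\n{body}\nThe answer is {wrong}.",
        # Question only, no chain.
        "QO": question,
    }


def accuracy_by_condition(correct: Dict[str, List[bool]]) -> Dict[str, float]:
    """Average per-example correctness flags for each condition."""
    return {name: sum(flags) / len(flags) for name, flags in correct.items()}


prompts = build_conditions("What is 2 + 3?", ["2 + 3 = 5."], "5", "6")
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}")
```

Because SG-SC and SG-CC share identical steps, a gap in accuracy between them isolates the effect of the answer line, while QO gives the baseline with no chain at all.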