{"paper":{"title":"The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Standard chain-of-thought corruption tests measure the placement of the final answer rather than the importance of reasoning steps.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.LG","authors_text":"Gabriel Garcia","submitted_at":"2026-05-11T16:26:50Z","abstract_excerpt":"Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \\emph{answer placement} rather than where intermediate computation is carried out.\n  Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about $19\\times$ for Qwen~2.5-3B ($N{=}300$, $p{=}0.022$). 
Conflic"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure answer placement rather than where intermediate computation is carried out.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The observed suffix sensitivity and conflicting-answer following arise primarily from consumption-time format following rather than from any early commitment during generation or from the intrinsic computational structure of the reasoning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning steps.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Standard chain-of-thought corruption tests measure the placement of the final answer rather than the importance of reasoning steps.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f5177fdbb93a58aaf4b913f665f23eeb11cc7c234ebe09f40409f7a863eded0b"},"source":{"id":"2605.10799","kind":"arxiv","version":2},"verdict":{"id":"9712a8bf-d550-416f-955c-bd5749536f3c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T17:21:25.127791Z","strongest_claim":"when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure answer placement rather than where intermediate computation is carried out.","one_line_summary":"Corruption studies of CoT faithfulness largely measure explicit answer placement in prompt format rather than computational importance of reasoning 
steps.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The observed suffix sensitivity and conflicting-answer following arise primarily from consumption-time format following rather than from any early commitment during generation or from the intrinsic computational structure of the reasoning.","pith_extraction_headline":"Standard chain-of-thought corruption tests measure the placement of the final answer rather than the importance of reasoning steps."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.10799/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-19T14:35:30.318464Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T10:31:17.746146Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T08:58:44.427882Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"46d125dd4663f856d463ccac3a3e5c11d3ea68fc034b728587a2006e52984abc"},"references":{"count":19,"sample":[{"doi":"","year":2022,"title":"J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Process","work_id":"3ad43e9f-25ad-40a8-b9b2-5bb9e11a1af0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. InInternational Conference on Learning Rep","work_id":"ea4174ff-88f0-46a2-9f56-24226a7d5ec5","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"M. Turpin, J. Michael, E. Perez, and S. R. Bowman. 
Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing ","work_id":"41af5020-4945-463a-a44b-e0ea640af54a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Measuring Faithfulness in Chain-of-Thought Reasoning","work_id":"86ca07b8-4628-4f51-8938-a82683386ae4","ref_index":4,"cited_arxiv_id":"2307.13702","is_internal_anchor":true},{"doi":"","year":2024,"title":"Let’s think dot by dot: Hidden computation in transformer language models","work_id":"745f12c5-dbd0-4b89-a2aa-e78d08e61bf1","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":19,"snapshot_sha256":"8f3dedccedb04fcb28a9a2444444b8e7053294755c9f607615796fd52c140420","internal_anchors":4},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}