The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

Ming Liu

arXiv: 2605.22870 · v1 · pith:NSSWMX5Qnew · submitted 2026-05-20 · cs.LG · cs.AI · cs.CL


Pith reviewed 2026-05-25 06:25 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: chain-of-thought · arithmetic reasoning · language models · positional shortcut · readout mechanism · faithfulness evaluation · GSM8K


The pith

Small language models achieve most chain-of-thought arithmetic accuracy by copying the final number in the reasoning trace rather than computing it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates the answer-readout stage in 1-3B instruction-tuned models on arithmetic tasks and shows that chain-of-thought prompting works mainly because the correct answer occupies the last position before the delimiter. Gold-answer presence drives 54-92 percentage points of accuracy, or 89-92 percent of each model's teacher-forcing ceiling. On wrong answers the output still matches the last CoT number 95-96 percent of the time. The copy operation overrides retained context: swapping the trailing number collapses performance to near zero even when all prior steps are correct, while removing the number lets the model recover some genuine single-step arithmetic.
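The 95-96 percent match rate on wrong answers is a string-level statistic. A minimal sketch of how such a rate could be computed follows; the record fields (`cot`, `answer`, `gold`), the number regex, and the plain string comparison are illustrative assumptions, not the paper's evaluation code.

```python
import re

# Assumed number pattern: optional sign, digits, optional thousands groups, optional decimals.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def last_number(text):
    """Return the last number in `text` with commas stripped, or None if absent."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def trailing_match_rate(records):
    """Among incorrect items, the fraction whose answer equals the last CoT number.

    Each record is a dict with hypothetical keys:
      "cot"    - the generated reasoning text before the answer delimiter
      "answer" - the model's final answer as a string
      "gold"   - the reference answer as a string
    """
    wrong = [r for r in records if r["answer"] != r["gold"]]
    if not wrong:
        return None  # the rate is undefined when every item is correct
    hits = sum(last_number(r["cot"]) == r["answer"] for r in wrong)
    return hits / len(wrong)
```

Comparing normalized strings treats "24" and "24.0" as different values, so a real evaluation would need an explicit numeric normalization policy.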

Core claim

In three 1-3B instruction-tuned language models, arithmetic chain-of-thought performance is dominated by a positional readout shortcut: the model copies whichever number occupies the trailing position immediately before the answer delimiter, regardless of the logical content of the preceding steps. Gold-answer presence accounts for 54-92 percentage points of accuracy (89-92 percent of the teacher-forcing ceiling), and the final answer matches the last CoT number on 95-96 percent of incorrect items. The copy channel takes precedence over context completion; replacing the trailing number with an incorrect value drives accuracy to near zero despite correct intermediates, yet removing the number recovers 5-32 percentage points above that floor.

What carries the argument

The trailing-number copy channel, which operates in the answer-readout stage and overrides retained-context completion.

If this is right

Replacing the trailing number with a wrong value collapses accuracy to near zero despite correct intermediates.
Removing the trailing number recovers 5-32 percentage points above the floor, including single-step arithmetic the model can otherwise perform.
Qwen and Llama copy novel distractors 87-95 percent of the time; Gemma gates selectively.
The effect replicates on GSM-Symbolic, and head-level ablation identifies architecture-specific head sets.
On non-arithmetic BBH tasks shuffle retention drops sharply, and at 7-8B content-selective gating emerges.
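The replacement and removal interventions listed above can be sketched as plain string edits on a prefix that ends at the answer delimiter. The prompt layout, the helper name `build_prefixes`, and the choice to delete only the digits of the last number are assumptions made for illustration; the paper's exact protocol is not given in this summary. The sketch also assumes the unmodified CoT ends with the gold answer.

```python
import re

# Assumed number pattern: optional sign, digits, optional thousands groups, optional decimals.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")
DELIM = "####"  # answer delimiter named in the Figure 1 caption


def build_prefixes(question, cot, distractor):
    """Return gold / swapped / removed prefixes for one item (hypothetical layout)."""
    matches = list(NUMBER.finditer(cot))
    if not matches:
        raise ValueError("CoT contains no number to intervene on")
    start, end = matches[-1].span()  # character span of the trailing number

    variants = {
        # Unmodified CoT: the trailing number is assumed to be the gold answer.
        "gold": cot,
        # Trailing number replaced by a wrong value; earlier steps untouched.
        "swapped": cot[:start] + str(distractor) + cot[end:],
        # Trailing number deleted; the model must complete from retained context.
        "removed": cot[:start] + cot[end:],
    }
    # Each prefix stops right after the delimiter so the model only emits the answer.
    return {name: f"{question}\n{text}\n{DELIM} " for name, text in variants.items()}
```

Feeding each prefix to the model and scoring the completion against the gold answer would give the three accuracies being compared; under the copy-channel claim, the `swapped` condition should score near zero and the `removed` condition somewhat above it.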

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Step-level faithfulness evaluations may be measuring positional transport rather than genuine computation.
The shortcut could be tested by systematically removing or altering final numbers across a wider range of tasks to measure retained computational ability.
Larger models may reduce reliance on the shortcut once content-selective gating appears, suggesting a size-dependent transition in readout strategy.

Load-bearing premise

The prefix-completion technique cleanly isolates the readout stage without altering the model's prior internal computation or context retention.

What would settle it

An experiment in which the trailing number is replaced by a distractor while all prior reasoning steps remain correct, followed by measurement of whether accuracy stays high or drops to near zero.

Figures

Figures reproduced from arXiv: 2605.22870 by Ming Liu.

**Figure 1.** The answer-context-gated positional readout: the model reads whichever number appears in answer-relevant context at the trailing position before the #### delimiter. (a) When the correct answer is last, the readout yields the right output. (b) In Qwen/Llama (1–3B), injecting a wrong number in answer context displaces gold and is copied 87–95% of the time; Gemma instead shows stronger content gating (P(dist…

**Figure 2.** Answer-position curve (5-position sweep, …

**Figure 3.** Shuffle hierarchy with bootstrap 95% CIs

Original abstract

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
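The shuffle manipulation mentioned in the abstract can be illustrated with a small sketch. The paper's shuffle protocol is not specified in this text, so the line-based step split, the `keep_last` switch, and the function name are all assumptions. The point of the switch is that a shuffle which leaves the final step in place also leaves the trailing number in place, which is the case where a positional readout would be expected to survive.

```python
import random


def shuffle_steps(cot, seed=0, keep_last=True):
    """Shuffle the non-empty lines of a CoT; optionally pin the final step.

    Hypothetical helper: one line is treated as one reasoning step.
    """
    steps = [s for s in cot.split("\n") if s.strip()]
    if keep_last:
        head, tail = steps[:-1], steps[-1:]  # final step stays in trailing position
    else:
        head, tail = steps, []  # any step may end up last
    rng = random.Random(seed)  # seeded for reproducible permutations
    rng.shuffle(head)
    return "\n".join(head + tail)
```

With `keep_last=False`, a different number can land in the trailing position, so comparing the two settings would separate sensitivity to step order from sensitivity to what sits immediately before the delimiter.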

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Small models mostly copy the last CoT number as the answer on arithmetic, with clean ablations showing the effect, though prefix completion leaves open whether it alters the original computation.


The core finding is that 1-3B models on GSM8K arithmetic rely on a positional copy shortcut: they output whatever number sits right before the answer delimiter, which explains most of their CoT accuracy. Gold-answer presence lifts performance by 54-92 points, and even wrong answers match the trailing number 95-96% of the time. Swapping that number tanks accuracy while correct intermediates stay in place, and the pattern holds on GSM-Symbolic with architecture-specific heads involved. At 7-8B the behavior shifts toward more selective gating, and non-arithmetic tasks lose the shuffle robustness seen in arithmetic. This gives a specific mechanism that prior CoT work had not pinned down so directly.

The replacement and removal experiments are straightforward and show the copy channel overriding retained context, which is the paper's clearest contribution. The head ablations add some mechanistic detail without overclaiming.

The main soft spot is the prefix-completion method itself. Conditioning on a full CoT prefix plus a possibly altered final number can change attention over earlier tokens, so the observed copying might reflect a modified internal state rather than the model's native readout on an untouched context. The stress-test concern lands because the paper's strongest claims rest on that isolation step, and the reported numbers are consistent with copying but do not rule out the confound. No error bars or full statistical tests appear in the abstract, though the effect sizes are large enough that modest variance would not erase them.

This paper is for researchers working on CoT faithfulness, benchmark design, and oversight methods. It deserves a serious referee to verify the methods and test how far the shortcut generalizes beyond the three models tested.

Referee Report

2 major / 2 minor

Summary. The paper claims that in 1-3B instruction-tuned LMs on GSM8K, CoT primarily enables a positional readout shortcut: the model copies the trailing number before the answer delimiter rather than performing step-by-step reasoning. Using prefix completion to isolate readout, gold-answer presence accounts for 54-92 pp accuracy gains (89-92% of teacher-forcing ceiling); final answers match the last CoT number 95-96% of the time even on errors. Trailing-number replacement collapses accuracy to near zero while removal recovers 5-32 pp; novel distractors are copied 87-95% of the time by Qwen/Llama. Head ablations implicate architecture-specific heads; the effect replicates on GSM-Symbolic. On BBH tasks shuffle retention drops, and at 7-8B content-selective gating appears. The work warns that step-level faithfulness evaluations may conflate positional transport with computation.

Significance. If the central empirical measurements hold, the result is significant for understanding CoT mechanisms in small models and for the reliability of CoT-based oversight techniques. The direct evidence from replacement experiments, high match rates on errors, and replication on GSM-Symbolic are strengths; the head-level ablation and scaling observations to 7-8B add useful granularity. The finding that a copy channel can suppress even single-step arithmetic the model can otherwise perform is a clear, falsifiable observation with implications for interpretability work.

major comments (2)

[prefix-completion experiments] Prefix-completion experiments (abstract and methods): the central claim that gold-answer presence drives 54-92 pp via positional copying of the trailing number depends on the technique cleanly isolating the readout stage without altering prior internal states. In transformers, conditioning on a prefix containing the full CoT plus (possibly altered) final number can modify attention patterns over earlier tokens, so the observed copying may reflect changed computation rather than native readout on unaltered context. The replacement and match-rate results are consistent with copying but do not rule out this confound.
[results on accuracy deltas] Abstract and results sections: the reported accuracy deltas (54-92 pp) and match rates (95-96%) are presented without error bars, full dataset splits, or statistical tests. Given the noted possibility of post-hoc item/model selection, it is difficult to assess whether the effect sizes are robust or whether the 89-92% of teacher-forcing ceiling claim generalizes.
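One way to supply the error bars requested in the second major comment is a paired percentile bootstrap over items. The sketch below assumes per-item 0/1 correctness lists for two conditions (hypothetical names `with_gold` and `without_gold`); it is not the authors' analysis, and the resample count and seed are arbitrary choices.

```python
import random


def bootstrap_delta_ci(with_gold, without_gold, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a paired accuracy difference, in percentage points.

    Both inputs are equal-length lists of 0/1 correctness flags for the same items.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(with_gold) != len(without_gold) or not with_gold:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(with_gold)
    diffs = [a - b for a, b in zip(with_gold, without_gold)]  # per-item paired difference
    point = 100.0 * sum(diffs) / n

    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(100.0 * sum(sample) / n)
    stats.sort()

    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

Resampling the paired differences rather than each condition separately keeps item difficulty matched across conditions, which matters when the same test items are scored under both prompts.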

minor comments (2)

[replication paragraph] The abstract states the effect replicates on GSM-Symbolic but does not specify whether the same prefix-completion protocol and replacement controls were applied identically.
[abstract] Notation for 'teacher-forcing ceiling' is used without an explicit definition or equation in the provided abstract; a short methods paragraph would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the empirical measurements. We address each major comment below, proposing revisions where the concerns are valid.


Referee: [prefix-completion experiments] Prefix-completion experiments (abstract and methods): the central claim that gold-answer presence drives 54-92 pp via positional copying of the trailing number depends on the technique cleanly isolating the readout stage without altering prior internal states. In transformers, conditioning on a prefix containing the full CoT plus (possibly altered) final number can modify attention patterns over earlier tokens, so the observed copying may reflect changed computation rather than native readout on unaltered context. The replacement and match-rate results are consistent with copying but do not rule out this confound.

Authors: We agree that prefix completion could in principle alter attention patterns over earlier tokens. The replacement experiments (which modify only the trailing number while keeping the prefix otherwise fixed) and the 95-96% match rates observed during standard (non-prefix) generation provide convergent evidence that the effect is readout-driven, but these do not fully eliminate the possibility of a confound in the prefix-completion setting itself. We will add an explicit limitations paragraph discussing this architectural consideration and its implications for interpreting the isolation of the readout stage.
Revision: partial

Referee: [results on accuracy deltas] Abstract and results sections: the reported accuracy deltas (54-92 pp) and match rates (95-96%) are presented without error bars, full dataset splits, or statistical tests. Given the noted possibility of post-hoc item/model selection, it is difficult to assess whether the effect sizes are robust or whether the 89-92% of teacher-forcing ceiling claim generalizes.

Authors: We acknowledge that the current manuscript lacks error bars, explicit dataset-split details, and statistical tests, which limits assessment of robustness. The models were the primary publicly available 1-3B instruction-tuned checkpoints at the time of the study, and all experiments used the full GSM8K test set; however, we did not pre-register item or model selection criteria. We will revise the abstract and results to report bootstrap confidence intervals, state the exact splits and model selection process, and add a brief discussion of generalizability.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on model behavior


The paper reports experimental results from prefix-completion interventions, accuracy deltas, and match rates on GSM8K and other benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on direct observations (e.g., accuracy drops when trailing number is replaced) rather than any self-referential construction. The prefix-completion technique is an experimental method, not a definitional or fitted step that forces the outcome by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that prefix completion isolates readout without side effects on retained context and that the three tested models are representative of the 1-3B instruction-tuned class.

axioms (1)

Domain assumption: Prefix completion isolates the answer-readout stage without changing the model's internal state or prior computation.
Invoked when the paper uses prefix completion to measure the contribution of the trailing number.



