pith. machine review for the scientific record.

arxiv: 2605.00817 · v1 · submitted 2026-05-01 · 💻 cs.CL

Recognition: unknown

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 18:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · procedural execution · instruction following · reasoning benchmarks · diagnostic evaluation · arithmetic algorithms · error analysis

The pith

LLMs lose accuracy on following multi-step arithmetic procedures as length grows from 5 to 95 steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark that gives language models a step-by-step arithmetic algorithm plus two input numbers and asks for the final result. It varies the number of steps and the number of look-back references to earlier variables to increase difficulty while keeping the operations simple. Across 14 models and 55 datasets, first-answer accuracy falls from 61 percent on the shortest procedures to 20 percent on the longest ones. The authors catalog specific failure modes such as skipping the answer, stopping early, or inserting unrequested operations. These patterns indicate that models can produce correct answers on short procedures without reliably executing the full instructed sequence.

Core claim

The central claim is that when models receive explicit step-wise arithmetic procedures, their first-answer accuracy averages 61 percent on 5-step versions but falls to 20 percent on 95-step versions across 14 models and 55 datasets. Generation analysis reveals recurring errors including missing final answers, premature termination, self-correction after mistakes, under-executed step traces, and hallucinated extra operations.

What carries the argument

A controlled diagnostic benchmark that supplies a step-wise arithmetic algorithm defined over intermediate variables together with two numeric inputs and requires the model to return only the final computed value.
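
A minimal sketch of how an instance of this kind could be generated and ground-truthed, assuming a synthetic generator of roughly this shape; the operation set, variable naming, and prompt wording below are illustrative assumptions, not the paper's published code.

```python
import random

# Minimal sketch, assuming a generator of this shape; the paper's actual
# operation set, variable naming, and prompt wording may differ.
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b if b != 0 else a,  # crude guard against division by zero
}

def make_procedure(n_steps=5, max_lookback=1, seed=0):
    """Build a step-wise procedure: step i defines a new variable from two
    earlier ones, reaching back at most `max_lookback` variables."""
    rng = random.Random(seed)
    steps = []
    for i in range(n_steps):
        lo = max(0, i + 2 - max_lookback)            # v0 and v1 are the two inputs
        a, b = rng.randrange(lo, i + 2), rng.randrange(lo, i + 2)
        steps.append((f"v{i + 2}", rng.choice(list(OPS)), f"v{a}", f"v{b}"))
    return steps

def execute(steps, x, y):
    """Ground-truth executor: run every step and return the final value."""
    env = {"v0": float(x), "v1": float(y)}
    for out, op, a, b in steps:
        env[out] = OPS[op](env[a], env[b])
    return env[out]

def render_prompt(steps, x, y):
    lines = [f"You are given v0 = {x} and v1 = {y}.",
             "Execute the following steps in order:"]
    lines += [f"Step {i + 1}: {out} = {op}({a}, {b})"
              for i, (out, op, a, b) in enumerate(steps)]
    lines.append("Return only the final computed value.")
    return "\n".join(lines)

steps = make_procedure(n_steps=5, max_lookback=2, seed=42)
print(render_prompt(steps, 3, 7))
print("expected:", execute(steps, 3, 7))
```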

If this is right

  • Models that succeed on short procedures frequently fail on longer ones due to incomplete step execution.
  • Apparent reasoning performance on benchmarks can hide substantial shortfalls in faithful instruction following.
  • Common error types include early termination, under-execution of traces, and insertion of unprompted steps.
  • Final-answer accuracy alone is insufficient to certify that a model has carried out the full procedure; a first-answer scoring sketch follows this list.
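
A scoring sketch for that last point, assuming a simple regex-based extractor rather than whatever parser the paper actually uses; it computes both first-answer accuracy and coverage (non-null answer rate), the two quantities the review leans on.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def first_answer(generation: str):
    """Return the first number that appears in a model's output, or None."""
    m = NUMBER.search(generation)
    return float(m.group()) if m else None

def score(generations, targets, tol=1e-6):
    """First-answer accuracy and coverage (non-null answer rate) over a batch."""
    hits = answered = 0
    for gen, target in zip(generations, targets):
        ans = first_answer(gen)
        if ans is None:
            continue                                  # missing answer
        answered += 1
        if abs(ans - target) <= tol * max(1.0, abs(target)):
            hits += 1
    n = len(targets)
    return {"first_answer_accuracy": hits / n, "coverage": answered / n}

# Example: one correct answer, one wrong answer, one missing answer.
gens = ["The final value is 42.", "The answer is 17.", "I cannot finish the steps."]
print(score(gens, targets=[42.0, 21.0, 5.0]))         # accuracy 1/3, coverage 2/3
```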

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training that rewards only final answers may discourage faithful step-by-step adherence.
  • The same benchmark pattern could be applied to non-arithmetic domains such as code generation or logical deduction sequences.
  • Adding explicit verification prompts at each step might reduce the observed accuracy gap on long procedures.

Load-bearing premise

The measured accuracy decline is produced by breakdowns in following the instructed procedure rather than by context-window limits, arithmetic skill gaps, or prompt formatting differences.

What would settle it

If models maintain high accuracy on the 95-step procedures once every intermediate result is explicitly requested and verified before the next step, the drop would be shown to stem from something other than procedural execution failure.
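
A hedged sketch of how that settling experiment could be run, reusing the OPS table and first_answer helper from the sketches above. query_model is a hypothetical stand-in for whatever inference interface is available; nothing here reflects the paper's actual protocol.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; swap in the inference API at hand."""
    raise NotImplementedError

def verified_run(steps, x, y, tol=1e-6, max_retries=1):
    """Request one step at a time and verify each intermediate value against the
    ground-truth executor before continuing. Returns (final_value, corrections),
    or (None, corrections) if some step never verifies."""
    env = {"v0": float(x), "v1": float(y)}
    corrections = 0
    for i, (out, op, a, b) in enumerate(steps):
        truth = OPS[op](env[a], env[b])               # ground truth for this step
        prompt = (f"Known values: {env}. Compute step {i + 1}: {out} = {op}({a}, {b}). "
                  f"Reply with the numeric value of {out} only.")
        for _ in range(max_retries + 1):
            ans = first_answer(query_model(prompt))
            if ans is not None and abs(ans - truth) <= tol * max(1.0, abs(truth)):
                break
            corrections += 1
            prompt += " Your previous value was wrong; recompute this single step."
        else:
            return None, corrections                  # step never verified
        env[out] = ans
    return env[out], corrections
```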

Figures

Figures reproduced from arXiv: 2605.00817 by Abhishek Upperwal, Mayank Singh, Pritam Kadasi, Sailesh Panda.

Figure 1: Accuracy of various language models as a function of algorithmic step count (5–95). …
Figure 2: Accuracy (%) of language models under varying look-back dependencies (1–7). …
Figure 3: Accuracy and execution behavior across increasing algorithm lengths. …
Figure 4: Median expected output across steps for integer and floating-point inputs, separated by correct and …
Figure 5: Median expected output across steps for different input ranges ([0,1], [1,10], [10,100]). …
Figure 6: Median expected output across steps for different task types (addition, subtraction, multiplication, division, …
Figure 7: Median expected output across steps for small …
Figure 8: Median expected output across steps for mid-range models (14B, 30B). …
Figure 9: Median expected output across steps for larger models …
Figure 10: Accuracy across input ranges as a function of algorithm length. All ranges show a consistent decline …
Figure 11: Accuracy across input data types (integer vs. floating-point) as a function of algorithm length. …
Figure 12: Accuracy heatmap across models and input ranges. …
Figure 13: Accuracy heatmap across models and task types. Addition and subtraction tasks generally yield higher …
Figure 14: Coverage (non-null answer rate) across increasing step counts. …
Figure 15: Distribution of the normalized position of the first generated answer across models. …
Figures 16–113: per-model appendix panels repeating the same templates: accuracy versus algorithm length by input data type (integer vs. floating-point), by input range, and by task type; accuracy and prediction-comparison types across increasing algorithm lengths; and median expected output across steps by data type, input range, and task type. Per-model addition accuracy in these panels ranges from roughly 10.8% to 99.9%.
All captions are truncated in this rendering; the full figures are available at the arXiv source.
Original abstract

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a diagnostic benchmark for procedural execution in LLMs consisting of step-wise arithmetic algorithms of increasing length (5 to 95 steps) with look-back dependencies over intermediate variables. Using two numeric inputs per procedure, it evaluates 14 models across 55 datasets and reports that first-answer accuracy falls from 61% (5 steps) to 20% (95 steps). Generation analysis identifies recurring failure modes including missing answers, premature answers, self-correction after initial errors, under-executed traces, and hallucinated extra steps, arguing that final-answer accuracy on reasoning tasks can conceal substantial deficits in faithful instruction following.

Significance. If the accuracy decline can be attributed specifically to breakdowns in step-by-step fidelity, the work supplies a scalable diagnostic that separates procedural execution from final-answer correctness. The evaluation scale (14 models, 55 datasets) and explicit failure taxonomy offer concrete, falsifiable observations that could guide improvements in instruction-following reliability.

major comments (1)
  1. [Abstract] The central claim attributes the accuracy drop from 61% to 20% to failures of procedural fidelity (missing steps, premature termination, hallucinated operations). However, the description provides no evidence that prompt length, total token count, or arithmetic-operation count were held constant while varying only the number of steps and look-back dependencies. Without such controls, the observed trend is consistent with general context-length degradation or operation-count scaling and does not isolate the intended phenomenon.
minor comments (1)
  1. The abstract refers to '55 datasets' and 'simple arithmetic operations' but does not indicate whether the procedures are generated programmatically or manually curated; a brief methods paragraph clarifying construction would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our diagnostic study of procedural execution in LLMs. The primary concern is the lack of explicit controls for prompt length, token count, and operation count in the abstract. We address this point directly below and outline revisions to improve clarity.

Point-by-point responses
  1. Referee: [Abstract] The central claim attributes the accuracy drop from 61% to 20% to failures of procedural fidelity (missing steps, premature termination, hallucinated operations). However, the description provides no evidence that prompt length, total token count, or arithmetic-operation count were held constant while varying only the number of steps and look-back dependencies. Without such controls, the observed trend is consistent with general context-length degradation or operation-count scaling and does not isolate the intended phenomenon.

    Authors: We agree that the abstract does not provide evidence of such controls. The benchmark design intentionally scales procedural complexity by increasing the number of steps (from 5 to 95) and introducing look-back dependencies on intermediate variables, which necessarily lengthens the prompt as each step adds fixed descriptive text and variable references. We did not hold total prompt length or token count constant across conditions (e.g., via padding or alternative constructions), nor did we isolate arithmetic-operation count independently of step count. The generation-level analysis, however, identifies failure modes—such as skipped steps, premature termination, and hallucinated operations—that are tied to the procedural structure and dependencies rather than generic length effects. To address the referee's valid point, we will revise the abstract to note the design trade-offs and add a dedicated paragraph in the Methods or Limitations section discussing potential confounds with context length and operation scaling, while retaining the core claim that the observed failure taxonomy suggests deficits in faithful instruction following beyond final-answer accuracy. revision: yes
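
One low-cost control for the length confound discussed in this exchange, offered as an editorial sketch rather than anything the authors describe: pad shorter procedures with inert filler so that prompts at every step count are compared at matched token lengths. If accuracy still tracks the number of substantive steps under this control, generic context-length degradation becomes a less plausible explanation. The count_tokens hook below is hypothetical.

```python
def pad_to_length(prompt: str, target_tokens: int, count_tokens) -> str:
    """Append inert filler lines until the prompt reaches roughly `target_tokens`,
    so procedures with different step counts are compared at matched prompt
    lengths. `count_tokens` is a hypothetical tokenizer hook, not a real API."""
    filler = "\nNote: this line is padding and contains no instruction."
    while count_tokens(prompt) < target_tokens:
        prompt += filler
    return prompt

# Usage sketch (hypothetical): pad the 5-step prompt up to the 95-step prompt's
# token length, then compare accuracy at matched lengths.
# short = pad_to_length(render_prompt(short_steps, 3, 7), count_tokens(long_prompt), count_tokens)
```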

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study with direct accuracy reporting

full rationale

The paper constructs a diagnostic benchmark consisting of step-wise arithmetic procedures of increasing length and measures first-answer accuracy across models and datasets. All reported results (e.g., 61% to 20% accuracy drop) are direct empirical observations from model generations, not derived quantities, fitted parameters renamed as predictions, or quantities obtained via self-referential definitions. No equations, uniqueness theorems, ansatzes, or load-bearing self-citations appear in the core claims; the study contains no derivation chain that reduces to its inputs by construction. The central measurements are therefore self-contained and falsifiable by replication on the same benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical diagnostic study; no free parameters, axioms, or invented entities are introduced beyond standard practices of prompting LLMs and measuring output accuracy.

pith-pipeline@v0.9.0 · 5454 in / 1101 out tokens · 39056 ms · 2026-05-09T18:48:26.279158+00:00 · methodology

