Arithmetic Pedagogy for Language Models

Andhika Bernard Lumbantobing; Hokky Situngkir

arXiv: 2606.05106 · v1 · pith:UF7E2TXAnew · submitted 2026-06-03 · cs.CL · cs.AI · cs.CY

Pith reviewed 2026-06-28 06:34 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY

keywords arithmetic reasoning · chain-of-thought supervision · language model training · pedagogy · small-scale models · next-token prediction · procedural learning

The pith

Serializing an Indonesian left-to-right arithmetic method into chain-of-thought text trains an 86-million-parameter model to exceed 80 percent accuracy on held-out problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether human mathematics teaching techniques can direct language model training for arithmetic skills. It converts the GASING procedure's step-by-step calculations into natural-language chain-of-thought examples and uses these to train a small decoder-only model from scratch with ordinary next-token prediction. The model passes through clear learning phases and ends up performing at over 80 percent accuracy on new problems while staying competitive with far larger systems. A reader would care because the result points to a route for building reliable arithmetic ability in compact models without extra optimization stages or large parameter counts.

Core claim

Operationalizing each arithmetic operation via the GASING left-to-right procedure and serializing its execution trace into natural-language chain-of-thought supervision allows a small GPT-2 model trained only on next-token prediction to internalize both a procedural pathway and an associative mental-arithmetic capacity, reaching over 80 percent accuracy on held-out problems and competitive performance against substantially larger models.

What carries the argument

The GASING method, a left-to-right arithmetic procedure whose execution trace is serialized into natural-language chain-of-thought supervision for next-token training.
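As a rough illustration of what "serializing an execution trace" can mean, the sketch below adds two integers from the most significant digit to the least and writes each step out as text. It is not the GASING procedure itself, whose exact steps are not given on this page; the function names `left_to_right_add` and `serialize_trace`, the English step wording, and the restriction to addition are all assumptions made for the example (the paper's supervision is in Indonesian).

```python
def left_to_right_add(a: int, b: int) -> tuple[int, list[str]]:
    """Add two non-negative integers, most significant digit first.

    Hypothetical stand-in for a left-to-right procedure, not the paper's method.
    """
    xs, ys = str(a), str(b)
    width = max(len(xs), len(ys))
    xs, ys = xs.zfill(width), ys.zfill(width)

    result = [0]  # leading slot absorbs a possible final carry
    steps = []
    for pos, (dx, dy) in enumerate(zip(xs, ys)):
        s = int(dx) + int(dy)
        place = 10 ** (width - 1 - pos)
        result.append(s % 10)
        step = f"{dx} + {dy} = {s} in the {place}s place"
        if s >= 10:
            # Push the carry back into digits that were already written.
            i = len(result) - 2
            while True:
                result[i] += 1
                if result[i] < 10:
                    break
                result[i] = 0
                i -= 1
            step += ", carry 1 to the left"
        running = int("".join(map(str, result)))
        steps.append(f"{step}; digits so far {running}")
    return int("".join(map(str, result))), steps


def serialize_trace(a: int, b: int) -> str:
    """Flatten the step list into one chain-of-thought training string."""
    total, steps = left_to_right_add(a, b)
    body = " ".join(f"Step {i}: {s}." for i, s in enumerate(steps, start=1))
    return f"Question: {a} + {b} = ? {body} Answer: {total}"


print(serialize_trace(478, 256))
```

The point of the ordering is that the leading digits of the answer are produced first, which is the same direction in which a causal decoder emits tokens; a carry then revises digits that have already been written.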

If this is right

Targeted pedagogical data alone suffices to produce strong arithmetic performance at small scale without reinforcement learning or reward models.
The model develops an associative retrieval capacity that bypasses explicit step-by-step computation after initial procedural learning.
Attention-masking and probing experiments confirm that chain-of-thought information shapes the model's internal computation graph.
Economical arithmetic capability becomes attainable in models under 100 million parameters when training data follows human procedural order.
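An attention-masking intervention of the kind mentioned above can be pictured as follows. This is a generic sketch, not the paper's protocol: the token layout (prompt, chain-of-thought span, answer span) and the helper names `causal_mask` and `block_span` are assumptions for illustration. The idea is to start from the usual causal mask and additionally forbid answer positions from attending to the chain-of-thought positions, then observe whether accuracy drops.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]


def block_span(mask, query_positions, key_positions):
    """Return a copy of the mask with attention from queries to keys disabled."""
    blocked = [row[:] for row in mask]
    for q in query_positions:
        for k in key_positions:
            blocked[q][k] = False
    return blocked


# Hypothetical layout: tokens 0-2 prompt, 3-5 chain of thought, 6-7 answer.
cot_span = range(3, 6)
answer_span = range(6, 8)

base = causal_mask(8)
ablated = block_span(base, answer_span, cot_span)
```

If a model still answers correctly under the ablated mask, it is not reading the intermediate steps at answer time, which is the kind of evidence an associative-retrieval claim would need.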

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same serialization approach could be tested on other ordered procedural domains such as basic logic or simple coding tasks where step order matches generation order.
The shift from procedural to associative processing may generalize to other small-model training regimes that begin with explicit supervision and later allow direct retrieval.
If the alignment between teaching procedure and model generation order proves decisive, curricula for additional skills could be redesigned to match that order rather than human reading order.

Load-bearing premise

The GASING left-to-right procedure aligns with the causal order of token generation, so its execution trace can be serialized into chain-of-thought text that transfers effectively during training.

What would settle it

Retraining the identical 86M model on the same arithmetic problems but without the GASING-derived chain-of-thought supervision and measuring accuracy below 50 percent on the same held-out set.
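The test described above reduces to comparing two exact-match accuracies on the same held-out set. A minimal sketch, assuming answers are compared as whitespace-trimmed strings and using the 50 percent threshold stated above; the function names are hypothetical and no model outputs are included here.

```python
def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that equal their target after trimming whitespace."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)


def cot_is_load_bearing(acc_with_cot: float, acc_without_cot: float,
                        threshold: float = 0.5) -> bool:
    """True when the ablated model falls below the threshold and the CoT model does not."""
    return acc_without_cot < threshold <= acc_with_cot
```

Both models would need to be scored on identical problems for the comparison to mean anything.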

Figures

Figures reproduced from arXiv: 2606.05106 by Andhika Bernard Lumbantobing, Hokky Situngkir.

**Figure 2.** (a) Plot of the information contrast value …

**Figure 3.** (a) Heatmap of the prediction accuracy of the correct digit by the classifier based on the model’s residual stream at layers 3, 6, 9, and 12 for the analyzed training checkpoints. High values indicate that the internal representation of the correct digit becomes increasingly separated from the other tokens as a candidate output of the model’s inference. (b) Heatmap of the logit margin of the correct digit …

**Figure 4.** Plot of the computation accuracy value for each type of arithmetic operation over the course of …

**Figure 5.** Benchmarking of basic arithmetic capability against various other large language models (LLMs).

**Figure 6.** Plot of the number of parameters vs. computation accuracy value across various Transformer-based language models. The trained model attains competitive performance despite a far more limited number of parameters.

Original abstract

We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method -- an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation -- we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses -- attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection -- show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GASING-derived CoT trains an 86M model above 80% on held-out arithmetic with observed phases and probes, but lacks controls to credit the specific pedagogy.

The main result here is that serializing the GASING left-to-right arithmetic procedure into natural-language CoT lets a small GPT-2 (86M params, TOBA tokenizer) trained from scratch with next-token prediction hit over 80% on held-out problems. They also document three learning phases and run attention-masking, residual probing, and logit-lens checks to argue the model shifts from procedural execution to associative retrieval.

What the paper does well is take an existing human pedagogy whose execution order matches causal token generation and turn it into usable supervision. The mechanistic analyses go beyond accuracy numbers and give some evidence on how the internal computation changes. Training only on next-token prediction without RL or rewards keeps the setup clean.

The soft spot is the missing ablation against non-GASING CoT or direct-answer targets on the same problem distribution. If generic step-by-step traces produce similar accuracy, the claim that the specific pedagogical alignment is load-bearing does not hold. The competitive performance statement also needs explicit larger-model baselines and exact task definitions to be evaluated. The Indonesian-specific tokenizer is another variable that could limit how far the result generalizes.

This is for researchers working on narrow capability injection into small models or on how supervision shapes representations. It deserves a serious referee because the empirical setup is concrete and the analysis is present, even if the causal attribution to GASING needs tightening with controls.

Referee Report

2 major / 2 minor

Summary. The paper claims that serializing the GASING Indonesian left-to-right arithmetic pedagogy into natural-language CoT traces, then training an 86M-parameter GPT-2 decoder from scratch on next-token prediction, produces >80% accuracy on held-out arithmetic problems. It reports three distinct training phases and uses attention-masking interventions, residual-stream probing, and logit-lens analysis to argue that the model first learns a procedural pathway before shifting to associative retrieval of intermediate results. The work concludes that pedagogically grounded supervision yields strong, economical arithmetic capability at small scale without RL or reward modeling.

Significance. If the central result holds, the work shows that targeted CoT supervision derived from established human pedagogy can produce competitive arithmetic performance in an 86M model, outperforming expectations for its size and providing mechanistic evidence of a procedural-to-associative transition. The direct training runs on held-out data and the suite of intervention-based analyses constitute concrete strengths.
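Two of the quantities named in the report, the logit-lens readout and the logit margin of the correct digit, can be written down in a few lines. The sketch below is a simplified illustration and not the paper's code: it projects an intermediate residual-stream vector directly through an unembedding matrix and omits the final layer normalization that a real logit-lens analysis would normally apply first. The toy matrix and vector are made up for the example.

```python
def logit_lens(residual: list[float], unembed: list[list[float]]) -> list[float]:
    """Project a residual vector (length d) through an unembedding matrix (vocab x d)."""
    return [sum(w * h for w, h in zip(row, residual)) for row in unembed]


def logit_margin(logits: list[float], correct: int) -> float:
    """Correct token's logit minus the largest competing logit."""
    best_other = max(v for i, v in enumerate(logits) if i != correct)
    return logits[correct] - best_other


# Toy example: vocabulary of 3 tokens, hidden size 2 (values are illustrative only).
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
residual = [2.0, 0.5]

logits = logit_lens(residual, unembed)
margin_if_token_2 = logit_margin(logits, correct=2)
```

A positive margin at an early layer means the correct digit is already the top candidate there; tracking the margin across layers and checkpoints is one way to date when a representation forms.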

major comments (2)

[Experimental setup and results] The manuscript reports no ablation comparing GASING-derived CoT to standard left-to-right CoT or to direct-answer targets on the same arithmetic distribution. Without this control, the claim that the specific pedagogical alignment (rather than generic step-by-step supervision) drives the >80% held-out accuracy cannot be isolated; this directly affects the load-bearing interpretation of the headline result.
[Methods and evaluation] Details on data splits, problem generation procedure, exact baselines for the 'competitive performance against substantially larger models' claim, and error analysis by operation or difficulty are not provided. These omissions prevent verification that the reported accuracy figures fully support the phase-transition and mechanistic conclusions.

minor comments (2)

Clarify the precise tokenization details of the syllabic-agglutinative TOBA tokenizer and how it interacts with the serialized CoT traces.
Add explicit statements of the number of training examples, training steps per phase, and statistical significance of the accuracy figures.
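For the statistical-significance request, one common choice for an accuracy figure is a Wilson score interval on the proportion of correct answers. The sketch below assumes independent test problems and a 95 percent level (`z = 1.96`); the counts used in the example call are hypothetical, since the page does not state the test-set size.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion correct / total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts: 800 correct out of 1000 held-out problems.
low, high = wilson_interval(800, 1000)
```

With a small test set the interval widens quickly, so a headline figure such as "over 80%" is only informative alongside the number of problems behind it.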

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each major point below and have revised the manuscript to improve clarity, completeness, and verifiability while preserving the core claims supported by the existing experiments and analyses.

Referee: [Experimental setup and results] The manuscript reports no ablation comparing GASING-derived CoT to standard left-to-right CoT or to direct-answer targets on the same arithmetic distribution. Without this control, the claim that the specific pedagogical alignment (rather than generic step-by-step supervision) drives the >80% held-out accuracy cannot be isolated; this directly affects the load-bearing interpretation of the headline result.

Authors: We agree that a direct ablation would strengthen isolation of the GASING-specific left-to-right procedural alignment from generic step-by-step supervision. The manuscript does not report such an ablation, and our headline result is presented as the effectiveness of pedagogically grounded CoT rather than a claim of strict superiority over all alternative CoT formats. The mechanistic evidence (attention-masking on the CoT information graph and phase transitions) is tied to the structure of the GASING traces. In the revision we have added a dedicated limitations paragraph in Section 5 explicitly acknowledging this gap and identifying it as a priority for follow-up experiments.
revision: partial

Referee: [Methods and evaluation] Details on data splits, problem generation procedure, exact baselines for the 'competitive performance against substantially larger models' claim, and error analysis by operation or difficulty are not provided. These omissions prevent verification that the reported accuracy figures fully support the phase-transition and mechanistic conclusions.

Authors: We thank the referee for highlighting these omissions. The revised manuscript expands the Methods section with: explicit data-split statistics (80/10/10 train/validation/test with no instance overlap), the precise problem-generation algorithm (random operand sampling with bounded difficulty per operation), the exact model sizes and reported accuracies used for the competitive-performance comparison, and a new appendix containing per-operation and per-difficulty error breakdowns. These additions directly support verification of the accuracy figures and the reported learning phases.
revision: yes
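The split and generation procedure described in this response can be sketched as below. This is an assumed reading, not the authors' code: the operator set (`+`, `-`, `*`), the operand bound, the seed, and the function names are placeholders. Problems are deduplicated before splitting so that no instance can appear in more than one partition.

```python
import random


def generate_problems(n: int, max_operand: int, seed: int = 0):
    """Sample n distinct (a, op, b) problems with operands in [0, max_operand]."""
    ops = ["+", "-", "*"]  # assumed operator set
    if n > len(ops) * (max_operand + 1) ** 2:
        raise ValueError("not enough distinct problems for the requested n")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add((rng.randint(0, max_operand), rng.choice(ops), rng.randint(0, max_operand)))
    return sorted(seen)


def split_80_10_10(problems, seed: int = 0):
    """Shuffle unique problems and cut them into train / validation / test."""
    items = list(dict.fromkeys(problems))  # drop duplicates, keep first occurrence
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


problems = generate_problems(1000, max_operand=99)
train, val, test = split_80_10_10(problems)
```

Instance-level disjointness is a weak guarantee for arithmetic: `3 + 5` and `5 + 3` count as different instances here, so a stricter split might also hold out whole operand ranges.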

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical training and held-out evaluation

The paper's central results derive from training a GPT-2 model from scratch on next-token prediction using GASING-serialized CoT traces, followed by accuracy measurement on held-out arithmetic problems and mechanistic probes (attention masking, residual probing, logit lens). No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the >80% held-out accuracy and phase observations are outputs of the training run itself rather than re-expressions of inputs. The GASING alignment assumption is stated as a modeling choice but does not create a definitional loop. This is a standard empirical setup with independent falsifiability on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, invented entities, or ad-hoc axioms beyond standard language-model training assumptions. The key domain assumption is that serializing an external pedagogical procedure into CoT traces will produce effective supervision under next-token prediction.

axioms (1)

domain assumption: Next-token prediction on serialized GASING execution traces is sufficient for the model to internalize arithmetic procedures
Stated in the training description: the model is trained from scratch using only a next-token prediction objective on the CoT data.

Reference graph

Works this paper leans on

27 extracted references · 21 canonical work pages · 12 internal anchors

[1] Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623.
[2] Bogdan, P. C., et al. (2025). Thought Anchors: Which LLM Reasoning Steps Matter? arXiv:2506.19143.
[3] Charton, F. (2021). Linear algebra with transformers. arXiv:2112.01898.
[4] Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
[5] Dziri, N., et al. (2023). Faith and Fate: Limits of Transformers on Compositionality. arXiv:2305.18654.
[6] Gunasekar, S., et al. (2023). Textbooks Are All You Need. arXiv:2306.11644.
[7] Hupkes, D., et al. (2019). Compositionality decomposed: how do neural networks generalise? arXiv:1908.08351.
[8] Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
[9] Krakauer, D. C., Mitchell, M., & Krakauer, J. W. (2026). Large language models and emergence: a complex systems perspective. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 384(2320).
[10] Lee, N., et al. (2023). Teaching Arithmetic to Small Transformers. arXiv:2307.03381.
[11] Li, K., et al. (2022). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. arXiv:2210.13382.
[12] Lightman, H., et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.
[13] Lumbantobing, A. B., & Situngkir, H. (2026). Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago. BFI Working Paper Series.
[14] Nanda, N., et al. (2023). Progress measures for grokking via mechanistic interpretability. arXiv:2301.05217.
[15] Nye, M., et al. (2021). Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv:2112.00114.
[16] Power, A., et al. (2022). Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177.
[17] Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
[18] Saha, S., et al. (2025). KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? arXiv:2507.11408.
[19] Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? arXiv:2304.15004.
[20] Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
[21] Situngkir, H., Lumbantobing, A. B., & Surya, Y. (2026). Syllabic Agglutinative Tokenizations for Indonesian LLM: A Study from Gasing Literacy Learning System. BFI Working Paper Series.
[22] Situngkir, H., Siringo, K., & Lumbantobing, A. B. (2026). Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language. BFI Working Paper Series.
[23] Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762.
[24] Wang, Z. (2025). LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models. arXiv:2503.11667.
[25] Wei, J., et al. (2022a). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
[26] Wei, J., et al. (2022b). Emergent Abilities of Large Language Models. arXiv:2206.07682.
[27] Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
