Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

Benjamin Heinzerling; Go Kamoda; Keisuke Sakaguchi; Kentaro Inui; Kosuke Sato; Mutsumi Sasaki; Ryosuke Takahashi

arxiv: 2606.03982 · v1 · pith:4NXNZQF4new · submitted 2026-06-02 · 💻 cs.CL

Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3

keywords language models · quantity comparison · heuristics · measurement units · numerals · systematic errors · subspace interventions

The pith

Language models compare quantities by applying separate heuristics to numerals and unit scales rather than converting both to a shared representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how language models handle comparisons such as 110 cm versus 1.2 m across different unit systems. Accuracy falls when the two quantities sit close to the decision boundary, and the resulting mistakes follow regular patterns that linear surrogate models can recover from the size of the numerical difference and the size of the unit-scale difference. Interventions that edit subspaces aligned with these two variables change the model's answers in the predicted direction. The authors conclude that models rely on a collection of simple cues tied to numbers and units instead of first mapping both quantities onto one common scale. Readers may care because the finding points to a concrete limitation in current models' quantitative reasoning.

Core claim

Language models compare quantities with measurement units through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation. Accuracy degrades near the comparison boundary, the resulting errors are systematic, linear surrogate models predict model preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output.
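To make the contrast concrete, here is a minimal Python sketch, not taken from the paper, of the two cues and of the exact shared-scale rule. The unit table `UNIT_TO_METRE`, the function names, and the base-10 log definitions are assumptions made for illustration; the paper's own feature definitions may differ. The point it shows is that an exact comparison equals the sign of the two cues summed with equal weight, so any other weighting of the same cues behaves as a heuristic.

```python
import math

# Hypothetical unit table (metres per unit); metric length only, for illustration.
UNIT_TO_METRE = {"mm": 1e-3, "cm": 1e-2, "m": 1.0, "km": 1e3}

def cues(n1, u1, n2, u2):
    """Return (numerical log-difference, unit-scale log-difference), base 10."""
    num_log_diff = math.log10(n1 / n2)
    unit_log_diff = math.log10(UNIT_TO_METRE[u1] / UNIT_TO_METRE[u2])
    return num_log_diff, unit_log_diff

def exact_larger(n1, u1, n2, u2):
    """Shared-scale rule: quantity 1 is larger iff the two cues sum to > 0,
    because log10(n1*s1 / (n2*s2)) = log10(n1/n2) + log10(s1/s2)."""
    num, unit = cues(n1, u1, n2, u2)
    return num + unit > 0

num, unit = cues(110, "cm", 1.2, "m")
print(round(num, 3), round(unit, 3), exact_larger(110, "cm", 1.2, "m"))
```

For 110 cm versus 1.2 m the numeral cue is strongly positive and the unit cue is strongly negative; they nearly cancel, which is the kind of near-boundary case where the paper reports degraded accuracy.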

What carries the argument

Bag of heuristics over numerals and units, recovered through linear surrogate models on numerical and unit differences together with subspace interventions that causally affect outputs.

If this is right

Models do not perform an exact shared-scale conversion before deciding which quantity is larger.
Comparison decisions can be approximated by simple additive cues from numeral size and unit scale size.
Editing subspaces tied to those cues reliably alters the model's comparison outputs.
Errors concentrate where the two quantities are close after unit scaling.
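The additive-cue idea in the list above can be sketched as a two-feature least-squares fit. This is an illustrative sketch only: the function name, the no-intercept form, and the synthetic rows (generated with a hypothetical unit weight of 0.6) are assumptions, not the paper's surrogate models or data.

```python
def fit_two_cue_surrogate(rows):
    """Least-squares fit of margin ~ w_num * num_diff + w_unit * unit_diff.

    rows: list of (num_diff, unit_diff, margin). No intercept (an assumption).
    Solves the 2x2 normal equations directly.
    """
    sxx = sum(x * x for x, _, _ in rows)
    syy = sum(y * y for _, y, _ in rows)
    sxy = sum(x * y for x, y, _ in rows)
    sxm = sum(x * m for x, _, m in rows)
    sym = sum(y * m for _, y, m in rows)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("cues are collinear; weights are not identifiable")
    w_num = (sxm * syy - sym * sxy) / det
    w_unit = (sym * sxx - sxm * sxy) / det
    return w_num, w_unit

# Synthetic data: a hypothetical model that underweights the unit cue.
points = [(1.96, -2.0), (0.5, 0.0), (-1.0, 3.0), (2.2, -3.0), (-0.3, 1.0)]
rows = [(x, y, 1.0 * x + 0.6 * y) for x, y in points]
w_num, w_unit = fit_two_cue_surrogate(rows)

# Score 110 cm vs 1.2 m (cues roughly 1.962 and -2.0) with the fitted weights.
score = w_num * 1.962 + w_unit * -2.0
print(round(w_num, 3), round(w_unit, 3), score > 0)
```

An exact shared-scale rule corresponds to equal weights on both cues. With the hypothetical unequal weights above, the surrogate score for 110 cm versus 1.2 m comes out positive, so the first quantity is preferred even though it is smaller; that is one way a systematic near-boundary error could arise.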

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same heuristic strategy may cause failures on quantitative tasks that require chaining multiple conversions.
Training objectives that explicitly reward shared-scale representations could reduce the observed boundary errors.
The pattern may extend to other symbolic comparisons that involve named scales or conversion factors.

Load-bearing premise

The linear surrogate models and subspace interventions correctly isolate the actual decision mechanisms used by the LM rather than capturing correlated but non-causal features.

What would settle it

An experiment in which the identified subspaces are edited yet the model's quantity-comparison choices remain unchanged, or in which accuracy stays high even when quantities lie near the decision boundary.
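The second half of that test, whether accuracy stays high near the boundary, amounts to binning accuracy by distance from the boundary. A minimal sketch follows; the margin definition (absolute base-10 log-ratio of the two quantities on a shared scale), the bin edges, and the toy records are all assumptions made for illustration, not values from the paper.

```python
from bisect import bisect_right

def accuracy_by_margin(records, edges):
    """Group (margin, is_correct) records into bins by |margin|.

    edges: ascending bin boundaries; bin 0 holds |margin| < edges[0].
    Returns a list of (bin_index, count, accuracy) for non-empty bins.
    """
    hits = [0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for margin, ok in records:
        b = bisect_right(edges, abs(margin))
        counts[b] += 1
        hits[b] += bool(ok)
    return [(b, counts[b], hits[b] / counts[b])
            for b in range(len(counts)) if counts[b]]

# Toy records: two near-boundary comparisons, three far from it.
toy = [(0.04, False), (-0.04, True), (0.5, True), (0.7, True), (2.0, True)]
table = accuracy_by_margin(toy, edges=[0.1, 1.0])
print(table)
```

If bin 0 showed accuracy comparable to the wider-margin bins on real model outputs, that would count against the boundary-degradation claim.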

Figures

Figures reproduced from arXiv: 2606.03982 by Benjamin Heinzerling, Go Kamoda, Keisuke Sakaguchi, Kentaro Inui, Kosuke Sato, Mutsumi Sasaki, Ryosuke Takahashi.

**Figure 2.** Accuracy of comparisons between quantities …

**Figure 5.** Layer-wise DAS intervention accuracy at the …

**Figure 6.** Accuracy of comparisons between quantities with measurement units, grouped by Quantity Margin for …

**Figure 7.** Accuracy of comparisons between quantities with measurement units, grouped by Quantity Margin for …

**Figure 8.** Accuracy of comparisons between quantities with measurement units, grouped by Quantity Margin for …

**Figure 9.** Accuracy of comparisons between quantities with measurement units, grouped by NumLogDiff and …

**Figure 10.** Surrogate Predictivity (Pred) across all unit settings in Tbl. 1 for Qwen3-4B-Base, Qwen3-8B-Base, and OLMo-3-1025-7B. Rows correspond to representative surrogate feature sets, and columns correspond to unit settings. $\mathrm{Pred} = R^2_{\mathrm{LM}} \times (R^2_{\mathrm{LM}} / R^2_{\mathrm{Rule}})$, where $R^2_{\mathrm{LM}}$ measures prediction of the LM log-probability margin and $R^2_{\mathrm{Rule}}$ measures prediction of Quantity Margin. Higher Pred means that the feature set be…

**Figure 11.** Surrogate Predictivity (Pred) across all unit settings in Tbl. 1 for Qwen3-4B-Base under two prompt templates from Tbl. 4. P0 is the larger-comparison postposed prompt, where the quantities appear after the comparison phrase, e.g., “Which is larger, q1 or q2?” P1 is the larger-comparison preposed prompt, where the quantities appear before the comparison phrase, e.g., “Between q1 or q2, which is larger?” R…

**Figure 12.** Surrogate Predictivity (Pred) across all unit settings in Tbl. 1 for Qwen3-4B-Base. Rows correspond to all feature sets in Tbl. 7, and columns correspond to unit settings. $\mathrm{Pred} = R^2_{\mathrm{LM}} \times (R^2_{\mathrm{LM}} / R^2_{\mathrm{Rule}})$, where $R^2_{\mathrm{LM}}$ measures prediction of the LM log-probability margin and $R^2_{\mathrm{Rule}}$ measures prediction of Quantity Margin. Higher Pred means that the feature set better predicts LM behavior and is not simply fi…

**Figure 13.** Layer-wise DAS intervention accuracy at the last token of $u_2$ for Qwen3-4B-Base, Qwen3-8B-Base, and …

**Figure 14.** Layer-wise DAS intervention accuracy at the last token of …

**Figure 15.** Layer-wise DAS intervention accuracy at the last token of …

**Figure 16.** DAS intervention accuracy for Qwen3-4B-Base on the metric length setting as the total dimensionality …
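The Pred score quoted in the captions for Figures 10 and 12 is simple to compute once the two R² values are available. The sketch below is a hedged illustration: the helper names are made up, how the paper computes each R² (for example, on held-out data or not) is not stated in the captions, and the example numbers are invented.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def surrogate_predictivity(r2_lm, r2_rule):
    """Pred = R2_LM * (R2_LM / R2_Rule), as written in the figure captions."""
    return r2_lm * (r2_lm / r2_rule)

# Invented values: same fit to the LM margin, different fit to Quantity Margin.
pred_a = surrogate_predictivity(r2_lm=0.9, r2_rule=0.6)
pred_b = surrogate_predictivity(r2_lm=0.9, r2_rule=1.0)
print(round(pred_a, 3), round(pred_b, 3))
```

By construction, for a fixed fit to the LM margin, the score is larger when the same features fit Quantity Margin less well, which is why the first invented case scores higher than the second.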

Original abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LMs compare quantities with units via numeral-difference and unit-scale heuristics rather than shared-scale conversion, shown via surrogates and subspace interventions.

The main thing to know is that these models appear to rely on simple difference-based shortcuts for quantity comparisons instead of converting both sides to a common scale. The work tests this across several unit systems in controlled settings and finds accuracy falls off near the decision boundary, with errors that line up with numeral size gaps and unit scale gaps.

What stands out is the combination of linear surrogate models that predict the LM's preferences from those two cues and causal interventions on aligned subspaces that actually shift the outputs. That moves beyond pure correlation and gives a concrete handle on the behavior. The experiments are systematic enough to make the pattern believable.

The soft spot is the limited detail on how the subspaces were located and validated. Without a clearer description of the identification process, controls for alternative explanations, or checks against spurious correlations, the interventions may be moving correlated signals rather than the actual decision rule. This is where the stress-test concern lands: the evidence supports heuristic use near boundaries but does not yet rule out an internal conversion process that happens to be sensitive to the same differences.

This is aimed at people studying numerical reasoning limits in LMs. A reader interested in probing methods or in fixing these shortcuts will get usable takeaways. The behavioral results are solid enough to merit a serious referee even if the causal claims need more scaffolding in revision.

Referee Report

1 major / 0 minor

Summary. The paper studies how language models compare quantities with units (e.g., 110 cm vs. 1.2 m) across unit systems. It reports degraded accuracy near comparison boundaries, systematic errors predictable by linear surrogate models using numerical-difference and unit-scale-difference cues, and output shifts from causal interventions on subspaces aligned with those variables. The central claim is that LMs rely on a bag of number-specific and unit-specific heuristics rather than first converting expressions to an exact shared-scale representation.

Significance. If the mechanistic claims hold after proper validation, the work would provide evidence against precise internal quantity conversion in LMs and in favor of heuristic strategies, with implications for interpretability and numerical reasoning in language models. The combination of behavioral error analysis, surrogate modeling, and subspace interventions represents a constructive attempt to move beyond surface-level accuracy metrics.

major comments (1)

[Abstract / Methods] The central claim that subspace interventions demonstrate heuristic use (rather than an internal conversion process sensitive to the same cues near boundaries) rests on the assumption that the subspaces isolate the decision mechanism. No details are provided on subspace identification (e.g., probes, patching, or clustering), validation against alternative subspaces, or counterfactual controls, all of which are load-bearing for the causal interpretation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point regarding insufficient methodological detail on subspace interventions is well-taken, and we address it directly below with a commitment to revision.

Referee: [Abstract / Methods] The central claim that subspace interventions demonstrate heuristic use (rather than an internal conversion process sensitive to the same cues near boundaries) rests on the assumption that the subspaces isolate the decision mechanism. No details are provided on subspace identification (e.g., probes, patching, or clustering), validation against alternative subspaces, or counterfactual controls, all of which are load-bearing for the causal interpretation.

Authors: We agree that the manuscript currently lacks sufficient detail on subspace identification and validation, which weakens the causal claims. In the revised manuscript we will expand the Methods section to specify: (1) subspace identification via linear probes trained to predict the numerical-difference and unit-scale-difference features from residual stream activations, followed by selection of the top principal components aligned with these features; (2) patching experiments that replace activations in the identified subspaces while holding other components fixed; and (3) explicit validation including comparison to randomly sampled subspaces of equal dimensionality, subspaces aligned with unrelated features (e.g., token frequency), and additional counterfactual controls that intervene on the same subspaces but with inverted feature signs. These additions will directly test whether the observed output shifts are specific to the heuristic cues rather than reflecting a general sensitivity near boundaries.

Revision: yes
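The patching step in item (2) can be illustrated with a generic subspace interchange: replace the component of a base activation that lies inside a subspace with the corresponding component of a source activation, leaving the orthogonal remainder untouched. This is a minimal sketch of that generic operation in plain Python, not the authors' pipeline; the function name and the toy vectors are hypothetical, and the basis is assumed to be orthonormal.

```python
def interchange(base, source, basis):
    """Swap the part of `base` inside span(basis) for that of `source`.

    basis: list of orthonormal vectors (an assumption; orthonormalise first
    if needed). Components of `base` orthogonal to the basis are kept.
    """
    out = list(base)
    for q in basis:
        # Coordinate of (source - base) along this basis direction.
        coeff = sum(qi * (s - b) for qi, s, b in zip(q, source, base))
        out = [o + coeff * qi for o, qi in zip(out, q)]
    return out

base = [1.0, 2.0, 3.0]       # toy "activation" for the base prompt
source = [10.0, 20.0, 30.0]  # toy "activation" for the counterfactual prompt
patched = interchange(base, source, basis=[[0.6, 0.8, 0.0]])
print(patched)
```

The random-subspace control in item (3) would call the same function with a randomly drawn orthonormal basis of equal dimensionality and compare how often the model's answer flips.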

Circularity Check

0 steps flagged

Empirical behavioral study with no derivations reducing to inputs

The paper reports controlled experiments on LM quantity comparison, accuracy degradation near boundaries, linear surrogate models fitted to predict preferences from numerical/unit cues, and subspace interventions. These are observational and interventional analyses on model outputs, not a derivation chain with equations or self-citations that reduce the central claim to its own fitted inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The claim rests on external experimental observations rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical behavioral study with no mathematical derivations, free parameters, axioms, or invented entities; the central claim rests on experimental observations of model outputs.
