Efficient numeracy in language models through single-token number embeddings

Daniel Rueckert; Georgios Kaissis; Jonathan Mengedoht; Linus Kreitner; Martin J. Menten; Paul Hager

arxiv: 2510.06824 · v2 · pith:HXXJ25ZAnew · submitted 2025-10-08 · 💻 cs.LG

Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

Pith reviewed 2026-05-21 21:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords language models · numeracy · tokenization · arithmetic · BitTokens · IEEE 754 · numerical reasoning · single-token embeddings

The pith

Representing numbers as single IEEE 754 binary tokens lets small language models perform basic arithmetic nearly perfectly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current language models split numbers across multiple tokens, which forces them to use long reasoning chains or external tools even for simple calculations. The paper proposes BitTokens, an encoding that packs any number into one token by directly using its IEEE 754 floating-point binary form. Experiments show that small models trained with this encoding can internally learn and execute exact algorithms for addition, subtraction, multiplication, and division. This removes a major source of numerical inefficiency and opens the door to solving longer problems without extra machinery. A reader would care because it targets a practical bottleneck in applying models to data-heavy scientific and engineering work.

Core claim

By mapping every number to a single token via its raw IEEE 754 binary floating-point representation, language models receive structured numerical input that lets even small models discover and apply exact arithmetic rules, achieving near-perfect performance on basic operations without multi-token splits, external tools, or corrections.

What carries the argument

BitTokens, the single-token encoding of numbers that uses their IEEE 754 binary representation to supply the model with compact, structured numerical input for learning arithmetic algorithms.
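
As a rough illustration of the raw material such an encoding starts from, the sketch below extracts the sign, exponent, and significand bits of a number. It assumes 64-bit double precision; how the paper turns these bits into an embedding is not shown here, and the helper names `float_to_bits`, `split_fields`, and `bits_to_float` are hypothetical.

```python
import struct

def float_to_bits(x):
    """Return the 64-bit IEEE 754 double-precision pattern of x as a bit string."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")

def split_fields(bits):
    """Split a 64-bit pattern into sign (1 bit), exponent (11 bits), significand (52 bits)."""
    return bits[0], bits[1:12], bits[12:]

def bits_to_float(bits):
    """Inverse of float_to_bits: rebuild the float from its bit string."""
    (x,) = struct.unpack(">d", struct.pack(">Q", int(bits, 2)))
    return x

# -6.25 = -1.5625 * 2**2, so the biased exponent is 1023 + 2 = 1025.
bits = float_to_bits(-6.25)
sign, exponent, significand = split_fields(bits)
print(sign, int(exponent, 2), significand[:8])
```

Every finite double maps to exactly one such 64-bit pattern, which is why a fixed-width bit vector can stand in for a number regardless of how many decimal digits it has.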

If this is right

Models require far fewer reasoning tokens for basic calculations, freeing capacity for longer problem sequences.
Small language models become capable of accurate arithmetic without relying on post-processing or tool calls.
Numerical tasks can be solved internally rather than through decomposition into multiple tokens.
The length and complexity of solvable problems increase because each number consumes only one token.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-token structure might support learning of more advanced operations such as exponentiation or basic linear algebra if training data includes them.
Integration with existing tokenizers could allow hybrid models that switch between BitTokens for numbers and standard tokens for text.
Performance gains may compound in domains like scientific simulation where many numbers appear in sequence.
Testing the encoding on decoder-only models of varying sizes would reveal whether the benefit scales or saturates.

Load-bearing premise

The raw IEEE 754 binary representation of a number supplies enough internal structure for the model to learn exact arithmetic rules without external tools or multi-token workarounds.

What would settle it

Train a small model on BitTokens and test it on a held-out set of additions involving numbers with eight or more significant digits; systematic carry errors or accuracy below 95 percent would falsify the claim that the encoding enables near-perfect internal arithmetic.
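
The test described above could be set up roughly as follows. This is a minimal sketch, not the paper's benchmark: `make_addition_cases`, `exact_match_accuracy`, and `passes_threshold` are hypothetical names, `predict` stands in for whatever wraps the trained model, and operands are restricted to integers of at most 15 digits so that they and their sums stay exactly representable as doubles.

```python
import random

def make_addition_cases(n, min_digits=8, max_digits=15, seed=0):
    """Build n integer addition problems whose operands have >= min_digits significant digits."""
    rng = random.Random(seed)

    def operand():
        digits = rng.randint(min_digits, max_digits)
        value = rng.randint(10 ** (digits - 1), 10 ** digits - 1)
        # A trailing zero would lower the count of significant digits, so nudge it.
        return value + 1 if value % 10 == 0 else value

    cases = []
    for _ in range(n):
        a, b = operand(), operand()
        cases.append((a, b, a + b))
    return cases

def exact_match_accuracy(predict, cases):
    """Fraction of cases where predict(a, b) equals the true sum exactly."""
    hits = sum(1 for a, b, expected in cases if predict(a, b) == expected)
    return hits / len(cases)

def passes_threshold(predict, cases, threshold=0.95):
    """Apply the 95 percent criterion named in the text."""
    return exact_match_accuracy(predict, cases) >= threshold

cases = make_addition_cases(1000)
# An oracle stands in for the model here; a real run would pass the model's predictor.
print(exact_match_accuracy(lambda a, b: a + b, cases))
```

A fuller version would also log which digit positions are wrong, since the text singles out systematic carry errors as the failure signature.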

Figures

Figures reproduced from arXiv: 2510.06824 by Daniel Rueckert, Georgios Kaissis, Jonathan Mengedoht, Linus Kreitner, Martin J. Menten, Paul Hager.

**Figure 1.** LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance.

**Figure 2.** While simple tasks such as addition and comparing numbers are almost perfectly solved by frontier LLMs, ...

**Figure 3.** Difficult numeracy tasks such as multiplication, division, exponentiation, and standard deviation can only be ...

**Figure 4.** While single digit is the superior multi-token strategy, our BitTokens outperforms it as well as all other ...

**Figure 5.** The magnitude distribution of addition pairs in our dataset. Operands with similar exponents are oversampled ...

**Figure 6.** The magnitude distribution of multiplication pairs in our dataset.

**Figure 7.** The benchmark for the frontier LLMs is a 500 sample subset of the BitTokens test, but follows the same ...

**Figure 8.** Due to cost constraints we had to subset the full set of 10,000 test samples to 500 samples per task when ...

Original abstract

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BitTokens, an encoding that maps any number to a single token via its raw IEEE 754 binary floating-point representation. It argues that standard subword tokenization forces LLMs to expend excessive reasoning tokens on even simple arithmetic, and presents experiments claiming that small language models equipped with BitTokens can internally learn exact algorithms for basic operations (addition, subtraction, multiplication, and division) and achieve near-perfect accuracy.

Significance. If the results demonstrate genuine acquisition of arithmetic algorithms rather than memorization, the approach would offer a practical route to more efficient numerical reasoning inside the model itself, potentially increasing the length and complexity of calculations feasible without external tools. The core idea of leveraging the fixed binary structure of floating-point numbers for token embeddings is simple and directly addresses a documented inefficiency in current LLM tokenizers.

major comments (2)

[§4] §4 (Experimental Setup): The manuscript provides no information on whether test operands were drawn from ranges, exponents, or mantissa distributions disjoint from the training data. Without this, high accuracy on held-out examples cannot distinguish between learning a general arithmetic procedure and rote association within the learned embedding space or attention patterns, which is load-bearing for the central claim that BitTokens enable internal algorithmic solutions.
[§5] §5 (Results): The abstract and results sections assert 'nearly perfect' performance but report neither exact error rates, per-operation accuracy tables, baseline comparisons against standard multi-token tokenization, nor ablation studies isolating the contribution of the IEEE 754 bit-pattern embedding. These omissions prevent quantitative evaluation of the claimed improvement.

minor comments (2)

[§2] The desiderata listed for number encodings in §2 are useful but would benefit from explicit mapping to which properties BitTokens satisfy versus prior methods, ideally in a table.
[§3] Notation for the BitToken embedding construction (how the 64-bit pattern is turned into a token ID and embedding) should be formalized with an equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and describe the revisions we will incorporate to improve clarity and completeness.

Referee: [§4] §4 (Experimental Setup): The manuscript provides no information on whether test operands were drawn from ranges, exponents, or mantissa distributions disjoint from the training data. Without this, high accuracy on held-out examples cannot distinguish between learning a general arithmetic procedure and rote association within the learned embedding space or attention patterns, which is load-bearing for the central claim that BitTokens enable internal algorithmic solutions.

Authors: We agree that explicit documentation of disjoint distributions is essential to substantiate claims of algorithmic learning rather than memorization. The data generation procedure in our experiments did enforce disjoint ranges, exponents, and mantissa distributions between train and test sets, but this was insufficiently detailed in the original manuscript. In the revision we will expand §4 with a precise description of the sampling method, including the specific ranges, exponent bounds, and mantissa constraints used to guarantee disjointness. This addition directly addresses the concern and strengthens the evidence for generalization.
revision: yes
Referee: [§5] §5 (Results): The abstract and results sections assert 'nearly perfect' performance but report neither exact error rates, per-operation accuracy tables, baseline comparisons against standard multi-token tokenization, nor ablation studies isolating the contribution of the IEEE 754 bit-pattern embedding. These omissions prevent quantitative evaluation of the claimed improvement.

Authors: We concur that the current presentation would benefit from greater quantitative rigor. The manuscript will be revised to include exact per-operation error rates, full accuracy tables, direct baseline comparisons against standard multi-token tokenization, and ablation experiments that isolate the contribution of the IEEE 754 bit-pattern embedding. These results will be added to §5 (and referenced in the abstract) to enable precise evaluation of the claimed gains.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical encoding proposal with independent experimental validation

The paper introduces BitTokens as a single-token encoding based on raw IEEE 754 bit patterns, lists desiderata for number encodings, and reports experimental results showing small models achieve near-perfect accuracy on basic arithmetic tasks. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains. The derivation chain consists of a proposed representation followed by direct empirical measurement against held-out arithmetic examples; results are not tautological with the input encoding. This is the expected non-finding for an empirical methods paper whose claims rest on observable performance rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that binary floating-point bit patterns, when embedded as single tokens, enable transformer layers to discover arithmetic algorithms; no free parameters are introduced beyond standard model training, and no new entities are postulated.

axioms (1)

domain assumption: IEEE 754 binary floating-point representation can be directly used as token embeddings for numerical values.
Invoked when the paper states that any number is represented as a single token using its IEEE 754 binary floating-point representation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery from Law of Logic unclear

BitTokens uses ... IEEE 754 ... sign, exponent, and significand ... bit-wise arithmetic over Z2 results in coefficient-wise operations reducing to Boolean gates: (x·y) mod 2 = x∧y, (x+y) mod 2 = x⊕y
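
The two identities quoted in this passage can be checked exhaustively, and a small adder shows how they compose. The sketch below is only an illustration on unsigned integers, not the paper's mechanism or a floating-point adder; `add_bits`, `to_bits`, and `from_bits` are hypothetical helper names.

```python
from itertools import product

# Check the two identities from the passage on every pair of bits.
for x, y in product((0, 1), repeat=2):
    assert (x * y) % 2 == (x & y)   # multiplication mod 2 is AND
    assert (x + y) % 2 == (x ^ y)   # addition mod 2 is XOR

def add_bits(a_bits, b_bits):
    """Ripple-carry addition of two equal-length little-endian bit lists."""
    carry, out = 0, []
    for x, y in zip(a_bits, b_bits):
        partial = x ^ y                      # sum bit before the incoming carry
        out.append(partial ^ carry)          # sum bit after the incoming carry
        carry = (x & y) | (partial & carry)  # carry into the next position
    out.append(carry)
    return out

def to_bits(n, width):
    """Little-endian bit list of a non-negative integer."""
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

print(from_bits(add_bits(to_bits(5, 4), to_bits(6, 4))))  # 11
```

The carry term is where pure coefficient-wise operations stop being enough: each output bit depends on all lower positions, so integer addition needs a chain of gates rather than one XOR per bit.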

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
cs.AI 2026-05 unverdicted novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

A Triadic Suffix Tokenization Scheme for Numerical Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

Triadic Suffix Tokenization groups digits into triads with fixed magnitude suffixes to make order-of-magnitude relationships explicit at the token level for LLMs.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Large language models for mathematical reasoning: Progresses and challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225–237,

work page 2025
[3]

The impact of large language models on scientific discovery: a preliminary study using GPT-4

Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using GPT-4. arXiv preprint arXiv:2311.07361,

work page arXiv
[4]

Exploring the numerical reasoning capabilities of language models: A comprehensive analysis on tabular data

Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil, Oana Cocarascu, and Elena Simperl. Exploring the numerical reasoning capabilities of language models: A comprehensive analysis on tabular data. In The 2023 Conference on Empirical Methods in Natural Language Processing,

work page 2023
[5]

ICML 2024 Tutorial: Physics of Language Models, July

Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, July

work page 2024
[6]

The lookahead limitation: Why multi-operand addition is hard for LLMs

Project page: https://physics.allen-zhu.com/. Tanja Baeumel, Josef van Genabith, and Simon Ostermann. The lookahead limitation: Why multi-operand addition is hard for LLMs. arXiv preprint arXiv:2502.19981,

work page arXiv
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

DeepSeek-V3 Technical Report

URL https://arxiv.org/abs/2412.19437. Benito E Flores. A pragmatic view of accuracy measurement in forecasting. Omega, 14(2):93–98,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

xval: A continuous number encoding for large language models

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Holden Parker, et al. xval: A continuous number encoding for large language models. In NeurIPS 2023 AI for Science Workshop,

work page 2023
[11]

Middleware for llms: Tools are instrumental for language agents in complex environments

Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, and Yu Su. Middleware for llms: Tools are instrumental for language agents in complex environments. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7646–7663,

work page 2025
[12]

LUNA: language understanding with number augmentations on transformers via number plugins and pre-training

Hongwei Han, Jialiang Xu, Mengyu Zhou, Yijia Shao, Shi Han, and Dongmei Zhang. LUNA: language understanding with number augmentations on transformers via number plugins and pre-training. arXiv preprint arXiv:2212.02691,

work page arXiv
[13]

Measuring AI ability to complete long tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, et al. Measuring AI ability to complete long tasks. arXiv preprint arXiv:2503.14499,

work page arXiv
[14]

Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language models

Haoyang Li, Xuejia Chen, Zhanchao Xu, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, et al. Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language models. arXiv preprint arXiv:2502.11075,

work page arXiv
[15]

Investigating the limitations of transformers with simple arithmetic tasks

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019,

work page arXiv
[16]

TALM: Tool augmented language models

URL https://openai.com/index/introducing-gpt-5/. Aaron Parisi, Yao Zhao, and Noah Fiedel. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255,

work page arXiv
[17]

The FineWeb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

work page 2025
[18]

Impact of pretraining term frequencies on few-shot numerical reasoning

Yasaman Razeghi, Robert L Logan Iv, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854,

work page 2022
[19]

NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning

Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, and Assaf Arbelle. NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning. arXiv preprint arXiv:2404.00459,

work page arXiv
[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

Aaditya K Singh and DJ Strouse. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. arXiv preprint arXiv:2402.14903,

work page arXiv
[22]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

MMLU-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processin...

work page arXiv
[25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, and Muhan Zhang. Number cookbook: Number understanding of language models and how to improve it. In The Thirteenth Intern...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Interpreting and improving large language models in arithmetic calculation

Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-ming Cheung, Xinmei Tian, Xu Shen, and Jieping Ye. Interpreting and improving large language models in arithmetic calculation. In Proceedings of the 41st International Conference on Machine Learning, pages 59932–59950,

work page 2025
[27]

FoNE: Precise Single-Token Number Embeddings via Fourier Features

Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, and Vatsal Sharan. FoNE: Precise single-token number embeddings via fourier features. arXiv preprint arXiv:2502.09741,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Transformers can achieve length generalization but not robustly

Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models,

work page 2024
[29]

algorithm

A Benchmarking dataset. A.1 Tasks. A.1.1 Comparing numbers. The most elementary numeracy task is comparing two random numbers, n1 and n2. As this task is trivial for modern LLMs, we increase the difficulty to multiple operands. Determining the minimum and maximum: Determining the minimum or maximum of a list of numbers is in essen...

work page 2025
[30]

We choose the individual signs using the following schema: (1) both positive in 40% of cases, (2) only one operand negative in 40% of cases and (3) both negative in 20% of cases

We then round the operands by p1 and p2 respectively and randomly swap operands. We choose the individual signs using the following schema: (1) both positive in 40% of cases, (2) only one operand negative in 40% of cases and (3) both negative in 20% of cases. We randomly choose an operator op ∈ {+, −}. Multiplication: We handle precision of the operands and their s...

work page 2025
[31]

This directly correlates with the number of steps involved to solve the task and the numbers’ precision

Multiplication: Given the fixed-point number representations of the operands in base-2 or base-10, the difficulty δMultiplication is given by the sum of their non-zero digits. This directly correlates with the number of steps involved to solve the task and the numbers’ precision. Division: Similar to Multiplication, the difficulty δMultiplication is given by ...

work page 2025
[32]

[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))

We employ a cosine learning rate scheduler and allocate 10% of the training tokens for warm-up. Additionally, the Muon optimizer uses 300 momentum warm-up steps. Model performance is evaluated every 32 steps for 2 steps on a small validation set, and the sampling ratio is dynamically adjusted based on current results. After training, we select the checkpo...

work page 2025
[34]

Finally, appending the reciprocal to the encoding improves performance with a negligible performance overhead (Table 10)

yields clear improvements over direct multitask training, especially on arithmetic tasks, confirming its necessity for stable convergence. Finally, appending the reciprocal to the encoding improves performance with a negligible performance overhead (Table 10). Metric Sum Product Concat Zero Pad Weighted Weighted + Sum Min/Max↑Exact match acc 0.999 0.996 0...

work page arXiv 2025

[1] [1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Large language models for mathematical reasoning: Progresses and challenges

9 APREPRINT- OCTOBER9, 2025 Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225–237,

work page 2025

[3] [3]

arXiv.org

Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using GPT-4.arXiv preprint arXiv:2311.07361,

work page arXiv

[4] [4]

Exploring the numerical reasoning capabilities of language models: A comprehensive analysis on tabular data

Mubashara Akhtar, Abhilash Shankarampeta, Vivek Gupta, Arpit Patil, Oana Cocarascu, and Elena Simperl. Exploring the numerical reasoning capabilities of language models: A comprehensive analysis on tabular data. InThe 2023 Conference on Empirical Methods in Natural Language Processing,

work page 2023

[5] [5]

ICML 2024 Tutorial: Physics of Language Models, July

Zeyuan Allen-Zhu. ICML 2024 Tutorial: Physics of Language Models, July

work page 2024

[6] [6]

allen-zhu.com/

Project page:https://physics. allen-zhu.com/. Tanja Baeumel, Josef van Genabith, and Simon Ostermann. The lookahead limitation: Why multi-operand addition is hard for LLMs.arXiv preprint arXiv:2502.19981,

work page arXiv

[7] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

DeepSeek-V3 Technical Report

URLhttps://arxiv.org/abs/2412.19437. Benito E Flores. A pragmatic view of accuracy measurement in forecasting.Omega, 14(2):93–98,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

xval: A continuous number encoding for large language models

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Holden Parker, et al. xval: A continuous number encoding for large language models. InNeurIPS 2023 AI for Science Workshop,

work page 2023

[11] [11]

Middleware for llms: Tools are instrumental for language agents in complex environments

10 APREPRINT- OCTOBER9, 2025 Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, and Yu Su. Middleware for llms: Tools are instrumental for language agents in complex environments. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7646–7663,

work page 2025

[12] [12]

LUNA: language understanding with number augmentations on transformers via number plugins and pre-training.arXiv preprint arXiv:2212.02691,

Hongwei Han, Jialiang Xu, Mengyu Zhou, Yijia Shao, Shi Han, and Dongmei Zhang. LUNA: language understanding with number augmentations on transformers via number plugins and pre-training.arXiv preprint arXiv:2212.02691,

work page arXiv

[13] [13]

Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney V on Arx, et al. Measuring AI ability to complete long tasks.arXiv preprint arXiv:2503.14499,

work page arXiv

[14] [14]

Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language models.arXiv preprint arXiv:2502.11075,

Haoyang Li, Xuejia Chen, Zhanchao Xu, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, et al. Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language models.arXiv preprint arXiv:2502.11075,

work page arXiv

[15] [15]

Investigating the limitations of transformers with simple arithmetic tasks

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019,

[16]

TALM: Tool augmented language models

URL https://openai.com/index/introducing-gpt-5/.

Aaron Parisi, Yao Zhao, and Noah Fiedel. TALM: Tool augmented language models. arXiv preprint arXiv:2205.12255,

[17]

The FineWeb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

[18]

Impact of pretraining term frequencies on few-shot numerical reasoning

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854,

[19]

NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning

Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, and Assaf Arbelle. NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning. arXiv preprint arXiv:2404.00459,

[20]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[21]

Tokenization counts: The impact of tokenization on arithmetic in frontier LLMs

Aaditya K Singh and DJ Strouse. Tokenization counts: The impact of tokenization on arithmetic in frontier LLMs. arXiv preprint arXiv:2402.14903,

[22]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

[23]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

[24]

MMLU-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Processin...

[25]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a.

Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, and Muhan Zhang. Number cookbook: Number understanding of language models and how to improve it. In The Thirteenth Intern...

[26]

Interpreting and improving large language models in arithmetic calculation

Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-ming Cheung, Xinmei Tian, Xu Shen, and Jieping Ye. Interpreting and improving large language models in arithmetic calculation. In Proceedings of the 41st International Conference on Machine Learning, pages 59932–59950,

[27]

FoNE: Precise Single-Token Number Embeddings via Fourier Features

Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, and Vatsal Sharan. FoNE: Precise single-token number embeddings via Fourier features. arXiv preprint arXiv:2502.09741,

[28]

Transformers can achieve length generalization but not robustly

Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models,

[29]

algorithm

A Benchmarking dataset

A.1 Tasks

A.1.1 Comparing numbers

The most elementary numeracy task is comparing two random numbers, n1 and n2. As this task is trivial for modern LLMs, we increase the difficulty to multiple operands.

Determining the minimum and maximum. Determining the minimum or maximum of a list of numbers is in essen...

[30]

We choose the individual signs using the following schema: (1) both positive in 40% of cases, (2) only one operand negative in 40% of cases, and (3) both negative in 20% of cases

We then round the operands by p1 and p2 respectively and randomly swap operands. We choose the individual signs using the following schema: (1) both positive in 40% of cases, (2) only one operand negative in 40% of cases, and (3) both negative in 20% of cases. We randomly choose an operator op ∈ {+, −}.

Multiplication. We handle precision of the operands and their s...
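The sign schema and operator choice described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function names are hypothetical, and choosing which operand becomes negative uniformly at random in case (2) is an assumption the text does not state.

```python
import random
from typing import Tuple


def sample_signs(rng: random.Random) -> Tuple[int, int]:
    """Return (sign1, sign2) following the 40/40/20 schema from the text."""
    u = rng.random()
    if u < 0.4:
        return 1, 1  # (1) both operands positive
    if u < 0.8:
        # (2) exactly one operand negative; picking which one uniformly
        # is an assumption, the text only fixes the 40% share.
        return (-1, 1) if rng.random() < 0.5 else (1, -1)
    return -1, -1  # (3) both operands negative


def sample_operator(rng: random.Random) -> str:
    """Choose the operator uniformly from {+, -}."""
    return rng.choice(["+", "-"])


if __name__ == "__main__":
    rng = random.Random(0)
    s1, s2 = sample_signs(rng)
    print(s1, s2, sample_operator(rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible for a fixed seed.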

[31]

This directly correlates with the number of steps involved to solve the task and the numbers’ precision

Multiplication. Given the fixed-point number representations of the operands in base-2 or base-10, the difficulty δ_Multiplication is given by the sum of their non-zero digits. This directly correlates with the number of steps involved to solve the task and the numbers’ precision.

Division. Similar to Multiplication, the difficulty δ_Multiplication is given by ...
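One possible reading of the multiplication difficulty is sketched below. It assumes that "the sum of their non-zero digits" means the number of non-zero digits of each operand, added together, which matches the stated link to the number of solution steps; the paper may define it differently. All function names are hypothetical.

```python
def nonzero_digits_base10(literal: str) -> int:
    """Count the non-zero decimal digits in a fixed-point literal such as '-12.05'."""
    return sum(ch in "123456789" for ch in literal)


def nonzero_digits_base2(n: int) -> int:
    """Count the set bits of an integer (the base-2 analogue, integers only)."""
    return bin(abs(n)).count("1")


def multiplication_difficulty(a: str, b: str) -> int:
    """Assumed difficulty: non-zero digit counts of both operands, summed."""
    return nonzero_digits_base10(a) + nonzero_digits_base10(b)


if __name__ == "__main__":
    # '12.05' has three non-zero digits, '-3.4' has two.
    print(multiplication_difficulty("12.05", "-3.4"))  # 5
    print(nonzero_digits_base2(10))  # 0b1010 -> 2
```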

[32]

[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))
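To make the pattern above easier to read, the sketch below applies it with Python's `re` module. Its role in the authors' pipeline is not stated in this excerpt, so the helper name `find_numbers` and the sample sentence are purely illustrative.

```python
import re

# The pattern is copied verbatim from the text. It matches, in order of preference:
#   a lone "0" that is not followed by a decimal fraction,
#   a decimal with optional integer part (e.g. "3.14" or ".5"),
#   an integer without leading zeros (e.g. "42").
NUMBER_PATTERN = re.compile(
    r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))"
)


def find_numbers(text: str) -> list:
    """Return every substring of `text` that the pattern matches."""
    return NUMBER_PATTERN.findall(text)


if __name__ == "__main__":
    print(find_numbers("The values are -3.14, 0, 42 and .5"))
    # ['-3.14', '0', '42', '.5']
```

Because all groups are non-capturing, `findall` returns the full matched strings rather than group contents.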

We employ a cosine learning rate scheduler and allocate 10% of the training tokens for warm-up. Additionally, the Muon optimizer uses 300 momentum warm-up steps. Model performance is evaluated every 32 steps for 2 steps on a small validation set, and the sampling ratio is dynamically adjusted based on current results. After training, we select the checkpo...
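The learning rate schedule described above can be illustrated with a small sketch. It assumes a linear warm-up, counts the 10% warm-up in optimizer steps rather than tokens, and decays to a hypothetical `min_lr` of zero; none of these details are given in the text, and the function name `lr_at` is made up for illustration.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.10, min_lr: float = 0.0) -> float:
    """Cosine schedule with a warm-up phase covering `warmup_frac` of training."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warm-up (assumed shape) from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at(s, total_steps=1000, peak_lr=1.0))
```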

[33] [34]

Finally, appending the reciprocal to the encoding improves performance with a negligible performance overhead (Table 10)

yields clear improvements over direct multitask training, especially on arithmetic tasks, confirming its necessity for stable convergence. Finally, appending the reciprocal to the encoding improves performance with a negligible performance overhead (Table 10). Metric Sum Product Concat Zero Pad Weighted Weighted + Sum Min/Max↑Exact match acc 0.999 0.996 0...
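As a rough illustration of what appending the reciprocal to an IEEE 754 bit encoding could look like, consider the sketch below. It assumes 64-bit doubles, most-significant-bit-first ordering, and a reciprocal of 0.0 for a zero input; these are assumptions for the example only, and the function names are hypothetical rather than the authors' implementation.

```python
import struct
from typing import List


def float64_bits(x: float) -> List[int]:
    """Return the 64 IEEE 754 double-precision bits of x, sign bit first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> (63 - i)) & 1 for i in range(64)]


def encode_with_reciprocal(x: float) -> List[int]:
    """Concatenate the bits of x with the bits of 1/x (128 values in total)."""
    # Mapping a zero input to a zero reciprocal is an assumption of this sketch.
    reciprocal = 1.0 / x if x != 0.0 else 0.0
    return float64_bits(x) + float64_bits(reciprocal)


if __name__ == "__main__":
    encoded = encode_with_reciprocal(2.0)
    print(len(encoded))  # 128
    print(encoded[:12])  # sign bit and 11 exponent bits of 2.0
```

The second half of the vector doubles the input width but adds no extra tokens, which is consistent with the small overhead the text reports.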
