Efficient numeracy in language models through single-token number embeddings
Pith reviewed 2026-05-21 21:05 UTC · model grok-4.3
The pith
Representing numbers as single IEEE 754 binary tokens lets small language models perform basic arithmetic nearly perfectly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By mapping every number to a single token via its raw IEEE 754 binary floating-point representation, language models receive structured numerical input that lets even small models discover and apply exact arithmetic rules, achieving near-perfect performance on basic operations without multi-token splits, external tools, or corrections.
What carries the argument
BitTokens, the single-token encoding of numbers that uses their IEEE 754 binary representation to supply the model with compact, structured numerical input for learning arithmetic algorithms.
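To make the bit-level structure concrete, here is a minimal sketch of how a number's IEEE 754 pattern can be read out. It assumes the 64-bit (binary64) format that the referee report mentions; the helper names `float_to_bits` and `split_fields` are illustrative, and how the paper turns these bits into an embedding is not specified on this page.

```python
import struct

def float_to_bits(x: float) -> list[int]:
    """Return the 64 IEEE 754 binary64 bits of x, most significant bit first."""
    # Pack as a big-endian double, then reinterpret the 8 bytes as an unsigned integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> shift) & 1 for shift in range(63, -1, -1)]

def split_fields(bits: list[int]) -> tuple[int, list[int], list[int]]:
    """Split 64 bits into sign (1 bit), exponent (11 bits), and significand (52 bits)."""
    return bits[0], bits[1:12], bits[12:]

# -1.5 is -1.1 (binary) times 2**0, so the sign bit is 1 and the biased exponent is 1023.
sign, exponent, significand = split_fields(float_to_bits(-1.5))
```

Every representable number yields a vector of the same fixed length, which is the property the single-token framing relies on.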
If this is right
- Models require far fewer reasoning tokens for basic calculations, freeing capacity for longer problem sequences.
- Small language models become capable of accurate arithmetic without relying on post-processing or tool calls.
- Numerical tasks can be solved internally rather than through decomposition into multiple tokens.
- The length and complexity of solvable problems increase because each number consumes only one token.
Where Pith is reading between the lines
- The same single-token structure might support learning of more advanced operations such as exponentiation or basic linear algebra if training data includes them.
- Integration with existing tokenizers could allow hybrid models that switch between BitTokens for numbers and standard tokens for text.
- Performance gains may compound in domains like scientific simulation where many numbers appear in sequence.
- Testing the encoding on decoder-only models of varying sizes would reveal whether the benefit scales or saturates.
Load-bearing premise
The raw IEEE 754 binary representation of a number supplies enough internal structure for the model to learn exact arithmetic rules without external tools or multi-token workarounds.
What would settle it
Train a small model on BitTokens and test it on a held-out set of additions involving numbers with eight or more significant digits; systematic carry errors or accuracy below 95 percent would falsify the claim that the encoding enables near-perfect internal arithmetic.
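A test of this kind could be organised along the following lines. This is a hedged sketch: `predict_sum` stands in for whatever model is being evaluated, the operand sampler is hypothetical, and exact match against binary64 addition is an assumed scoring rule rather than the paper's stated metric.

```python
import random

def sample_operand(rng: random.Random, min_digits: int = 8) -> float:
    """Draw a positive decimal whose digit string has at least `min_digits` significant digits."""
    digits = rng.randint(min_digits, 15)
    first, last = rng.randint(1, 9), rng.randint(1, 9)  # non-zero ends keep every digit significant
    middle = rng.randrange(10 ** (digits - 2))
    mantissa = first * 10 ** (digits - 1) + middle * 10 + last
    return mantissa / 10 ** rng.randint(0, digits)

def addition_accuracy(predict_sum, n_cases: int = 1000, seed: int = 0) -> float:
    """Fraction of cases where predict_sum(a, b) equals the binary64 sum a + b."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_cases):
        a, b = sample_operand(rng), sample_operand(rng)
        correct += predict_sum(a, b) == a + b
    return correct / n_cases
```

Under the criterion stated above, a returned value below 0.95 would count against the claim.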
Original abstract
To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes BitTokens, an encoding that maps any real number to a single token via its raw IEEE 754 binary floating-point representation. It argues that standard subword tokenization forces LLMs to expend excessive reasoning tokens on even simple arithmetic, and presents experiments claiming that small language models equipped with BitTokens can internally learn exact algorithms for basic operations (addition, etc.) and achieve near-perfect accuracy.
Significance. If the results demonstrate genuine acquisition of arithmetic algorithms rather than memorization, the approach would offer a practical route to more efficient numerical reasoning inside the model itself, potentially increasing the length and complexity of calculations feasible without external tools. The core idea of leveraging the fixed binary structure of floating-point numbers for token embeddings is simple and directly addresses a documented inefficiency in current LLM tokenizers.
major comments (2)
- [§4] §4 (Experimental Setup): The manuscript provides no information on whether test operands were drawn from ranges, exponents, or mantissa distributions disjoint from the training data. Without this, high accuracy on held-out examples cannot distinguish between learning a general arithmetic procedure and rote association within the learned embedding space or attention patterns, which is load-bearing for the central claim that BitTokens enable internal algorithmic solutions.
- [§5] §5 (Results): The abstract and results sections assert 'nearly perfect' performance but report neither exact error rates, per-operation accuracy tables, baseline comparisons against standard multi-token tokenization, nor ablation studies isolating the contribution of the IEEE 754 bit-pattern embedding. These omissions prevent quantitative evaluation of the claimed improvement.
minor comments (2)
- [§2] The desiderata listed for number encodings in §2 are useful but would benefit from explicit mapping to which properties BitTokens satisfy versus prior methods, ideally in a table.
- [§3] Notation for the BitToken embedding construction (how the 64-bit pattern is turned into a token ID and embedding) should be formalized with an equation or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major point below and describe the revisions we will incorporate to improve clarity and completeness.
- Referee: [§4] §4 (Experimental Setup): The manuscript provides no information on whether test operands were drawn from ranges, exponents, or mantissa distributions disjoint from the training data. Without this, high accuracy on held-out examples cannot distinguish between learning a general arithmetic procedure and rote association within the learned embedding space or attention patterns, which is load-bearing for the central claim that BitTokens enable internal algorithmic solutions.
  Authors: We agree that explicit documentation of disjoint distributions is essential to substantiate claims of algorithmic learning rather than memorization. The data generation procedure in our experiments did enforce disjoint ranges, exponents, and mantissa distributions between train and test sets, but this was insufficiently detailed in the original manuscript. In the revision we will expand §4 with a precise description of the sampling method, including the specific ranges, exponent bounds, and mantissa constraints used to guarantee disjointness. This addition directly addresses the concern and strengthens the evidence for generalization.
  Revision: yes
- Referee: [§5] §5 (Results): The abstract and results sections assert 'nearly perfect' performance but report neither exact error rates, per-operation accuracy tables, baseline comparisons against standard multi-token tokenization, nor ablation studies isolating the contribution of the IEEE 754 bit-pattern embedding. These omissions prevent quantitative evaluation of the claimed improvement.
  Authors: We concur that the current presentation would benefit from greater quantitative rigor. The manuscript will be revised to include exact per-operation error rates, full accuracy tables, direct baseline comparisons against standard multi-token tokenization, and ablation experiments that isolate the contribution of the IEEE 754 bit-pattern embedding. These results will be added to §5 (and referenced in the abstract) to enable precise evaluation of the claimed gains.
  Revision: yes
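The disjointness discussed in the first exchange can be illustrated with a small sampler that keeps train and test operands in separate binary-exponent ranges. The ranges and function names below are hypothetical; the actual bounds used by the authors are not given on this page.

```python
import math
import random

TRAIN_EXPONENTS = range(-10, 0)  # hypothetical: binary exponents seen in training
TEST_EXPONENTS = range(0, 10)    # hypothetical: held-out binary exponents

def sample(rng: random.Random, exponents: range) -> float:
    """Draw m * 2**e with m in [1, 2), so the binary exponent of the result is e."""
    e = rng.choice(list(exponents))
    m = 1.0 + rng.getrandbits(52) / 2 ** 52  # exact 52-bit significand, never rounds up to 2.0
    return math.ldexp(m, e)

def binary_exponent(x: float) -> int:
    """Exponent e such that |x| = m * 2**e with m in [1, 2)."""
    # math.frexp normalises to [0.5, 1), hence the shift by one.
    return math.frexp(x)[1] - 1
```

A check such as `binary_exponent(x) in TRAIN_EXPONENTS` over the test set would then expose any leakage between the two splits.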
Circularity Check
No circularity: empirical encoding proposal with independent experimental validation
The paper introduces BitTokens as a single-token encoding based on raw IEEE 754 bit patterns, lists desiderata for number encodings, and reports experimental results showing small models achieve near-perfect accuracy on basic arithmetic tasks. No equations, predictions, or central claims reduce by construction to fitted parameters, self-definitions, or self-citation chains. The derivation chain consists of a proposed representation followed by direct empirical measurement against held-out arithmetic examples; results are not tautological with the input encoding. This is the expected non-finding for an empirical methods paper whose claims rest on observable performance rather than internal redefinitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: IEEE 754 binary floating-point representation can be directly used as token embeddings for numerical values.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery from Law of Logic · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
BitTokens uses ... IEEE 754 ... sign, exponent, and significand ... bit-wise arithmetic over Z2 results in coefficient-wise operations reducing to Boolean gates: (x·y) mod 2 = x∧y, (x+y) mod 2 = x⊕y
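The two identities quoted in this passage can be checked exhaustively over single bits. The sketch below also includes a half adder as one illustration of how such gates combine into addition; the half adder is an added example, not something the quoted passage describes.

```python
from itertools import product

def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum_bit, carry_bit) for two input bits using only XOR and AND."""
    return x ^ y, x & y

for x, y in product((0, 1), repeat=2):
    # Multiplication mod 2 is AND; addition mod 2 is XOR.
    assert (x * y) % 2 == (x & y)
    assert (x + y) % 2 == (x ^ y)
    # The carry and sum bits together reconstruct the integer sum.
    s, c = half_adder(x, y)
    assert 2 * c + s == x + y
```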
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
  DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
- A Triadic Suffix Tokenization Scheme for Numerical Reasoning
  Triadic Suffix Tokenization groups digits into triads with fixed magnitude suffixes to make order-of-magnitude relationships explicit at the token level for LLMs.
A PREPRINT - OCTOBER 9, 2025
A Benchmarking dataset
A.1 Tasks
A.1.1 Comparing numbers
The most elementary numeracy task is comparing two random numbers, $n_1$ and $n_2$. As this task is trivial for modern LLMs, we increase the difficulty to multiple operands.
Determining the minimum and maximum
Determining the minimum or maximum of a list of numbers is in essen...
We then round the operands by $p_1$ and $p_2$ respectively and randomly swap operands. We choose the individual signs using the following schema: (1) both positive in 40% of cases, (2) only one operand negative in 40% of cases, and (3) both negative in 20% of cases. We randomly choose an operator $\mathrm{op} \in \{+, -\}$.
Multiplication
We handle precision of the operands and their s...
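The sign schema and operator choice for addition and subtraction described in this excerpt can be sketched as follows. The function name is illustrative, and picking uniformly which operand is negative in case (2) is an assumption; the excerpt gives only the 40/40/20 split and the two operators.

```python
import random

def sample_signs_and_operator(rng: random.Random) -> tuple[int, int, str]:
    """Return (sign of operand 1, sign of operand 2, operator) following the 40/40/20 schema."""
    r = rng.random()
    if r < 0.4:
        signs = (1, 1)                                      # (1) both positive
    elif r < 0.8:
        signs = (-1, 1) if rng.random() < 0.5 else (1, -1)  # (2) exactly one negative (assumed uniform)
    else:
        signs = (-1, -1)                                    # (3) both negative
    return signs[0], signs[1], rng.choice(["+", "-"])
```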
Multiplication
Given the fixed-point number representations of the operands in base-2 or base-10, the difficulty $\delta_{\text{Multiplication}}$ is given by the sum of their non-zero digits. This directly correlates with the number of steps involved to solve the task and the numbers' precision.
Division
Similar to Multiplication, the difficulty $\delta_{\text{Division}}$ is given by ...
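One way to compute the multiplication difficulty in base 10 is sketched below. It reads "the sum of their non-zero digits" as the total count of non-zero digits across both operands, which is an interpretation and not confirmed by the excerpt; the function names are illustrative, and operands are passed as decimal strings to avoid binary rounding.

```python
from decimal import Decimal

def nonzero_digit_count(x: str) -> int:
    """Count the non-zero digits in the decimal fixed-point representation of x."""
    digits = Decimal(x).as_tuple().digits
    return sum(1 for d in digits if d != 0)

def multiplication_difficulty(a: str, b: str) -> int:
    """Assumed reading: total number of non-zero digits over both operands."""
    return nonzero_digit_count(a) + nonzero_digit_count(b)

# "12.5" has three non-zero digits and "0.04" has one.
difficulty = multiplication_difficulty("12.5", "0.04")
```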
[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))
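The pattern above appears here without its surrounding context. Assuming it is meant to pull a decimal number out of free-form model output (an assumption, not something this page states), the following minimal sketch shows what it accepts when compiled with Python's `re` module. The name `NUMBER` is illustrative.

```python
import re

# Pattern copied verbatim from the line above.
NUMBER = re.compile(r"[-]?(?:(?:0(?!\.[0-9]))|(?:[0-9]*[.][0-9]+)|(?:[1-9][0-9]*))")

# Extract every number-like span from a sentence.
found = NUMBER.findall("The answer is -12.50 and 7")

# Whole-string checks: plain integers and decimals pass,
# integers with leading zeros and scientific notation do not.
accepted = [s for s in ["0", "0.5", ".5", "42", "007", "1e5"] if NUMBER.fullmatch(s)]
```

The three alternatives cover a lone zero, a decimal with an optional integer part, and an integer without leading zeros, in that order.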
We employ a cosine learning rate scheduler and allocate 10% of the training tokens for warm-up. Additionally, the Muon optimizer uses 300 momentum warm-up steps. Model performance is evaluated every 32 steps for 2 steps on a small validation set, and the sampling ratio is dynamically adjusted based on current results. After training, we select the checkpo...
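A cosine schedule with a 10% warm-up can be sketched as below. Several details are assumptions made for illustration: warm-up is linear, it is counted in steps rather than tokens, and the peak and minimum learning rates are placeholder values. The Muon momentum warm-up and the dynamic sampling ratio are not modelled.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 1e-3,
          min_lr: float = 0.0, warmup_frac: float = 0.1) -> float:
    """Learning rate at `step`: linear warm-up, then cosine decay to `min_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly so the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```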
... yields clear improvements over direct multitask training, especially on arithmetic tasks, confirming its necessity for stable convergence. Finally, appending the reciprocal to the encoding improves performance with a negligible performance overhead (Table 10).
Metric Sum Product Concat Zero Pad Weighted Weighted + Sum
Min/Max ↑ Exact match acc 0.999 0.996 0...