arXiv: 2502.09741 · v2 · submitted 2025-02-13 · 💻 cs.CL · cs.LG

FoNE: Precise Single-Token Number Embeddings via Fourier Features

Pith reviewed 2026-05-23 03:04 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords number embeddings · Fourier features · single-token representation · arithmetic tasks · large language models · numerical reasoning · token efficiency

The pith

Fourier features let models represent any number as a single token using two embedding dimensions per digit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models normally split numbers across several tokens, forcing the model to reassemble the value during both training and inference. The paper observes that pre-trained models already develop Fourier-like internal features for number tokens and shows these can be extracted and reused directly. FoNE therefore encodes each number as one token whose embedding is built from the corresponding Fourier components, using only two dimensions per digit. On six-digit addition this cuts the training data required for 99 percent accuracy by a factor of 64 while using three to six times fewer tokens than prior schemes. It is also the only method reported to reach 100 percent accuracy across more than 100,000 held-out examples for addition, subtraction, and multiplication.

Core claim

FoNE directly maps each scalar number into the embedding space by its Fourier features, producing a fixed single-token representation that requires only two embedding dimensions per digit. Because the representation is complete and non-fragmented, models trained with it converge faster, generalize better on arithmetic, and avoid the aggregation step that multi-token encodings demand.
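
The mapping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes non-negative integers, takes the periods to be 10, 100, ..., 10^m as in the Lemma 3.5 excerpt quoted further down this page, and omits any padding to the model's hidden size. The names `fone_embed` and `num_digits` are hypothetical.

```python
import math

def fone_embed(x: int, num_digits: int) -> list[float]:
    """Sketch: one (cos, sin) pair per digit position, with periods 10**1 .. 10**num_digits."""
    features: list[float] = []
    for i in range(1, num_digits + 1):
        period = 10 ** i
        # Reduce x modulo the period first to limit floating-point error.
        angle = 2.0 * math.pi * (x % period) / period
        features.extend((math.cos(angle), math.sin(angle)))
    return features

vec = fone_embed(123456, num_digits=6)
print(len(vec))  # 12, i.e. two dimensions per digit
```

Under these assumptions the vector length grows linearly with the number of digits, which is the "two embedding dimensions per digit" property stated in the core claim.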

What carries the argument

Fourier Number Embedding (FoNE), the fixed embedding that re-uses the Fourier-like features observed inside pre-trained LLMs as the direct encoding for each integer.

If this is right

  • Each number occupies one token, instead of the three (subword) or six (digit-wise) tokens reported for the 6-digit setting.
  • Training data volume for 99 percent accuracy on six-digit addition drops by a factor of 64.
  • Both training and inference run faster because the model never has to aggregate multiple tokens per number.
  • 100 percent accuracy is reached on more than 100,000 test examples for all three arithmetic operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the same Fourier construction works for other continuous quantities, single-token encodings could be applied to units, coordinates, or timestamps without lengthening context.
  • The success of fixed Fourier embeddings suggests that the numerical competence of LLMs may rest on frequency-based rather than purely positional mechanisms.
  • Because the embedding is parameter-free once the Fourier basis is chosen, the method could be ported to any transformer without adding trainable parameters for numbers.

Load-bearing premise

The Fourier-like features that appear inside pre-trained LLMs can be pulled out and reused as static embeddings in fresh models without discarding necessary numerical information.

What would settle it

Any new test set of more than 100,000 examples on which FoNE-trained models fall below 100 percent accuracy for addition, subtraction, or multiplication.
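
Such a check could be run with an exact-match harness along the following lines. This is a sketch under stated assumptions: `predict` is a hypothetical stand-in for a trained model's addition output, operands are sampled uniformly from non-negative integers of up to `num_digits` digits, and the paper's actual test-set construction may differ.

```python
import random
from typing import Callable

def exact_match_accuracy(predict: Callable[[int, int], int],
                         num_examples: int = 100_000,
                         num_digits: int = 6,
                         seed: int = 0) -> float:
    """Fraction of randomly sampled addition problems that predict answers exactly."""
    rng = random.Random(seed)
    upper = 10 ** num_digits - 1
    correct = 0
    for _ in range(num_examples):
        a, b = rng.randint(0, upper), rng.randint(0, upper)
        correct += predict(a, b) == a + b
    return correct / num_examples

# Hypothetical stand-in predictor: exact addition, so every example matches.
print(exact_match_accuracy(lambda a, b: a + b, num_examples=1000))  # 1.0
```

A single mismatch on a set of more than 100,000 examples would be enough to bring the score below 100 percent.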

Figures

Figures reproduced from arXiv: 2502.09741 by Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Tianyi Zhou, Vatsal Sharan.

Figure 1: (a) We extract all the numbers from the input sequence. (b) For each number, we use FoNE to …
Figure 2: We train Llama-3.2-1B from scratch with random initialization using different number em…
Figure 3: Comparison of accuracy trends for various arithmetic tasks with respect to model size and data …
Figure 4: (a) Average accuracy of an 8-layer transformer model on 60-digit addition tasks using FoNE for …
Figure 5: We train Llama-3.2-1B from scratch with random initialization using different number em…
Figure 6: We train Llama-3.2-1B from scratch with random initialization using different number embed…
Figure 7: Fourier analysis of the Pythia model's number embeddings across pre-training checkpoints.
Figure 8: Number embedding in Fourier space for different pre-trained models.
Figure 9: Accuracy of an 8-layer transformer on 60-digit addition tasks, illustrating the effectiveness of …
Figure 10: Heatmaps of accuracy percentages for "FoNE+Abacus" (left column) and "Abacus" (right …
Figure 11: We train GPT2-Large from scratch with random initialization using different number em…
Figure 12: Comparison of R² trends for 6-digit decimal addition with respect to model size and data size. Panel titles: (a) 6-digit integer addition: Model & Data size vs. Accuracy; (b) 5-digit integer addition: Model & Data size vs. Accuracy; (c) 5-digit integer subtraction: Model & Data size vs. Accuracy; (d) 3-digit integer multiplication: Model & Data size vs. Accuracy.
Figure 13: Comparison of R² trends for various arithmetic tasks with respect to model size and data size.
Original abstract

Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64× less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3× and 6× fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Fourier Number Embedding (FoNE), which encodes each number as a single token using Fourier features with only two embedding dimensions per digit. Inspired by Fourier-like features observed in pre-trained LLMs, the method claims to avoid token fragmentation, reduce computational overhead, and deliver higher accuracy on arithmetic tasks. Key claims include requiring 64× less data than subword or digit-wise baselines to reach 99% accuracy on 6-digit addition while using 3× and 6× fewer tokens, respectively, and being the only method to achieve 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication.

Significance. If the performance claims are reproducible, FoNE would represent a meaningful advance in numerical representation for LLMs by enabling compact, precise single-token embeddings that improve both efficiency and accuracy on arithmetic operations. The reported data-efficiency gains and unique attainment of perfect accuracy on large test sets would be notable contributions to the literature on number handling in language models.

major comments (3)
  1. [Abstract] Abstract: the headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.
  2. [Abstract] Abstract: the data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.
  3. [Abstract] Abstract: the construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.
minor comments (1)
  1. The manuscript provides a GitHub link for code and visualizations but does not include pseudocode, explicit frequency-selection procedure, or embedding-dimension equations in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate to enhance clarity in the abstract and supporting sections.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.

    Authors: We agree the abstract is concise and lacks explicit pointers to these details. The transformer architecture is specified in Section 3.2, training procedure in Section 4.1, and test-set construction (distinct ranges, >100k examples) in Section 4.3. The embeddings are fixed and task-agnostic per Section 3.1; the 100% accuracy reported in Section 5.1 is obtained with these frozen embeddings and empirically confirms support for carry propagation and cross-digit operations. We will revise the abstract to reference these sections and note the empirical verification. revision: yes

  2. Referee: [Abstract] Abstract: the data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.

    Authors: The baselines (subword BPE and digit-wise) are implemented exactly as described in Section 4.2, using the identical transformer backbone and hyper-parameters across all conditions to match capacity. Data leakage is controlled via non-overlapping train/test splits generated from separate numerical ranges. We will add a brief clarifying clause to the abstract referencing these controls. revision: yes

  3. Referee: [Abstract] Abstract: the construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.

    Authors: The mapping is formalized in Equation (1) of Section 3.1, with periods set to powers of ten (T = 10^i) to ensure unique per-digit encoding. No closed-form theoretical proof of exact recovery for all arithmetic operations under frozen embeddings is supplied in the manuscript; performance is demonstrated empirically via 100% accuracy on large held-out sets. We will expand the method section with a short rationale for frequency selection and note the empirical nature of the lossless claim. revision: partial
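
The leakage control mentioned in response 2 (non-overlapping train/test splits drawn from separate numerical ranges) could look roughly like the sketch below. It is one possible reading, not the authors' procedure: the range boundaries, the sample sizes, and the helper name `sample_pairs` are all assumptions made for illustration.

```python
import random

def sample_pairs(low: int, high: int, n: int, seed: int) -> set[tuple[int, int]]:
    """Draw n distinct operand pairs with both operands in [low, high)."""
    rng = random.Random(seed)
    pairs: set[tuple[int, int]] = set()
    while len(pairs) < n:
        pairs.add((rng.randrange(low, high), rng.randrange(low, high)))
    return pairs

# Assumed split: training operands below 500,000, test operands at or above it.
train = sample_pairs(0, 500_000, n=1_000, seed=0)
test = sample_pairs(500_000, 1_000_000, n=1_000, seed=1)

# Disjoint operand ranges guarantee that no test pair appears in training.
print(train.isdisjoint(test))  # True
```

Because the operand ranges do not overlap, the disjointness check holds by construction and serves only as a sanity assertion.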

Circularity Check

0 steps flagged

No significant circularity; FoNE is a proposed direct embedding construction with empirical results.

full rationale

The paper presents FoNE as an explicit construction that directly maps numbers to fixed Fourier-like features (two dimensions per digit) inspired by prior LLM observations, without any derivation that reduces a claimed prediction or uniqueness result back to parameters fitted on the evaluation data or to a self-citation chain. The reported 100% accuracy on >100k examples for arithmetic tasks is an empirical outcome of training and testing, not a quantity forced by definition or by renaming an input. No load-bearing self-citation, ansatz smuggling, or self-definitional loop is present in the abstract or described method; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that pre-trained LLMs develop Fourier-like internal features for numbers; beyond this domain assumption, the abstract introduces no free parameters, further axioms, or invented entities.

axioms (1)
  • domain assumption: Pre-trained LLMs internally learn Fourier-like features for number tokens
    This observation is cited as the inspiration for FoNE but is not derived within the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · matches

    MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

    Definition 3.1 (Circular embedding). Let T be a given period. We define the function ϕ : ℝ → ℝ² by ϕ(x, T) := (cos(2πx/T), sin(2πx/T)). Lemma 3.3: Given the pair (cos(2πx/T), sin(2πx/T)), we can recover x mod T.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Lemma 3.5 (Necessity of different periods): When T becomes very large... one must choose T across a broad range of scales... we choose T as 10^i
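
The two quoted passages can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: it recovers x mod T from a single (cos, sin) pair with `atan2`, then rebuilds a non-negative integer digit by digit from pairs with periods 10, 100, ..., 10^6. The function names are hypothetical. One way to read the need for several periods: adjacent integers are separated by an angle of 360/T degrees, which is 36 degrees at T = 10 but only 0.00036 degrees at T = 10^6, so a single large period leaves neighbouring values very close together in finite precision.

```python
import math

def circular_embed(x: int, period: int) -> tuple[float, float]:
    """phi(x, T) = (cos(2*pi*x/T), sin(2*pi*x/T)), as in the quoted Definition 3.1."""
    angle = 2.0 * math.pi * (x % period) / period  # reduce first to limit float error
    return math.cos(angle), math.sin(angle)

def recover_mod(pair: tuple[float, float], period: int) -> float:
    """Recover x mod T (approximately, in floating point) from one (cos, sin) pair."""
    cos_val, sin_val = pair
    angle = math.atan2(sin_val, cos_val) % (2.0 * math.pi)
    return angle * period / (2.0 * math.pi)

def decode_integer(pairs: list[tuple[float, float]]) -> int:
    """Rebuild x digit by digit; pairs[i - 1] is assumed to use period 10**i."""
    value = 0
    for i, pair in enumerate(pairs, start=1):
        residue = recover_mod(pair, 10 ** i)   # approximately x mod 10**i
        place = 10 ** (i - 1)
        # Subtract the lower digits already decoded, then round to the nearest digit.
        digit = round((residue - value) / place) % 10
        value += digit * place
    return value

x = 123456
pairs = [circular_embed(x, 10 ** i) for i in range(1, 7)]
print(round(recover_mod(pairs[2], 1000)))  # 456
print(decode_integer(pairs))               # 123456
```

Each period 10^i only has to resolve one additional decimal digit given the lower ones, which keeps the rounding step well away from the limits of floating-point precision in this sketch.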

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

  2. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  3. Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

    cs.AI 2026-05 unverdicted novelty 7.0

    Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

  4. Efficient numeracy in language models through single-token number embeddings

    cs.LG 2025-10 unverdicted novelty 7.0

    BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.

  5. Convergent Evolution: How Different Language Models Learn Similar Number Representations

    cs.CL 2026-04 unverdicted novelty 6.0

    Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.
