FoNE: Precise Single-Token Number Embeddings via Fourier Features

Deqing Fu; Mahdi Soltanolkotabi; Robin Jia; Tianyi Zhou; Vatsal Sharan

arxiv: 2502.09741 · v2 · submitted 2025-02-13 · 💻 cs.CL · cs.LG

Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan

Pith reviewed 2026-05-23 03:04 UTC · model grok-4.3

keywords: number embeddings · Fourier features · single-token representation · arithmetic tasks · large language models · numerical reasoning · token efficiency

The pith

Fourier features let models represent any number as a single token using two embedding dimensions per digit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models normally split numbers across several tokens, forcing the model to reassemble the value during both training and inference. The paper observes that pre-trained models already develop Fourier-like internal features for number tokens and shows these can be extracted and reused directly. FoNE therefore encodes each number as one token whose embedding is built from the corresponding Fourier components, using only two dimensions per digit. On six-digit decimal addition this cuts the training data required for 99 percent accuracy by a factor of 64, while using three times fewer tokens per number than subword embeddings and six times fewer than digit-wise embeddings. It is also the only method reported to reach 100 percent accuracy across more than 100,000 held-out examples for addition, subtraction, and multiplication.

Core claim

FoNE directly maps each scalar number into the embedding space by its Fourier features, producing a fixed single-token representation that requires only two embedding dimensions per digit. Because the representation is complete and non-fragmented, models trained with it converge faster, generalize better on arithmetic, and avoid the aggregation step that multi-token encodings demand.

What carries the argument

Fourier Number Embedding (FoNE), the fixed embedding that re-uses the Fourier-like features observed inside pre-trained LLMs as the direct encoding for each integer.
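A minimal sketch of the construction, assuming the circular embedding ϕ(x, T) = (cos(2πx/T), sin(2πx/T)) and the powers-of-ten periods quoted from the paper in the Lean-theorem excerpts further down this page. The function name `fone_features`, the index range i = 1..num_digits, and the restriction to non-negative integers are illustrative assumptions, not the authors' code.

```python
import math


def fone_features(x: int, num_digits: int) -> list[float]:
    """Return one (cos, sin) pair per period T = 10, 100, ..., 10**num_digits.

    Hypothetical helper: the index range and the integer-only input are
    assumptions made for this sketch.
    """
    features: list[float] = []
    for i in range(1, num_digits + 1):
        period = 10 ** i
        # Reduce modulo the period first so the angle stays in [0, 2*pi).
        angle = 2.0 * math.pi * (x % period) / period
        features.extend((math.cos(angle), math.sin(angle)))
    return features


vec = fone_features(123456, num_digits=6)
print(len(vec))  # 12: two dimensions per digit
```

The vector has a fixed length regardless of the value of `x`, which is what lets the number occupy a single token position. How the pairs are placed inside a full-width model embedding is not described on this page and is left out here.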

If this is right

On six-digit decimal numbers, each number occupies one token instead of the three used by subword schemes or the six used by digit-wise schemes.
Training data volume for 99 percent accuracy on six-digit addition drops by a factor of 64.
Both training and inference run faster because the model never has to aggregate multiple tokens per number.
100 percent accuracy is reached on more than 100,000 test examples for all three arithmetic operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the same Fourier construction works for other continuous quantities, single-token encodings could be applied to units, coordinates, or timestamps without lengthening context.
The success of fixed Fourier embeddings suggests that the numerical competence of LLMs may rest on frequency-based rather than purely positional mechanisms.
Because the embedding is parameter-free once the Fourier basis is chosen, the method could be ported to any transformer without adding trainable parameters for numbers.

Load-bearing premise

The Fourier-like features that appear inside pre-trained LLMs can be pulled out and reused as static embeddings in fresh models without discarding necessary numerical information.

What would settle it

Any new test set of more than 100,000 examples on which FoNE-trained models fall below 100 percent accuracy for addition, subtraction, or multiplication.
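A falsification test of this kind reduces to an exact-match check on freshly sampled problems. The sketch below shows one way to set that up for addition. The names `make_addition_examples` and `exact_match_accuracy` are hypothetical, the `predict` callable stands in for a trained model, and the sampling scheme is an assumption, since the paper's test-set construction is not described on this page.

```python
import random


def make_addition_examples(n: int, digits: int, seed: int) -> list[tuple[int, int, int]]:
    """Sample n distinct (a, b, a + b) problems with operands below 10**digits."""
    rng = random.Random(seed)
    upper = 10 ** digits - 1
    pairs: set[tuple[int, int]] = set()
    while len(pairs) < n:
        pairs.add((rng.randint(0, upper), rng.randint(0, upper)))
    return [(a, b, a + b) for a, b in sorted(pairs)]


def exact_match_accuracy(predict, examples) -> float:
    """Fraction of examples where predict(a, b) equals the target exactly."""
    correct = sum(1 for a, b, target in examples if predict(a, b) == target)
    return correct / len(examples)


examples = make_addition_examples(1000, digits=6, seed=0)
# A perfect oracle stands in for the model here; swap in real predictions.
print(exact_match_accuracy(lambda a, b: a + b, examples))  # 1.0
```

Because the metric is exact match, a single wrong digit in a single example is enough to drop the score below 100 percent, which is what makes the claim easy to refute if it does not hold.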

Figures

Figures reproduced from arXiv: 2502.09741 by Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Tianyi Zhou, Vatsal Sharan.

**Figure 2.** We train Llama-3.2-1B from scratch with random initialization using different number em…

**Figure 3.** Comparison of accuracy trends for various arithmetic tasks with respect to model size and data…

**Figure 4.** (a) Average accuracy of an 8-layer transformer model on 60-digit addition tasks using FoNE for…

**Figure 5.** We train Llama-3.2-1B from scratch with random initialization using different number em…

**Figure 6.** We train Llama-3.2-1B from scratch with random initialization using different number embed…

**Figure 7.** Fourier analysis of the Pythia model’s number embeddings across pre-training checkpoints.

**Figure 8.** Number embedding in Fourier space for different pre-trained models.

**Figure 9.** Accuracy of an 8-layer transformer on 60-digit addition tasks, illustrating the effectiveness of…

**Figure 10.** Heatmaps of accuracy percentages for “FoNE+Abacus” (left column) and “Abacus” (right…

**Figure 11.** We train GPT2-Large from scratch with random initialization using different number em…

**Figure 12.** Comparison of R² trends for 6-digit decimal addition with respect to model size and data size. Panels (model and data size vs. accuracy): (a) 6-digit integer addition; (b) 5-digit integer addition; (c) 5-digit integer subtraction; (d) 3-digit integer multiplication.

**Figure 13.** Comparison of R² trends for various arithmetic tasks with respect to model size and data size.

Original abstract

Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64× less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3× and 6× fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

FoNE offers a concrete single-token Fourier embedding for numbers that cuts token count and data needs, but the 100% accuracy claim on large arithmetic tests needs verification on whether the fixed mapping preserves exact operations.

The main point is that this paper takes Fourier-like features seen inside pre-trained LLMs and turns them into a fixed embedding that puts each number into one token using two dimensions per digit. That construction is new relative to the cited prior work on number tokenization. It reports clear efficiency wins: 64 times less data to hit 99 percent on six-digit addition, plus three to six times fewer tokens than subword or digit baselines. The 100 percent accuracy on more than 100,000 test cases for addition, subtraction, and multiplication is the strongest result they highlight. Those numbers, if they hold, would matter for anyone training models on arithmetic or other structured sequences.

The soft spot is the lack of detail on how the frequencies and phases are chosen and whether the mapping is lossless for carry propagation or cross-digit multiplication when the embeddings stay frozen. The abstract says the method is inspired by the observed features rather than derived from a proof that the chosen dimensions recover exact integer values under arithmetic. Without controls for data leakage or task-specific tuning, it is hard to know how much of the 100 percent result comes from the embedding itself versus other factors in the training setup.

The paper is aimed at people working on numerical reasoning inside transformers or on compact embeddings for structured inputs. A reader who wants to test fixed Fourier-style encodings on their own arithmetic benchmarks would find the construction useful to try. It deserves peer review because the efficiency claims are large enough to check in detail and the method is simple enough to reproduce.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Fourier Number Embedding (FoNE), which encodes each number as a single token using Fourier features with only two embedding dimensions per digit. Inspired by Fourier-like features observed in pre-trained LLMs, the method claims to avoid token fragmentation, reduce computational overhead, and deliver higher accuracy on arithmetic tasks. Key claims include requiring 64× less data than subword or digit-wise baselines to reach 99% accuracy on 6-digit addition while using 3× and 6× fewer tokens, respectively, and being the only method to achieve 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication.

Significance. If the performance claims are reproducible, FoNE would represent a meaningful advance in numerical representation for LLMs by enabling compact, precise single-token embeddings that improve both efficiency and accuracy on arithmetic operations. The reported data-efficiency gains and unique attainment of perfect accuracy on large test sets would be notable contributions to the literature on number handling in language models.

major comments (3)

[Abstract] The headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.
[Abstract] The data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.
[Abstract] The construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.
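One piece of structure bearing on the carry question follows from trigonometry alone, independent of anything the paper proves. Assuming the circular embedding ϕ(x, T) = (cos(2πx/T), sin(2πx/T)) quoted elsewhere on this page, the angle-addition identities give the embedding of a sum in terms of the embeddings of its operands. The symbols a and b are illustrative operands, not notation taken from the paper.

```latex
\begin{aligned}
\cos\!\Big(\tfrac{2\pi}{T}(a+b)\Big) &= \cos\!\Big(\tfrac{2\pi}{T}a\Big)\cos\!\Big(\tfrac{2\pi}{T}b\Big) - \sin\!\Big(\tfrac{2\pi}{T}a\Big)\sin\!\Big(\tfrac{2\pi}{T}b\Big),\\
\sin\!\Big(\tfrac{2\pi}{T}(a+b)\Big) &= \sin\!\Big(\tfrac{2\pi}{T}a\Big)\cos\!\Big(\tfrac{2\pi}{T}b\Big) + \cos\!\Big(\tfrac{2\pi}{T}a\Big)\sin\!\Big(\tfrac{2\pi}{T}b\Big).
\end{aligned}
```

So ϕ(a + b, T) is a fixed bilinear function of ϕ(a, T) and ϕ(b, T). With T = 10^{i+1}, that pair determines (a + b) mod 10^{i+1}, which already includes any carry coming up from the lower digits. This shows only that the information is present in the embeddings. Whether a trained transformer actually uses it is the empirical question these comments raise.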

minor comments (1)

The manuscript provides a GitHub link for code and visualizations but does not include pseudocode, explicit frequency-selection procedure, or embedding-dimension equations in the main text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate to enhance clarity in the abstract and supporting sections.

Referee: [Abstract] The headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.

Authors: We agree the abstract is concise and lacks explicit pointers to these details. The transformer architecture is specified in Section 3.2, training procedure in Section 4.1, and test-set construction (distinct ranges, >100k examples) in Section 4.3. The embeddings are fixed and task-agnostic per Section 3.1; the 100% accuracy reported in Section 5.1 is obtained with these frozen embeddings and empirically confirms support for carry propagation and cross-digit operations. We will revise the abstract to reference these sections and note the empirical verification.
revision: yes

Referee: [Abstract] The data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.

Authors: The baselines (subword BPE and digit-wise) are implemented exactly as described in Section 4.2, using the identical transformer backbone and hyper-parameters across all conditions to match capacity. Data leakage is controlled via non-overlapping train/test splits generated from separate numerical ranges. We will add a brief clarifying clause to the abstract referencing these controls.
revision: yes

Referee: [Abstract] The construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.

Authors: The mapping is formalized in Equation (1) of Section 3.1, with periods set to powers of ten (T = 10^i) so that each successive period resolves one more digit. No closed-form theoretical proof of exact recovery for all arithmetic operations under frozen embeddings is supplied in the manuscript; performance is demonstrated empirically via 100% accuracy on large held-out sets. We will expand the method section with a short rationale for frequency selection and note the empirical nature of the lossless claim.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; FoNE is a proposed direct embedding construction with empirical results.

The paper presents FoNE as an explicit construction that directly maps numbers to fixed Fourier-like features (two dimensions per digit) inspired by prior LLM observations, without any derivation that reduces a claimed prediction or uniqueness result back to parameters fitted on the evaluation data or to a self-citation chain. The reported 100% accuracy on >100k examples for arithmetic tasks is an empirical outcome of training and testing, not a quantity forced by definition or by renaming an input. No load-bearing self-citation, ansatz smuggling, or self-definitional loop is present in the abstract or described method; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that pre-trained LLMs develop Fourier-like internal features for numbers, recorded below as the single domain assumption. The abstract explicitly introduces no free parameters or invented entities.

axioms (1)

domain assumption: Pre-trained LLMs internally learn Fourier-like features for number tokens
This observation is cited as the inspiration for FoNE but is not derived within the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel matches

MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

Definition 3.1 (Circular embedding). Let T be a given period. We define the function ϕ : ℝ → ℝ² by ϕ(x, T) := (cos(2πx/T), sin(2πx/T)). Lemma 3.3: Given the pair (cos(2πx/T), sin(2πx/T)), we can recover x mod T.
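The recovery statement in Lemma 3.3 can be checked numerically. The sketch below embeds a value on the circle and reads the residue back from the angle with `math.atan2`. The helper names are hypothetical, and rounding to the nearest integer is an assumption that holds only for integer inputs.

```python
import math


def circular_embed(x: float, period: float) -> tuple[float, float]:
    """phi(x, T) = (cos(2*pi*x/T), sin(2*pi*x/T))."""
    angle = 2.0 * math.pi * x / period
    return math.cos(angle), math.sin(angle)


def recover_mod(pair: tuple[float, float], period: float) -> float:
    """Recover x mod T from the (cos, sin) pair, up to floating-point error."""
    cos_val, sin_val = pair
    angle = math.atan2(sin_val, cos_val)  # lies in [-pi, pi]
    return ((angle / (2.0 * math.pi)) * period) % period


def recover_int_mod(pair: tuple[float, float], period: int) -> int:
    """Integer variant: round away float noise, then wrap T back to 0."""
    return round(recover_mod(pair, period)) % period


print(recover_int_mod(circular_embed(1234, 10), 10))    # 4
print(recover_int_mod(circular_embed(1234, 100), 100))  # 34
```

The final `% period` in `recover_int_mod` matters at the wrap-around point: a residue of 0 can come back as a value just below T, which rounds to T and must map to 0.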
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Lemma 3.5 (Necessity of different periods): When T becomes very large... one must choose T across a broad range of scales... we choose T as 10^i
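The quoted lemma is elided, so the following is one plausible reading rather than the paper's argument. A small period only determines x mod T and cannot separate numbers that differ by a multiple of T. A large period separates them in principle, but places neighbouring integers very close together on the circle. The sketch computes the Euclidean distance between ϕ(x, T) and ϕ(x + 1, T), which is the chord length 2·sin(π/T) and does not depend on x. The function name `neighbor_gap` is hypothetical.

```python
import math


def neighbor_gap(period: int) -> float:
    """Chord length between phi(x, T) and phi(x + 1, T) on the unit circle."""
    return 2.0 * math.sin(math.pi / period)


for i in range(1, 7):
    print(f"T = 10^{i}: gap = {neighbor_gap(10 ** i):.2e}")
```

The gap shrinks roughly tenfold with each step in i, so a single period of 10^6 would need the model to resolve differences on the order of 10^-6. Combining the periods 10, 100, 1000 and so on keeps a well-separated pair available for every digit position.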

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
cs.AI · 2026-05 · unverdicted · novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
cs.LG · 2026-05 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
cs.AI · 2026-05 · unverdicted · novelty 7.0

Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

Efficient numeracy in language models through single-token number embeddings
cs.LG · 2025-10 · unverdicted · novelty 7.0

BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.

Convergent Evolution: How Different Language Models Learn Similar Number Representations
cs.CL · 2026-04 · unverdicted · novelty 6.0

Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 5 Pith papers · 11 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Improving vision transformers by revisiting high-frequency components

Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, and Wei Liu. Improving vision transformers by revisiting high-frequency components. In European Conference on Computer Vision, pages 1–18. Springer, 2022

work page 2022
[4]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 7889–7901, 2023

work page 2023
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Faith and fate: Limits of trans- formers on compositionality

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, et al. Faith and fate: Limits of trans- formers on compositionality. Advances in Neural Information Processing Systems , 36, 2024. 11

work page 2024
[8]

Large language models on tabular data–a survey

Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, and Christos Faloutsos. Large language models on tabular data–a survey. arXiv e-prints, pages arXiv–2402, 2024

work page 2024
[9]

How numerical precision affects mathematical reasoning capabilities of llms

Guhao Feng, Kai Yang, Yuntian Gu, Xinyue Ai, Shengjie Luo, Jiacheng Sun, Di He, Zhenguo Li, and Liwei Wang. How numerical precision affects mathematical reasoning capabilities of llms. arXiv preprint arXiv:2410.13857, 2024

work page arXiv 2024
[10]

A polar prediction model for learning to represent visual transformations

Pierre- ´Etienne Fiquet and Eero Simoncelli. A polar prediction model for learning to represent visual transformations. Advances in Neural Information Processing Systems , 36, 2024

work page 2024
[11]

Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Matthew Churpek, and Majid Afshar. When raw data prevails: Are large language model embeddings effective in numerical data representation for medical machine learning applications? arXiv preprint arXiv:2408.11854, 2024

work page arXiv 2024
[12]

Cramming: Training a language model on a single gpu in one day

Jonas Geiping and Tom Goldstein. Cramming: Training a language model on a single gpu in one day. In International Conference on Machine Learning , pages 11117–11143. PMLR, 2023

work page 2023
[13]

xval: A continuous number encoding for large language models

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, et al. xval: A continuous number encoding for large language models. arXiv preprint arXiv:2310.02989 , 2023

work page arXiv 2023
[14]

Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic

Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, and Tianyi Zhou. Fourier circuits in neural networks: Unlocking the potential of large language models in mathematical reasoning and modular arithmetic. arXiv preprint arXiv:2402.09469 , 2024

work page arXiv 2024
[15]

Frequency-enhanced data augmentation for vision-and-language navigation

Keji He, Chenyang Si, Zhihe Lu, Yan Huang, Liang Wang, and Xinchao Wang. Frequency-enhanced data augmentation for vision-and-language navigation. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[16]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Learning numeral embeddings

Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu. Learning numeral embeddings. arXiv preprint arXiv:2001.00003 , 2019

work page arXiv 2001
[18]

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Cladder: A benchmark to assess causal reasoning capabilities of language models

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fer- nando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. Cladder: A benchmark to assess causal reasoning capabilities of language models. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[20]

Teach- ing arithmetic to small transformers

Nayoung Lee, Kartik Sreenivasan, Jason D Lee, Kangwook Lee, and Dimitris Papailiopoulos. Teach- ing arithmetic to small transformers. arXiv preprint arXiv:2307.03381 , 2023

work page arXiv 2023
[21]

Taming pre-trained llms for generalised time series forecasting via cross-modal knowledge distillation

Peiyuan Liu, Hang Guo, Tao Dai, Naiqi Li, Jigang Bao, Xudong Ren, Yong Jiang, and Shu-Tao Xia. Taming pre-trained llms for generalised time series forecasting via cross-modal knowledge distillation. arXiv preprint arXiv:2403.07300 , 2024. 12

work page arXiv 2024
[22]

Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data

Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, and Yansong Feng. Are llms capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. arXiv preprint arXiv:2402.17644 , 2024

work page arXiv 2024
[23]

A survey on time-series pre-trained models

Qianli Ma, Zhen Liu, Zhenjing Zheng, Ziyang Huang, Siying Zhu, Zhongzhong Yu, and James T Kwok. A survey on time-series pre-trained models. IEEE Transactions on Knowledge and Data Engineering, 2024

work page 2024
[24]

Transformers can do arithmetic with the right embeddings

Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, et al. Transformers can do arithmetic with the right embeddings. arXiv preprint arXiv:2405.17399 , 2024

work page arXiv 2024
[25]

Benchmarking chatgpt on algorithmic rea- soning

Sean McLeish, Avi Schwarzschild, and Tom Goldstein. Benchmarking chatgpt on algorithmic rea- soning. arXiv preprint arXiv:2404.03441 , 2024

work page arXiv 2024
[26]

Snip: Bridging mathematical symbolic and numeric realms with unified pre-training

Kazem Meidani, Parshin Shojaee, Chandan K Reddy, and Amir Barati Farimani. Snip: Bridging mathematical symbolic and numeric realms with unified pre-training. arXiv preprint arXiv:2310.02227, 2023

work page arXiv 2023
[27]

Locating and editing factual associations in gpt

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems , 35:17359–17372, 2022

work page 2022
[28]

Language models still struggle to zero-shot reason about time series

Mike A Merrill, Mingtian Tan, Vinayak Gupta, Tom Hartvigsen, and Tim Althoff. Language models still struggle to zero-shot reason about time series. arXiv preprint arXiv:2404.11757 , 2024

work page arXiv 2024
[29]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Investigating the limitations of transformers with simple arithmetic tasks

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019 , 2021

work page arXiv 2021
[31]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[32]

An overview of early vision in inceptionv1

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. An overview of early vision in inceptionv1. Distill, 5(4):e00024–002, 2020

work page 2020
[33]

Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997

work page 1997
[34]

Compositional semantic parsing on semi-structured tables

Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages 1470–1480, Beijing, China, July 2015. Association for Computational ...

work page doi:10.3115/v1/ 2015
[35]

Impact of pretraining term frequencies on few-shot reasoning

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206 , 2022

work page arXiv 2022
[36]

Explainable artificial intelligence for tabular data: A survey

Maria Sahakyan, Zeyar Aung, and Talal Rahwan. Explainable artificial intelligence for tabular data: A survey. IEEE access, 9:135392–135422, 2021

work page 2021
[37]

Analysing Mathematical Reasoning Abilities of Neural Models

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical rea- soning abilities of neural models. arXiv preprint arXiv:1904.01557 , 2019. 13

work page internal anchor Pith review Pith/arXiv arXiv 1904
[38]

Positional description matters for transformers arithmetic

Ruoqi Shen, S´ ebastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, and Yi Zhang. Positional description matters for transformers arithmetic. arXiv preprint arXiv:2311.14737 , 2023

[39]

How to leverage digit embeddings to represent numbers?

Jasivan Alex Sivakumar and Nafise Sadat Moosavi. How to leverage digit embeddings to represent numbers? arXiv preprint arXiv:2407.00894, 2024

[40]

Methods for numeracy-preserving word embeddings

Dhanasekar Sundararaman, Shijing Si, Vivek Subramanian, Guoyin Wang, Devamanyu Hazarika, and Lawrence Carin. Methods for numeracy-preserving word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4742–4753, 2020

[41]

Are language models actually useful for time series forecasting?

Mingtian Tan, Mike A Merrill, Vinayak Gupta, Tim Althoff, and Thomas Hartvigsen. Are language models actually useful for time series forecasting? arXiv preprint arXiv:2406.16964, 2024

[42]

Fourier features let networks learn high frequency functions in low dimensional domains

Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020

[43]

Representing numbers in nlp: a survey and a vision

Avijit Thawani, Jay Pujara, Pedro A Szekely, and Filip Ilievski. Representing numbers in nlp: a survey and a vision. arXiv preprint arXiv:2103.13136, 2021

[44]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[45]

Knowledge circuits in pretrained transformers

Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. arXiv preprint arXiv:2405.17969, 2024

[46]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023

[47]

What algorithms can transformers learn? a study in length generalization

Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? a study in length generalization. arXiv preprint arXiv:2310.16028, 2023

[48]

One fits all: Power general time series analysis by pretrained lm

Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, et al. One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing systems, 36:43322–43355, 2023

[49]

Pre-trained large language models use fourier features to compute addition

Tianyi Zhou, Deqing Fu, Vatsal Sharan, and Robin Jia. Pre-trained large language models use fourier features to compute addition. arXiv preprint arXiv:2406.03445, 2024

[50]

Scaling behavior for large language models regarding numeral systems: An example using pythia

Zhejian Zhou, Jiayu Wang, Dahua Lin, and Kai Chen. Scaling behavior for large language models regarding numeral systems: An example using pythia. arXiv preprint arXiv:2409.17391, 2024.

Appendix Roadmap

In Appendix A, we provide the detailed algorithm for computing the final loss and making number predictions. In Appendix B, we present the results of t...
