FoNE: Precise Single-Token Number Embeddings via Fourier Features
Pith reviewed 2026-05-23 03:04 UTC · model grok-4.3
The pith
Fourier features let models represent any number as a single token using two embedding dimensions per digit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FoNE directly maps each scalar number into the embedding space by its Fourier features, producing a fixed single-token representation that requires only two embedding dimensions per digit. Because the representation is complete and non-fragmented, models trained with it converge faster, generalize better on arithmetic, and avoid the aggregation step that multi-token encodings demand.
What carries the argument
Fourier Number Embedding (FoNE), the fixed embedding that re-uses the Fourier-like features observed inside pre-trained LLMs as the direct encoding for each integer.
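A minimal sketch of what such a fixed embedding could look like, based only on the circular embedding and the periods T = 10^i quoted further down this page. The function name `fone_embed`, the restriction to non-negative integers, and the choice of i = 1..num_digits are assumptions made for illustration, not the authors' implementation.

```python
import math


def fone_embed(x: int, num_digits: int) -> list[float]:
    """Hypothetical helper: map a non-negative integer to 2 * num_digits features.

    For each assumed period T = 10**i (i = 1..num_digits) it appends the pair
    (cos(2*pi*x/T), sin(2*pi*x/T)), i.e. two embedding dimensions per digit.
    """
    features: list[float] = []
    for i in range(1, num_digits + 1):
        period = 10 ** i
        # Reduce x modulo the period first; the angle is unchanged and the
        # floating-point argument stays small.
        angle = 2.0 * math.pi * (x % period) / period
        features.extend((math.cos(angle), math.sin(angle)))
    return features


vec = fone_embed(123456, num_digits=6)
print(len(vec))  # 12 features: two per digit for a six-digit number
```

In a real model the vector would still have to be placed into the full embedding width (for example by padding); how that is done is not described on this page.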
If this is right
- Each six-digit number occupies one token instead of the three (subword) or six (digit-wise) tokens that the baseline schemes require.
- Training data volume for 99 percent accuracy on six-digit addition drops by a factor of 64.
- Both training and inference run faster because the model never has to aggregate multiple tokens per number.
- FoNE is reported as the only method that reaches 100 percent accuracy on more than 100,000 test examples for addition, subtraction, and multiplication.
Where Pith is reading between the lines
- If the same Fourier construction works for other continuous quantities, single-token encodings could be applied to units, coordinates, or timestamps without lengthening context.
- The success of fixed Fourier embeddings suggests that the numerical competence of LLMs may rest on frequency-based rather than purely positional mechanisms.
- Because the embedding is parameter-free once the Fourier basis is chosen, the method could be ported to any transformer without adding trainable parameters for numbers.
Load-bearing premise
The Fourier-like features that appear inside pre-trained LLMs can be pulled out and reused as static embeddings in fresh models without discarding necessary numerical information.
What would settle it
Any new test set of more than 100,000 examples on which FoNE-trained models fall below 100 percent accuracy for addition, subtraction, or multiplication.
Original abstract
Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64× less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3× and 6× fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Fourier Number Embedding (FoNE), which encodes each number as a single token using Fourier features with only two embedding dimensions per digit. Inspired by Fourier-like features observed in pre-trained LLMs, the method claims to avoid token fragmentation, reduce computational overhead, and deliver higher accuracy on arithmetic tasks. Key claims include requiring 64× less data than subword or digit-wise baselines to reach 99% accuracy on 6-digit addition while using 3× and 6× fewer tokens, respectively, and being the only method to achieve 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication.
Significance. If the performance claims are reproducible, FoNE would represent a meaningful advance in numerical representation for LLMs by enabling compact, precise single-token embeddings that improve both efficiency and accuracy on arithmetic operations. The reported data-efficiency gains and unique attainment of perfect accuracy on large test sets would be notable contributions to the literature on number handling in language models.
major comments (3)
- [Abstract] Abstract: the headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.
- [Abstract] Abstract: the data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.
- [Abstract] Abstract: the construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.
minor comments (1)
- The manuscript provides a GitHub link for code and visualizations but does not include pseudocode, explicit frequency-selection procedure, or embedding-dimension equations in the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating revisions where appropriate to enhance clarity in the abstract and supporting sections.
- Referee: [Abstract] Abstract: the headline claim that FoNE is the only method yielding 100% accuracy on >100k examples for addition, subtraction, and multiplication is presented without any description of the model architecture, training procedure, test-set construction, or verification that the fixed two-dim-per-digit Fourier embeddings support exact arithmetic (e.g., carry propagation or cross-digit multiplication) when kept frozen.
  Authors: We agree the abstract is concise and lacks explicit pointers to these details. The transformer architecture is specified in Section 3.2, training procedure in Section 4.1, and test-set construction (distinct ranges, >100k examples) in Section 4.3. The embeddings are fixed and task-agnostic per Section 3.1; the 100% accuracy reported in Section 5.1 is obtained with these frozen embeddings and empirically confirms support for carry propagation and cross-digit operations. We will revise the abstract to reference these sections and note the empirical verification.
  Revision: yes
- Referee: [Abstract] Abstract: the data-efficiency claim (64× less data to reach 99% accuracy on 6-digit addition) is stated without specification of the exact baseline implementations, hyper-parameter matching, or controls for data leakage or differences in effective model capacity between FoNE and the subword/digit-wise conditions.
  Authors: The baselines (subword BPE and digit-wise) are implemented exactly as described in Section 4.2, using the identical transformer backbone and hyper-parameters across all conditions to match capacity. Data leakage is controlled via non-overlapping train/test splits generated from separate numerical ranges. We will add a brief clarifying clause to the abstract referencing these controls.
  Revision: yes
- Referee: [Abstract] Abstract: the construction is described as a direct mapping 'inspired by' LLM-internal Fourier features, yet no equation or derivation is supplied showing that the chosen frequencies and phases permit exact integer recovery or lossless arithmetic when the embeddings remain task-agnostic and frozen.
  Authors: The mapping is formalized in Equation (1) of Section 3.1, with periods set to powers of ten (T = 10^i, per Lemma 3.5) so that each digit position is encoded at its own scale. No closed-form theoretical proof of exact recovery for all arithmetic operations under frozen embeddings is supplied in the manuscript; performance is demonstrated empirically via 100% accuracy on large held-out sets. We will expand the method section with a short rationale for frequency selection and note the empirical nature of the lossless claim.
  Revision: partial
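The digit-recovery part of this exchange can be made concrete with a short worked identity. It is a sketch that assumes a non-negative integer x with decimal digits d_0 (least significant) through d_{k-1}, and it uses the periods 10^i stated in the paper's Lemma 3.5 excerpt; the notation d_j and k is introduced here and is not taken from the manuscript.

```latex
x = \sum_{j=0}^{k-1} d_j \, 10^{j}
\quad\Longrightarrow\quad
x \bmod 10^{i+1} = \sum_{j=0}^{i} d_j \, 10^{j},
\qquad
d_i = \left\lfloor \frac{x \bmod 10^{i+1}}{10^{i}} \right\rfloor .
```

Since the pair (cos(2πx/T), sin(2πx/T)) determines x mod T (Lemma 3.3), the pairs for T = 10^1, ..., 10^k together determine every digit of x. This shows only that the encoding is lossless for such integers; it does not show that a model can carry out exact arithmetic on frozen embeddings, which the authors say rests on empirical evidence.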
Circularity Check
No significant circularity; FoNE is a proposed direct embedding construction with empirical results.
The paper presents FoNE as an explicit construction that directly maps numbers to fixed Fourier-like features (two dimensions per digit) inspired by prior LLM observations, without any derivation that reduces a claimed prediction or uniqueness result back to parameters fitted on the evaluation data or to a self-citation chain. The reported 100% accuracy on >100k examples for arithmetic tasks is an empirical outcome of training and testing, not a quantity forced by definition or by renaming an input. No load-bearing self-citation, ansatz smuggling, or self-definitional loop is present in the abstract or described method; the work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained LLMs internally learn Fourier-like features for number tokens.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (matches)
  MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
  Definition 3.1 (Circular embedding). Let T be a given period. We define the function ϕ : R → R² by ϕ(x, T) := (cos(2πx/T), sin(2πx/T)). Lemma 3.3: Given the pair (cos(2πx/T), sin(2πx/T)), we can recover x mod T.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_injective (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Lemma 3.5 (Necessity of different periods): When T becomes very large... one must choose T across a broad range of scales... we choose T as 10^i
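The recovery step in Lemma 3.3 can be illustrated with a small sketch. It assumes integer inputs and uses `atan2` to invert the circular embedding; the helper names `circular_embed`, `recover_mod`, and `digits_from_pairs` are hypothetical, and this is not necessarily how the paper decodes model predictions.

```python
import math


def circular_embed(x: int, period: int) -> tuple[float, float]:
    """(cos, sin) pair of Definition 3.1 for an integer x and period T."""
    angle = 2.0 * math.pi * (x % period) / period
    return math.cos(angle), math.sin(angle)


def recover_mod(cos_v: float, sin_v: float, period: int) -> int:
    """Recover x mod T from the pair, assuming x is an integer."""
    angle = math.atan2(sin_v, cos_v) % (2.0 * math.pi)  # map into [0, 2*pi]
    # Rounding absorbs floating-point error; the final modulo folds T back to 0.
    return round(angle * period / (2.0 * math.pi)) % period


def digits_from_pairs(x: int, num_digits: int) -> list[int]:
    """Read decimal digits (least significant first) from periods 10, 100, ..."""
    digits = []
    for i in range(num_digits):
        period = 10 ** (i + 1)
        residue = recover_mod(*circular_embed(x, period), period)
        digits.append(residue // 10 ** i)
    return digits


print(digits_from_pairs(123456, 6))  # [6, 5, 4, 3, 2, 1]
```

The rounding step is what makes recovery exact for integers here; for very large periods, floating-point precision would eventually limit it, which is a property of this sketch rather than a claim about the paper.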
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
  DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
  Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.
- Efficient numeracy in language models through single-token number embeddings
  BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
- Convergent Evolution: How Different Language Models Learn Similar Number Representations
  Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.