pith. machine review for the scientific record.

arxiv: 2605.06683 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Toeplitz matrix · MLP mixer · sequence modeling · transformer alternative · computational efficiency · in-context learning · information retention · invertibility

The pith

Toeplitz MLP Mixers replace attention with triangular-masked Toeplitz matrix multiplication, achieving lower complexity while improving training efficiency and input information retention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Toeplitz MLP Mixer as a sequence model that swaps the quadratic attention mechanism for triangular-masked Toeplitz matrix multiplication over the sequence dimension. This substitution delivers O(dn log n) training time and O(dn) space complexity. Despite omitting sophisticated input modulation or recurrent state, the resulting models reach lower loss per unit of compute and memory. The architecture retains more of the original input, which produces stronger results on copying tasks, information retrieval, and in-context learning benchmarks. Operator-index analysis further shows that trained causal Toeplitz layers are more likely to be invertible than layers explicitly designed to be invertible.

Core claim

TMMs swap attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension, yielding O(dn log n) time and O(dn) space complexity during training and O(dn) time and space at inference prefill. Despite the lack of sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per compute and device memory. TMMs retain more input information, resulting in improved copying ability, which the authors argue results from a lack of architectural biases. Consistent with higher input information retention, TMMs exhibit superior information retrieval and in-context learning benchmark accuracy compared to comparable architectures.

What carries the argument

Triangular-masked Toeplitz matrix multiplication over the sequence dimension, which mixes token information with linearithmic time and linear space while minimizing inductive biases against exact input preservation.
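
Concretely, the mix computes $y_i = \sum_{j \le i} t_{i-j}\, x_j$ per channel: a causal convolution, and therefore computable with the FFT rather than by materializing an $n \times n$ matrix. The sketch below shows a minimal layer of this kind, assuming per-channel mixing with a single shared kernel; the class and parameter names are illustrative, not the authors' code.

```python
# A minimal sketch of causal (triangular-masked) Toeplitz mixing via FFT.
# Assumes one learned coefficient per diagonal, shared across channels;
# names are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class CausalToeplitzMixer(nn.Module):
    def __init__(self, seq_len: int):
        super().__init__()
        # kernel[k] is the k-th sub-diagonal of the lower-triangular
        # Toeplitz matrix; kernel[0] is the main diagonal.
        self.kernel = nn.Parameter(torch.randn(seq_len) / seq_len**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). The causal Toeplitz matmul is a causal
        # convolution, so it costs O(d n log n) via FFT instead of O(d n^2).
        n = x.shape[1]
        m = 2 * n  # zero-pad so the circular convolution becomes a linear one
        k_f = torch.fft.rfft(self.kernel, n=m)
        x_f = torch.fft.rfft(x, n=m, dim=1)
        y = torch.fft.irfft(x_f * k_f.unsqueeze(-1), n=m, dim=1)
        return y[:, :n]  # truncation enforces the triangular (causal) mask

x = torch.randn(2, 512, 64)
print(CausalToeplitzMixer(seq_len=512)(x).shape)  # torch.Size([2, 512, 64])
```

Note that nothing here compresses the input: every token of $x$ enters $y$ through fixed per-offset weights, which is the minimal inductive bias the retention claims lean on.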

If this is right

  • TMMs can handle longer sequences than attention models at the same memory and compute budget.
  • Sequence models can achieve higher copying and retrieval accuracy without explicit mechanisms for preserving input details.
  • Trained causal non-invertible Toeplitz layers tend to become invertible or nearly invertible.
  • Reduced architectural bias can be more effective than added modulation for information-rich tasks.
  • Inference prefill cost drops to linear in sequence length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The invertibility observation may motivate new regularization techniques that encourage or exploit reversibility in linear layers.
  • TMM-style mixing could be tested on algorithmic or symbolic reasoning benchmarks where exact token recall is required (a toy copy-task generator is sketched after this list).
  • The same low-bias linear mixing might improve efficiency in other modalities such as time-series forecasting or graph sequence tasks.
  • If the bias-reduction explanation holds, further simplification of mixing operators could yield additional gains.
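
For the recall-heavy benchmarks in the second bullet, the standard probe is a toy copy task. Below is a minimal, hypothetical generator for one; the vocabulary size, separator convention, and ignore-index are illustrative choices, not the paper's setup.

```python
# A toy copy-task generator: the model sees a random prefix, a separator,
# and must reproduce the prefix. All constants here are illustrative.
import torch

def make_copy_batch(batch=8, prefix_len=64, vocab=256, sep_id=0):
    # Random prefix drawn from {1, ..., vocab-1}; 0 is reserved as separator.
    prefix = torch.randint(1, vocab, (batch, prefix_len))
    sep = torch.full((batch, 1), sep_id)
    seq = torch.cat([prefix, sep, prefix], dim=1)  # teacher-forced sequence
    targets = seq.clone()
    targets[:, : prefix_len + 1] = -100  # score loss only on the copied half
    return seq[:, :-1], targets[:, 1:]   # next-token inputs and targets

x, y = make_copy_batch()
print(x.shape, y.shape)  # torch.Size([8, 128]) torch.Size([8, 128])
```

A model that preserves its input losslessly can drive the scored loss to zero here, which is what makes copying a clean retention probe.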

Load-bearing premise

The performance gains arise specifically from the reduced architectural bias of the Toeplitz structure rather than from differences in training procedure, optimizer, or data distribution.

What would settle it

Training a TMM and a matched transformer on identical data, with the same optimizer and compute budget, and finding that the transformer reaches equal or lower loss per FLOP and equal or higher in-context learning accuracy, would falsify the central efficiency and retention claims.
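
Operationally, that settling experiment is a matched-budget training loop run once per architecture on identical batches. The harness below is a hypothetical skeleton of it; the models, per-step FLOP estimates, and data stream are placeholders, not the paper's code.

```python
# Hypothetical matched-FLOP training harness for the settling experiment.
# model, data_iter, and flops_per_step are placeholders supplied by the caller.
import torch
import torch.nn as nn

def train_to_flop_budget(model, data_iter, flops_per_step, budget, lr=3e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    spent, loss = 0.0, None
    while spent < budget:
        x, y = next(data_iter)             # identical batches for both runs
        logits = model(x)                  # (batch, seq, vocab)
        loss = loss_fn(logits.flatten(0, 1), y.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
        spent += flops_per_step
    return None if loss is None else loss.item()

# e.g. train_to_flop_budget(tmm, stream, tmm_flops, budget=1e15)
#  vs. train_to_flop_budget(transformer, stream, tf_flops, budget=1e15)
```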

Figures

Figures reproduced from arXiv: 2605.06683 by Benjamin L. Badger, Ethan Roland.

Figure 1. TMM architecture examples. (a) Illustration of a single Toeplitz mixer layer and (b) a TMM …
Figure 2. Training Efficiency on FineWeb. (a) Compute- and memory-matched examples of each architecture. …
Figure 3. Information Retention and Capacity. (a) Informational retention experimental design, (b) …
Figure 4. Copy task training efficiency. (a) Copy task depiction and legends: left for (b), center (c), right …
Figure 5. Left: Toeplitz Symbols for trained next-token-prediction TMMs, color-coded per layer. Right: …
original abstract

Transformer-based large language models are in some respects limited by the quadratic time and space computational complexity of attention. We introduce the Toeplitz MLP Mixer (TMM), a transformer-like architecture that swaps attention for triangular-masked Toeplitz matrix multiplication over the sequence dimension resulting in $\mathcal{O} (dn \log n)$ time and $\mathcal O(dn)$ space complexity during training and $\mathcal O(dn)$ time and space at inference prefill. Despite the lack of sophisticated input modulation or state maintenance present in other sub-quadratic architectures, TMMs yield greater training efficiency in terms of loss achieved per compute and device memory. We demonstrate that TMMs are capable of retaining more input information resulting in improved copying ability, which we argue results from a lack of architectural biases. Consistent with higher input information retention, TMMs exhibit superior information retrieval and in-context learning benchmark accuracy compared to comparable architectures. We conclude with an analysis from the perspective of operator index theory and show that, counterintuitively, trained Toeplitz layers of causal non-invertible models are more likely to be invertible or nearly so than models that are actually invertible over their inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Toeplitz MLP Mixer (TMM), a Transformer-like sequence model that replaces attention with triangular-masked Toeplitz matrix multiplication over the sequence dimension. This yields O(dn log n) training time and O(dn) space, with O(dn) inference prefill. Despite lacking input modulation or state maintenance, TMMs are claimed to achieve better loss-per-compute and memory efficiency, retain more input information (improved copying), and deliver higher accuracy on information retrieval and in-context learning benchmarks. The work concludes with an operator-index-theory analysis asserting that trained causal non-invertible Toeplitz layers are more likely to be invertible than explicitly invertible models.

Significance. If the efficiency and information-retention advantages are confirmed under matched training conditions, TMMs would represent a meaningful low-complexity alternative to attention-based models, supporting the broader hypothesis that reduced architectural bias can improve information flow in sequence modeling. The invertibility result, if rigorously grounded, would add an interesting theoretical angle on trained linear operators. The work supplies no machine-checked proofs or parameter-free derivations, but the complexity claims are analytically clear.

major comments (3)
  1. [Experiments / Results] Experimental sections (training curves, benchmark tables): the claims of superior loss-per-compute, memory usage, copying fidelity, and ICL accuracy are attributed to the replacement of attention by triangular-masked Toeplitz multiplication, yet no evidence is provided that optimizer state, learning-rate schedules, initialization, regularization, or data ordering were held identical across TMM, Transformer, and other sub-quadratic baselines. Without these controls the observed gains cannot be credited to the operator itself.
  2. [Operator Index Theory] Invertibility analysis (operator index theory section): the metric used to declare a trained causal Toeplitz layer 'invertible or nearly so' is not shown to be independent of the training objective; if the metric is computed from the same fitted weights that define the model, the counter-intuitive comparison to explicitly invertible models risks circularity and requires an explicit, pre-training definition of the index.
  3. [Method / Complexity Analysis] Complexity claims (introduction and method): while the O(dn log n) training time and O(dn) space are analytically plausible for triangular Toeplitz multiplication, the paper does not report wall-clock or memory measurements on the same hardware with identical batch sizes and sequence lengths, leaving the practical efficiency advantage unverified against strong sub-quadratic baselines.
minor comments (2)
  1. [Method] Notation for the triangular mask and Toeplitz construction should be introduced with an explicit equation before the complexity statements.
  2. [Figures] Figure captions for training curves and benchmark tables should state the exact hyperparameter settings and number of runs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thoughtful review of our manuscript. We have carefully considered each of the major comments and provide point-by-point responses below. Where appropriate, we indicate revisions that will be incorporated in the updated version of the paper.

point-by-point responses
  1. Referee: [Experiments / Results] Experimental sections (training curves, benchmark tables): the claims of superior loss-per-compute, memory usage, copying fidelity, and ICL accuracy are attributed to the replacement of attention by triangular-masked Toeplitz multiplication, yet no evidence is provided that optimizer state, learning-rate schedules, initialization, regularization, or data ordering were held identical across TMM, Transformer, and other sub-quadratic baselines. Without these controls the observed gains cannot be credited to the operator itself.

    Authors: We thank the referee for highlighting this important aspect of experimental rigor. In our experiments, we utilized the same training codebase, data splits, and batch sizes for all models. Learning rates were tuned individually for each architecture to achieve fair comparison, but initialization, regularization, and optimizer states were set according to standard recommendations for each model type. Data ordering was identical as it was determined by the dataset. To make this transparent, we will add a comprehensive experimental setup section detailing all hyperparameters and controls used, including any differences, in the revised manuscript. revision: yes

  2. Referee: [Operator Index Theory] Invertibility analysis (operator index theory section): the metric used to declare a trained causal Toeplitz layer 'invertible or nearly so' is not shown to be independent of the training objective; if the metric is computed from the same fitted weights that define the model, the counter-intuitive comparison to explicitly invertible models risks circularity and requires an explicit, pre-training definition of the index.

    Authors: The operator index is a mathematical property derived from the structure of the Toeplitz matrix and its symbol function, which is independent of how the weights were obtained. The comparison is between the index values of trained TMM layers versus those of models explicitly constructed to be invertible (e.g., via orthogonal initialization or specific constraints). However, we acknowledge the potential for misinterpretation regarding circularity. In the revision, we will include an explicit definition of the index computed on the operator prior to training and provide additional theoretical justification showing why the trained causal non-invertible models exhibit favorable index properties (one plausible symbol-based diagnostic is sketched after these responses). revision: partial

  3. Referee: [Method / Complexity Analysis] Complexity claims (introduction and method): while the O(dn log n) training time and O(dn) space are analytically plausible for triangular Toeplitz multiplication, the paper does not report wall-clock or memory measurements on the same hardware with identical batch sizes and sequence lengths, leaving the practical efficiency advantage unverified against strong sub-quadratic baselines.

    Authors: The complexity analysis in the paper is based on asymptotic big-O notation derived from the properties of fast Toeplitz matrix-vector multiplication using FFT. We agree that empirical validation would strengthen the claims. We will revise the method and experiments sections to include wall-clock time and memory usage benchmarks conducted on identical hardware, with matched batch sizes and sequence lengths, comparing TMM against the Transformer and relevant sub-quadratic baselines (a minimal timing harness of this shape is sketched after these responses). revision: yes
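
On the operator-index point, one plausible form of a weight-independent diagnostic reads the index off the layer's symbol $t(z) = \sum_k t_k z^k$: a zero of the symbol on the unit circle signals non-invertibility, and the Fredholm index of the Toeplitz operator is minus the symbol's winding number. The sketch below is illustrative only, not necessarily the paper's exact metric.

```python
# A sketch of a symbol-based invertibility diagnostic for a causal Toeplitz
# layer. Illustrative; the paper's exact index computation may differ.
import numpy as np

def symbol_diagnostics(kernel: np.ndarray, n_grid: int = 4096):
    # Sample the symbol on a dense grid of the unit circle (zero-padded FFT).
    symbol = np.fft.fft(kernel, n=n_grid)
    # Margin from zero: 0 means the operator is not invertible; small values
    # correspond to "nearly non-invertible" layers.
    margin = np.abs(symbol).min()
    # Winding number of the closed symbol curve about the origin; the
    # Fredholm index of the Toeplitz operator is its negative.
    closed = np.append(symbol, symbol[0])
    phase = np.unwrap(np.angle(closed))
    winding = round((phase[-1] - phase[0]) / (2 * np.pi))
    return margin, -winding

kernel = np.random.randn(512) / np.sqrt(512)  # trained mixer weights would go here
margin, index = symbol_diagnostics(kernel)
print(f"min |symbol| = {margin:.4f}, Fredholm index = {index}")
```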
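On the wall-clock point, a minimal harness of the promised shape might look like the sketch below, timing FFT-based causal Toeplitz mixing against a materialized triangular matmul at identical shapes; sizes and functions are placeholders, not the authors' benchmark.

```python
# A hypothetical wall-clock comparison: FFT Toeplitz mixing vs. a dense
# triangular sequence matmul on identical shapes. CPU timing for simplicity.
import time
import torch

def bench(fn, *args, warmup=3, iters=10):
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

n, d, b = 4096, 256, 4
x = torch.randn(b, n, d)
kernel = torch.randn(n)
tri = torch.tril(torch.randn(n, n))  # materialized triangular mixing matrix

def toeplitz_fft(x, kernel):
    m = 2 * x.shape[1]
    y = torch.fft.irfft(torch.fft.rfft(x, n=m, dim=1)
                        * torch.fft.rfft(kernel, n=m).unsqueeze(-1), n=m, dim=1)
    return y[:, : x.shape[1]]                        # O(d n log n)

def dense_matmul(x, w):
    return torch.einsum("ij,bjd->bid", w, x)         # O(d n^2)

print(f"FFT Toeplitz: {bench(toeplitz_fft, x, kernel) * 1e3:.1f} ms/iter")
print(f"dense matmul: {bench(dense_matmul, x, tri) * 1e3:.1f} ms/iter")
```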

Circularity Check

0 steps flagged

No significant circularity in architectural proposal or empirical claims.

full rationale

The paper introduces the TMM architecture with explicit O(dn log n) training complexity derived from triangular-masked Toeplitz multiplication, then reports empirical outcomes on loss-per-compute, copying fidelity, ICL accuracy, and an operator-index-theory observation on invertibility of trained layers. These results are presented as measurements from experiments rather than quantities obtained by fitting a parameter to the target metric and relabeling it a prediction. No self-citation chain is invoked to justify uniqueness or to close a derivation loop; the invertibility finding is an observed statistical tendency across trained models, not a definitional identity. The central efficiency and information-retention claims rest on external benchmark comparisons and are therefore falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard neural-network training assumptions plus the domain assumption that reduced architectural bias directly causes higher input information retention; no new entities are postulated.

axioms (1)
  • domain assumption Gradient-based optimization on the described architecture converges to models that retain more input information than attention-based models under comparable compute.
    Invoked to explain why TMMs show better copying and in-context performance despite simpler structure.

pith-pipeline@v0.9.0 · 5508 in / 1238 out tokens · 33252 ms · 2026-05-11T01:01:42.450347+00:00 · methodology

