Towards Understanding Self-Pretraining for Sequence Classification

Antonio Orvieto; Loredana Zollo; Omar Coser; Paolo Soda

arxiv: 2605.21070 · v1 · pith:KR4H55MInew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-pretraining · attention patterns · transformer · sequence classification · masked reconstruction · proximity interactions · long-range arena

The pith

Self-pretraining lets Transformers learn proximity-biased attention that label supervision misses from random initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates self-pretraining results on sequence classification and isolates why it helps. Ablations show the core problem is not model depth or generalization but the difficulty of learning effective query-key attention patterns under direct label supervision. The authors trace the gains to learning proximity interactions that convert absolute positional encodings into proximity-biased attention scores. A minimal theoretical model demonstrates that label supervision remains locally blind to certain attention-score directions while masked reconstruction can detect them.

Core claim

In the studied Transformer settings for sequence classification, label supervision from random initialization cannot learn useful query-key Attention patterns. Self-pretraining with masked token prediction supplies a signal that reveals proximity interactions, turning absolute positional encodings into proximity-biased Attention scores and thereby reaching better optimization points.

What carries the argument

Learning proximity interactions that turn absolute positional encodings into proximity-biased Attention scores.
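
As an illustration of what "proximity-biased Attention scores" can mean, the sketch below feeds only positional encodings through query and key projections (content set to zero) and compares random projections with a hand-picked aligned pair. Everything specific here is an assumption made for illustration and is not taken from the paper: the sinusoidal encodings, the sizes `length = 64` and `dim = 32`, the choice `W_Q = W_K = 3 * I`, and all helper names. With sinusoidal encodings the dot product of two position vectors depends only on the offset `i - j`, so aligned projections concentrate attention near the diagonal, while random projections carry no such guarantee.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Sinusoidal absolute positional encodings, shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    angles = pos * freqs[None, :]
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_from_positions(pos, w_q, w_k):
    """Attention weights when only positional encodings pass through Q and K."""
    q, k = pos @ w_q, pos @ w_k
    return softmax(q @ k.T / np.sqrt(w_q.shape[1]))

def near_mass(attn, band=2):
    """Average attention weight each query places on keys with |i - j| <= band."""
    n = attn.shape[0]
    i, j = np.indices((n, n))
    return float((attn * (np.abs(i - j) <= band)).sum(axis=1).mean())

rng = np.random.default_rng(0)
length, dim = 64, 32  # hypothetical sizes
pos = sinusoidal_positions(length, dim)

# Random projections: a stand-in for random initialization.
w_q_rand = rng.normal(scale=dim ** -0.5, size=(dim, dim))
w_k_rand = rng.normal(scale=dim ** -0.5, size=(dim, dim))
near_random = near_mass(attention_from_positions(pos, w_q_rand, w_k_rand))

# Hand-picked aligned projections (an assumption, not learned weights):
# scores become proportional to pos_i . pos_j, which depends only on i - j.
w_aligned = 3.0 * np.eye(dim)
near_aligned = near_mass(attention_from_positions(pos, w_aligned, w_aligned))

print(f"near-diagonal mass, random:  {near_random:.3f}")
print(f"near-diagonal mass, aligned: {near_aligned:.3f}")
```

In this toy comparison the aligned projections put most of the attention mass within two positions of the query. The paper describes SPT as learning weights with a similar effect; here the aligned weights are simply set by hand.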

If this is right

Standard supervised training fails to optimize attention patterns that self-pretraining can reach.
Proximity-biased attention is the main driver of the observed performance lift on long-range sequence tasks.
Masked reconstruction provides an optimization signal for attention that the classification loss lacks locally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local blindness may appear in other attention-based models when supervision is sparse or indirect.
Architectures could incorporate explicit proximity regularization to reduce reliance on pretraining.
The theoretical view opens analysis of attention landscapes in terms of detectable versus blind directions.

Load-bearing premise

The ablations and simplified theoretical model correctly isolate learning of proximity interactions as the main source of self-pretraining gains without confounding optimization or generalization factors.

What would settle it

In the theoretical setup, check whether label supervision can reach the same attention-score directions as masked reconstruction or remains confined to a different local optimum.
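
One way to make "locally blind" precise is sketched below. This is an illustrative formalization under assumed notation, not the paper's own statement: $S$ denotes the attention scores at the current parameters, $\Delta$ a perturbation direction in score space, and $\mathcal{L}_{\text{label}}$ and $\mathcal{L}_{\text{mask}}$ the classification and masked-reconstruction losses viewed as functions of $S$.

```latex
% Assumed notation: S \in \mathbb{R}^{L \times L} are attention scores,
% \Delta \in \mathbb{R}^{L \times L} is a direction in score space.
\left.\frac{d}{d\varepsilon}\,
  \mathcal{L}_{\text{label}}(S + \varepsilon \Delta)\right|_{\varepsilon = 0}
  = \langle \nabla_S \mathcal{L}_{\text{label}}(S),\, \Delta \rangle = 0,
\qquad
\langle \nabla_S \mathcal{L}_{\text{mask}}(S),\, \Delta \rangle \neq 0 .
```

Under this reading, the check described above amounts to asking whether the first identity continues to hold along the label-supervised trajectory or only at initialization.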

Figures

Figures reproduced from arXiv: 2605.21070 by Antonio Orvieto, Loredana Zollo, Omar Coser, Paolo Soda.

**Figure 2.** SPT duration ablation. We vary self-pretraining epochs before 100 epochs of finetuning. CIFAR10 and ListOps benefit after only a few epochs; PathFinder needs longer pretraining. [Plot residue from an adjacent figure: panels "CIFAR 10 Performance" and "Pathfinder Performance", accuracy versus model depth; legend begins "From scratch (Train)", remainder truncated.]

**Figure 4.** Performance on the toy task (1-layer model) described in § 4. Train and test accuracies are averaged over 10 random seeds (max-min interval is shown) for different learning rates. SPT consistently outperforms the no-SPT baseline, achieving higher peak accuracy at intermediate learning rates. In …

**Figure 5.** Toy Attention evolution. Left: training loss over iterations for self-pretraining (SPT) and From-Scratch training (SC). Right: Attention matrices (L × L) at initialization and after training. SPT rapidly develops structure during pretraining, exhibiting a proximity bias. Compared to random initialization, this structure can develop into a richer sequence mixer after finetuning (“trained”). Crucial role of …

**Figure 6.** Visualization of Attention components with random initialization (top) and after SPT training (bottom). While random weights fail to recover positional structure, SPT learns weights that effectively undo positional encoding, producing coherent, position-aligned Attention after softmax. Here we set the input content X = 0 and feed only positional embeddings pos through Q/K to isolate the effect of positiona…

**Figure 7.** Visualization of raw QK⊤ (top) and softmax-normalized attention weights (bottom) for a single-layer self-pretrained Transformer. QK⊤ matrices show clear diagonal structure, with noisier patterns for CIFAR10 and more coherent structure for PathFinder. Softmax normalization yields sparse, predominantly diagonal attention. Weights are taken after SPT (before finetuning) for the 1-layer setting described in Tb…

**Figure 8.** Layer-wise parameter displacement across training trajectories. R→SC displacement is consistently smaller than R→SPT and SPT→FT, showing that supervised training from random initialization induces limited movement. MLP layers move substantially more than Attention projections, while Attention layers move little unless initialized from SPT.
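
A layer-wise displacement of the kind summarized in the Figure 8 caption could be computed along the following lines. This is a minimal sketch under assumptions: the exact metric used in the paper is not stated here, so a relative L2 distance per parameter tensor is used as a stand-in, and the checkpoint layout (a dict from parameter name to array) and the parameter names are hypothetical.

```python
import numpy as np

def relative_displacement(start, end, eps=1e-12):
    """Relative L2 displacement ||end - start|| / ||start|| for each named tensor.

    `start` and `end` are dicts mapping parameter names to arrays of equal shape.
    """
    return {
        name: float(np.linalg.norm(end[name] - w) / (np.linalg.norm(w) + eps))
        for name, w in start.items()
    }

# Hypothetical checkpoints: attention weights barely move, MLP weights move a lot.
random_init = {"attn.W_Q": np.ones((4, 4)), "mlp.W_1": np.ones((4, 8))}
after_training = {"attn.W_Q": 1.1 * np.ones((4, 4)), "mlp.W_1": 2.0 * np.ones((4, 8))}

moved = relative_displacement(random_init, after_training)
for name, value in moved.items():
    print(f"{name}: {value:.3f}")
```

Applying the same function to the pairs (random, supervised), (random, SPT) and (SPT, finetuned) would give one number per layer and trajectory, which is the shape of comparison the caption describes.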

**Figure 9.** Delta between ∥Qrandom∥ and ∥Qscratch∥ across layers and components. Norm displacement is largest for MLP All blocks and grows with depth, while normalization layers stay near zero (and even turn slightly negative in deeper layers). Attention projections move noticeably less than MLP blocks, consistent with the smaller-table observations.

**Figure 10.** Illustration of the toy task for a random seed.

**Figure 11.** Complementing …

**Figure 12.** Weight distributions after SPT remain close to random initialization, except for mild broadening toward a Gaussian-like shape. Thus, the main structure is not visible in the marginal distributions of WQ or WK, but in their product and its interaction with positional encodings. [Plot residue from an adjacent figure: accuracy versus learning rate, panel "Train", legend "SPT" and "no SPT"; remainder truncated.]

**Figure 13.** Same setting as …

**Figure 14.** Verification of Proposition 1 on 2 example patterns under randomly sampled tokens.

Original abstract

Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.
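
The masked token prediction objective mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the pipeline of Amos et al. (2024) or of this paper: the mask probability, the dedicated `mask_id`, the vocabulary size, and the function names are all assumptions, and the model is replaced by constant logits so the example stays self-contained. The key point is that the loss is computed only at masked positions and uses the sequence's own tokens as targets, so no labels or external data are needed.

```python
import numpy as np

def mask_tokens(tokens, mask_id, mask_prob, rng):
    """Replace a random subset of tokens with mask_id.

    Returns the corrupted tokens and a boolean array marking masked positions.
    """
    is_masked = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(is_masked, mask_id, tokens)
    return corrupted, is_masked

def masked_reconstruction_loss(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits: (batch, length, vocab); targets and is_masked: (batch, length).
    """
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    n_masked = max(int(is_masked.sum()), 1)
    return float(-(token_ll * is_masked).sum() / n_masked)

rng = np.random.default_rng(0)
vocab, mask_id = 10, 10  # hypothetical: ids 0..9 are data, id 10 is the mask token
tokens = rng.integers(0, vocab, size=(4, 16))
corrupted, is_masked = mask_tokens(tokens, mask_id, mask_prob=0.5, rng=rng)

# Stand-in for a model: constant logits, i.e. a uniform prediction over vocab + 1 ids.
logits = np.zeros((4, 16, vocab + 1))
loss = masked_reconstruction_loss(logits, tokens, is_masked)
print(f"loss with uniform predictions: {loss:.4f}")
```

In a two-phase setup of the kind described above, a loss like this would drive the pretraining phase on the task's own inputs, after which the same weights would be finetuned with the classification loss.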

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper's main contribution is a set of ablations plus a simplified theory showing that supervised loss can miss useful attention directions that masked reconstruction picks up, especially proximity bias from positional encodings.

The central claim here is that self-pretraining helps mainly because label supervision is locally blind to certain attention-score directions that a masked objective can detect. The ablations point to learning proximity interactions as the practical driver in the LRA setups they study. That lines up with the original Amos et al. result but adds more targeted experiments on attention patterns rather than just reporting the accuracy lift. The theoretical part is a minimal construction that demonstrates the local blindness under a simplified loss, which is a clean way to isolate the idea without claiming it explains the full training run. Those pieces are the actual new content, and the ablations look reasonably systematic for what they set out to test. Credit for trying to move past the black-box improvement and name a mechanism.

The soft spot is the gap between the simplified theory and the real multi-layer, multi-head, AdamW training they use in the experiments. The stress-test note is fair: once a few steps of proximity-biased attention form, the label gradients might stop being blind, so the local analysis may not survive in the non-convex landscape. Without tighter controls or a direct check that the same directions remain invisible after initialization in the full model, the explanation stays suggestive rather than conclusive.

The paper is aimed at people who already work on long-range sequence models and want a mechanistic story for why a cheap pretraining trick works. It is not a broad rethinking of pretraining, but the combination of replication, ablation, and theory is enough to merit a serious referee. I would send it out for review with the expectation that the authors tighten the link between the toy theory and the actual optimization trajectory.

Referee Report

2 major / 2 minor

Summary. The manuscript examines why self-pretraining (SPT) via masked token prediction improves Transformer accuracy on sequence classification tasks from the Long-Range Arena. Replicating Amos et al. (2024), the authors perform systematic ablations and identify that label supervision struggles to learn useful query-key attention patterns from random initialization. Using a minimal setup, they isolate learning proximity interactions (converting absolute positional encodings into proximity-biased attention scores) as a central source of SPT gains. In a simplified theoretical setup, they show that label supervision is locally blind to certain attention-score directions that masked reconstruction can detect.

Significance. If the claims hold, the work supplies a mechanistic account of why supervised training can fail to discover useful attention patterns while a self-supervised objective succeeds, even without external data. The replication, ablations, and simplified theoretical analysis are explicit strengths that could inform initialization strategies and training curricula for attention-based models on long sequences.

major comments (2)

[Simplified theoretical setup] The simplified theoretical setup demonstrates local blindness of label supervision to certain attention directions, but the manuscript does not show that this blindness persists once the model is placed in the full non-convex loss landscape with multiple layers, heads, and the AdamW + cross-entropy dynamics used in the LRA experiments. If informative gradients from labels appear only after proximity-biased attention has already formed, the local-blindness result may be an artifact of the linearised or single-layer analysis rather than a property of the training procedure studied.
[Ablations and minimal setup] The ablations attribute SPT gains primarily to learning proximity interactions, yet the minimal setup does not fully isolate this factor from confounding optimization or generalization effects that appear in the complete multi-head attention model. A direct comparison of gradient norms or attention-score trajectories with and without SPT after the first few epochs would strengthen the causal link.
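
The attention-score trajectories requested in the second major comment could be summarized with a distance profile like the one sketched here. This is a hypothetical diagnostic, not something reported in the manuscript: the function name and the idea of logging one profile per epoch are assumptions. For an attention matrix of shape (L, L) it returns the mean weight at each offset |i - j|, so a proximity bias shows up as mass concentrated at small offsets.

```python
import numpy as np

def distance_profile(attn):
    """Mean attention weight as a function of |i - j| for an (L, L) matrix."""
    n = attn.shape[0]
    i, j = np.indices((n, n))
    dist = np.abs(i - j).ravel()
    sums = np.bincount(dist, weights=attn.ravel(), minlength=n)
    counts = np.bincount(dist, minlength=n)
    return sums / counts

# Two reference cases: no positional preference versus purely diagonal attention.
uniform = np.full((8, 8), 1 / 8)
identity = np.eye(8)
profile_uniform = distance_profile(uniform)
profile_identity = distance_profile(identity)

print("uniform: ", np.round(profile_uniform, 3))
print("identity:", np.round(profile_identity, 3))
```

Logging such a profile after each of the first few epochs, once with SPT initialization and once with supervised training from random initialization, would give a direct view of whether and when proximity structure emerges in each case.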

minor comments (2)

[Notation and setup] Notation for attention scores and positional encodings could be introduced earlier and used consistently across the theoretical and experimental sections to improve readability.
[Abstract] The abstract states the central findings but does not mention the specific LRA tasks or model sizes used in the replication; adding one sentence would help readers assess scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the scope of our contributions. We address each major comment below with clarifications and indicate where revisions will be made.

Referee: [Simplified theoretical setup] The simplified theoretical setup demonstrates local blindness of label supervision to certain attention directions, but the manuscript does not show that this blindness persists once the model is placed in the full non-convex loss landscape with multiple layers, heads, and the AdamW + cross-entropy dynamics used in the LRA experiments. If informative gradients from labels appear only after proximity-biased attention has already formed, the local-blindness result may be an artifact of the linearised or single-layer analysis rather than a property of the training procedure studied.

Authors: We agree that the theoretical analysis is intentionally simplified to a linearized single-layer setting to analytically isolate the local blindness of the supervised loss to certain attention-score directions. This does not constitute a direct simulation of the full non-convex, multi-layer, multi-head dynamics under AdamW. However, the result is presented as a mechanistic illustration of why label supervision may fail to discover proximity-biased patterns from random initialization, which is consistent with the empirical ablations on the full LRA models. We will revise the manuscript to add an explicit limitations paragraph discussing the gap between the simplified analysis and the full training procedure, while emphasizing that the theoretical finding motivates the observed empirical benefits of SPT. revision: partial

Referee: [Ablations and minimal setup] The ablations attribute SPT gains primarily to learning proximity interactions, yet the minimal setup does not fully isolate this factor from confounding optimization or generalization effects that appear in the complete multi-head attention model. A direct comparison of gradient norms or attention-score trajectories with and without SPT after the first few epochs would strengthen the causal link.

Authors: The minimal setup was constructed precisely to remove multi-head and other confounding factors so that proximity interaction learning could be studied in isolation. We acknowledge that direct evidence on early-epoch dynamics in the full model would strengthen causality. We will add new figures in the revised manuscript showing attention-score trajectories (and, where feasible, gradient norm comparisons) over the first few epochs for SPT versus supervised-only training on the LRA tasks. This will provide additional support for the claim that proximity-biased patterns emerge more readily under the self-supervised objective. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on independent ablations and separate theoretical construction

The paper replicates Amos et al. (2024) and performs systematic ablations to isolate the role of learning proximity interactions in attention patterns, then presents a distinct simplified theoretical model showing local blindness of label supervision to certain attention directions. Neither the ablations nor the theoretical analysis reduce by the paper's own equations to quantities fitted on the same data or to self-referential definitions. External citations provide background but carry no load-bearing uniqueness or ansatz that collapses the new results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claims rest on the validity of the ablation design and the assumption that the simplified theoretical model captures the essential dynamics of full Transformer training; no new free parameters or invented entities are introduced.

axioms (1)

[standard math] Standard Transformer attention and absolute positional encoding mechanics
The analysis presupposes the usual query-key attention formulation and sinusoidal or learned absolute position encodings.

Reference graph

Works this paper leans on

202 extracted references · 202 canonical work pages · 16 internal anchors

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page
[2]

2025 American Control Conference (ACC) , pages=

State space models as foundation models: A control theoretic overview , author=. 2025 American Control Conference (ACC) , pages=. 2025 , organization=

work page 2025
[3]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page
[4]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author =. arXiv , booktitle =:2010.11929 , journal =

work page internal anchor Pith review Pith/arXiv arXiv 2010
[6]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , author =. doi:10.48550/arxiv.2205.14135 , editor =. arXiv , booktitle =:2205.14135 , journal =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.14135
[7]

arXiv.org , month =

The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author =. arXiv.org , month =

work page
[8]

International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , journal =

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author =. International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA , journal =

work page 2023
[9]

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models , author =. arXiv.org , month =. arXiv , doi =:2203.15556 , issn =

work page internal anchor Pith review Pith/arXiv arXiv
[10]

arXiv.org , month =

Scaling Laws for Neural Language Models , author =. arXiv.org , month =

work page
[11]

Eda: Explicit text-decoupling and dense alignment for 3d visual grounding,

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture , author =. Computer Vision and Pattern Recognition , month =. doi:10.1109/cvpr52729.2023.01499 , eprint =

work page doi:10.1109/cvpr52729.2023.01499 2023
[12]

doi:10.48550/arxiv.2206.08164 , editor =

Long Range Graph Benchmark , author =. doi:10.48550/arxiv.2206.08164 , editor =. arXiv , booktitle =:2206.08164 , journal =

work page doi:10.48550/arxiv.2206.08164
[13]

International Conference on Machine Learning , year=

The CLRS Algorithmic Reasoning Benchmark , author=. International Conference on Machine Learning , year=

work page
[14]

Neural Information Processing Systems , month =

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , author =. Neural Information Processing Systems , month =

work page
[15]

Nature , author =

Highly accurate protein structure prediction with AlphaFold , author =. Nature , month =. doi:10.1038/s41586-021-03819-2 , issn =

work page doi:10.1038/s41586-021-03819-2
[16]

Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , editor =

Understanding the difficulty of training deep feedforward neural networks , author =. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , editor =

work page
[17]

Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , journal =

A Simple Framework for Contrastive Learning of Visual Representations , author =. Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event , journal =

work page 2020
[18]

Neural Information Processing Systems , month =

HiPPO: Recurrent Memory with Optimal Polynomial Projections , author =. Neural Information Processing Systems , month =

work page
[19]

doi:10.48550/arxiv.2302.06646 , editor =

Simple Hardware-Efficient Long Convolutions for Sequence Modeling , author =. doi:10.48550/arxiv.2302.06646 , editor =. arXiv , booktitle =:2302.06646 , journal =

work page doi:10.48550/arxiv.2302.06646
[20]

arXiv.org , month =

RoFormer: Enhanced Transformer with Rotary Position Embedding , author =. arXiv.org , month =

work page
[21]

NIPS , month =

Attention Is All You Need , author =. NIPS , month =

work page
[22]

doi:10.48550/arxiv.2206.11893 , editor =

On the Parameterization and Initialization of Diagonal State Space Models , author =. doi:10.48550/arxiv.2206.11893 , editor =. arXiv , booktitle =:2206.11893 , journal =

work page doi:10.48550/arxiv.2206.11893
[23]

arXiv.org , month =

RoBERTa: A Robustly Optimized BERT Pretraining Approach , author =. arXiv.org , month =

work page
[24]

International Conference on Machine Learning , pages=

Attention is not all you need: Pure attention loses rank doubly exponentially with depth , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[26]

The Eleventh International Conference on Learning Representations , year=

The Curious Case of Benign Memorization , author=. The Eleventh International Conference on Learning Representations , year=

work page
[27]

Proceedings of the National Academy of Sciences , volume=

Benign overfitting in linear regression , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020
[28]

International Conference on Learning Representations , year=

Sharpness-aware Minimization for Efficiently Improving Generalization , author=. International Conference on Learning Representations , year=

work page
[29]

science , volume=

Reducing the dimensionality of data with neural networks , author=. science , volume=. 2006 , publisher=

work page 2006
[30]

The Journal of physiology , volume=

Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , author=. The Journal of physiology , volume=. 1962 , publisher=

work page 1962
[31]

International Conference on Learning Representations , month =

What Makes Convolutional Models Great on Long Sequence Modeling? , author =. International Conference on Learning Representations , month =. doi:10.48550/arxiv.2210.09298 , eprint =

work page doi:10.48550/arxiv.2210.09298
[33]

arXiv.org , month =

Efficient Long Sequence Modeling via State Space Augmented Transformer , author =. arXiv.org , month =. arXiv , doi =:2212.08136 , issn =

work page arXiv
[34]

International Conference on Learning Representations , month =

Mega: Moving Average Equipped Gated Attention , author =. International Conference on Learning Representations , month =. doi:10.48550/arxiv.2209.10655 , eprint =

work page doi:10.48550/arxiv.2209.10655
[35]

International Conference on Machine Learning , month =

The CLRS Algorithmic Reasoning Benchmark , author =. International Conference on Machine Learning , month =

work page
[36]

International Conference on Learning Representations , month =

Neural Networks and the Chomsky Hierarchy , author =. International Conference on Learning Representations , month =. doi:10.48550/arxiv.2207.02098 , eprint =

work page doi:10.48550/arxiv.2207.02098
[37]

Masked-attention mask transformer for universal image segmenta- tion,in:2022IEEE/CVFConferenceonComputerVisionandPattern Recognition (CVPR), pp

Scaling Vision Transformers , author =. Computer Vision and Pattern Recognition , month =. doi:10.1109/cvpr52688.2022.01179 , eprint =

work page doi:10.1109/cvpr52688.2022.01179 2022
[38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open Foundation and Fine-Tuned Chat Models , author =. arXiv.org , month =. arXiv , doi =:2307.09288 , issn =

work page internal anchor Pith review Pith/arXiv arXiv
[39]

arXiv.org , month =

Are Large-scale Datasets Necessary for Self-Supervised Pre-training? , author =. arXiv.org , month =

work page
[40]

Journal of machine learning research , month =

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author =. Journal of machine learning research , month =

work page
[41]

arXiv preprint arXiv:2310.04418 , year=

Functional interpolation for relative positions improves long context transformers , author=. arXiv preprint arXiv:2310.04418 , year=

work page arXiv
[42]

Neurocomputing , volume=

Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

work page 2024
[43]

Neural Information Processing Systems , month =

Diagonal State Spaces are as Effective as Structured State Spaces , author =. Neural Information Processing Systems , month =

work page
[44]

The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 , journal =

Efficiently Modeling Long Sequences with Structured State Spaces , author =. The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 , journal =

work page 2022
[45]

9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 , journal =

Long Range Arena: A Benchmark for Efficient Transformers , author =. 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 , journal =

work page 2021
[46]

doi:10.18653/v1/2023.acl-long.682 , editor =

Downstream Datasets Make Surprisingly Good Pretraining Corpora , author =. doi:10.18653/v1/2023.acl-long.682 , editor =. arXiv , booktitle =:2209.14389 , journal =

work page doi:10.18653/v1/2023.acl-long.682 2023
[47]

Neural Information Processing Systems , month =

Language Models are Few-Shot Learners , author =. Neural Information Processing Systems , month =

work page
[48]

Computer Vision and Pattern Recognition , month =

Masked Autoencoders Are Scalable Vision Learners , author =. Computer Vision and Pattern Recognition , month =

work page
[49]

North American Chapter of the Association for Computational Linguistics , month =

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author =. North American Chapter of the Association for Computational Linguistics , month =

work page
[50]

Neural Computation , month =

A Fast Learning Algorithm for Deep Belief Nets , author =. Neural Computation , month =. doi:10.1162/neco.2006.18.7.1527 , issn =

work page doi:10.1162/neco.2006.18.7.1527 2006
[51]

ArXiv , year=

A Cookbook of Self-Supervised Learning , author=. ArXiv , year=

work page
[52]

Large Scale Kernel Machines , publisher =

Scaling learning algorithms towards AI , author =. Large Scale Kernel Machines , publisher =

work page
[53]

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , editor =

Deep learning , author =. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , editor =

work page
[54]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=

SoundStream: An End-to-End Neural Audio Codec , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=

work page
[55]

ArXiv , year=

AudioLM: a Language Modeling Approach to Audio Generation , author=. ArXiv , year=

work page
[56]

ArXiv , year=

MusicLM: Generating Music From Text , author=. ArXiv , year=

work page
[57]

A Generalist Agent , author=. Trans. Mach. Learn. Res. , year=

work page
[58]

SCROLLS : Standardized C ompa R ison Over Long Language Sequences

Shaham, Uri and Segal, Elad and Ivgi, Maor and Efrat, Avia and Yoran, Ori and Haviv, Adi and Gupta, Ankit and Xiong, Wenhan and Geva, Mor and Berant, Jonathan and Levy, Omer. SCROLLS : Standardized C ompa R ison Over Long Language Sequences. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022

work page 2022
[59]

ACM Computing Surveys , year=

Efficient Transformers: A Survey , author=. ACM Computing Surveys , year=

work page
[60]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page
[61]

The Eleventh International Conference on Learning Representations , year=

Simplified State Space Layers for Sequence Modeling , author=. The Eleventh International Conference on Learning Representations , year=

work page
[62]

The Eleventh International Conference on Learning Representations (ICLR) , year=

Long Range Language Modeling via Gated State Spaces , author=. The Eleventh International Conference on Learning Representations (ICLR) , year=

work page
[63]

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

Rethinking the Inception Architecture for Computer Vision , author=. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

work page 2016
[64]

Proceedings of the 40th International Conference on Machine Learning , pages =

Robust Speech Recognition via Large-Scale Weak Supervision , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023
[65]

Language Models are Unsupervised Multitask Learners , author=

work page
[66]

Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

[67]

Exphormer: Sparse Transformers for Graphs. International Conference on Machine Learning.

[68]

Relational Attention: Generalizing Transformers for Graph-Structured Tasks. The Eleventh International Conference on Learning Representations.

[69]

S4ND: Modeling images and videos as multidimensional signals with state spaces. Advances in Neural Information Processing Systems.

[70]

Nangia, Nikita and Bowman, Samuel. doi:10.18653/v1/N18-4013

[71]

Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher. Learning Word Vectors for Sentiment Analysis.

[72]

Dragomir R. Radev and Pradeep Muthukrishnan and Vahed Qazvinian and Amjad Abu-Jbara. The

[73]

Drew Linsley and Junkyung Kim and Vijay Veerabadran and Charles Windolf and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units.

[74]

Junkyung Kim and Drew Linsley and Kalpit Thakkar and Thomas Serre. Disentangling neural mechanisms for perceptual grouping. 8th International Conference on Learning Representations.

[75]

Pete Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.

[76]

Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors. The Twelfth International Conference on Learning Representations.

[77]

Improving speaker verification with self-pretrained transformer models. arXiv preprint arXiv:2305.10517.

[78]

Self pre-training with masked autoencoders for medical image classification and segmentation. 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). 2023

[79]

Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguisti... doi:10.18653/v1/2020.acl-main.703. 2020

[80]

Parallelizing Linear Recurrent Neural Nets Over Sequence Length. International Conference on Learning Representations.

[81]

Long Range Arena: A Benchmark for Efficient Transformers. International Conference on Learning Representations.

[82]

HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems.
