
arxiv: 2605.21070 · v1 · pith:KR4H55MInew · submitted 2026-05-20 · 💻 cs.LG

Towards Understanding Self-Pretraining for Sequence Classification

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-pretraining · attention patterns · transformer · sequence classification · masked reconstruction · proximity interactions · long-range arena

The pith

Self-pretraining lets Transformers learn proximity-biased attention that label supervision misses from random initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates self-pretraining results on sequence classification and isolates why it helps. Ablations show the core problem is not model depth or generalization but the difficulty of learning effective query-key attention patterns under direct label supervision. The authors trace the gains to learning proximity interactions that convert absolute positional encodings into proximity-biased attention scores. A minimal theoretical model demonstrates that label supervision remains locally blind to certain attention-score directions while masked reconstruction can detect them.

Core claim

In the studied Transformer settings for sequence classification, label supervision from random initialization cannot learn useful query-key Attention patterns. Self-pretraining with masked token prediction supplies a signal that reveals proximity interactions, turning absolute positional encodings into proximity-biased Attention scores and thereby reaching better optimization points.

What carries the argument

Learning proximity interactions that turn absolute positional encodings into proximity-biased Attention scores.
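
As a rough illustration of this mechanism (a sketch under stated assumptions, not the paper's setup): with sinusoidal absolute encodings p_i, the raw score p_i^T W_Q W_K^T p_j depends only on the offset i - j and peaks at i = j when W_Q W_K^T is the identity, while random projections show no such structure. The choice of sinusoidal encodings, the sizes, and all function names below are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positions(L, d):
    """Standard sinusoidal absolute positional encodings, shape (L, d)."""
    pos = np.arange(L)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = pos * freqs[None, :]
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def position_only_scores(P, W_q, W_k):
    """Raw query-key scores when only positions are fed through Q/K."""
    return (P @ W_q) @ (P @ W_k).T

L, d = 32, 64  # hypothetical sizes
P = sinusoidal_positions(L, d)
rng = np.random.default_rng(0)

# Random projections: scores carry no offset structure.
W_q = rng.normal(scale=d ** -0.5, size=(d, d))
W_k = rng.normal(scale=d ** -0.5, size=(d, d))
scores_random = position_only_scores(P, W_q, W_k)

# Aligned projections (identity here, as the simplest case):
# score[i, j] = sum_k cos(freq_k * (i - j)), a function of i - j only.
scores_aligned = position_only_scores(P, np.eye(d), np.eye(d))

def is_offset_only(S):
    """True if S[i, j] equals S[i + 1, j + 1] everywhere (Toeplitz)."""
    return bool(np.allclose(S[:-1, :-1], S[1:, 1:]))

print("random scores depend only on offset: ", is_offset_only(scores_random))
print("aligned scores depend only on offset:", is_offset_only(scores_aligned))
```

The identity product is only the simplest aligned case; a trained model would presumably reach some other W_Q W_K^T with a similar effect, which this sketch does not attempt to reproduce.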

If this is right

  • Standard supervised training fails to optimize attention patterns that self-pretraining can reach.
  • Proximity-biased attention is the main driver of the observed performance lift on long-range sequence tasks.
  • Masked reconstruction provides an optimization signal for attention that the classification loss lacks locally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local blindness may appear in other attention-based models when supervision is sparse or indirect.
  • Architectures could incorporate explicit proximity regularization to reduce reliance on pretraining.
  • The theoretical view opens analysis of attention landscapes in terms of detectable versus blind directions.

Load-bearing premise

The ablations and simplified theoretical model correctly isolate learning of proximity interactions as the main source of self-pretraining gains without confounding optimization or generalization factors.

What would settle it

In the theoretical setup, check whether label supervision can reach the same attention-score directions as masked reconstruction or remains confined to a different local optimum.
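
A check of this kind can be sketched on a hand-built toy. The toy below is not the paper's theoretical setup or its Proposition 1; every modelling choice is an assumption made for illustration: attention scores are parameterised directly as t · D for a circular proximity band D, the classifier is a linear head on mean-pooled attention output, and masked reconstruction predicts each token from the other tokens. Under these assumptions the attention matrix stays doubly stochastic along D, so the mean-pooled classifier cannot see the change, while the reconstruction loss on locally smooth data does.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def proximity_direction(L, width=1):
    """Circular band: 1 where 1 <= circular |i - j| <= width, else 0."""
    idx = np.arange(L)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, L - dist)
    return ((dist >= 1) & (dist <= width)).astype(float)

def classification_loss(t, X, y, w, D):
    """Logistic loss of a linear head on mean-pooled attention output."""
    A = softmax(t * D)                 # t = 0 gives uniform attention
    pooled = (A @ X).mean(axis=1)      # shape (N, d)
    return float(np.mean(np.log1p(np.exp(-y * (pooled @ w)))))

def reconstruction_loss(t, X, D):
    """Squared error when each token is predicted from the other tokens."""
    L = X.shape[1]
    S = np.where(np.eye(L, dtype=bool), -np.inf, t * D)  # hide the target token
    return float(np.mean((softmax(S) @ X - X) ** 2))

def slope_at_zero(f, eps=1e-3):
    """Central finite-difference derivative of f at t = 0."""
    return (f(eps) - f(-eps)) / (2 * eps)

rng = np.random.default_rng(0)
N, L, d = 256, 16, 4                   # hypothetical sizes
noise = rng.normal(size=(N, L, d))
# Locally smooth sequences: circular moving sum of i.i.d. noise.
X = sum(np.roll(noise, k, axis=1) for k in range(-2, 3))
w = rng.normal(size=d)
y = np.sign(X[:, : L // 2].mean(axis=1) @ w)   # arbitrary toy labels
D = proximity_direction(L)

cls_slope = slope_at_zero(lambda t: classification_loss(t, X, y, w, D))
rec_slope = slope_at_zero(lambda t: reconstruction_loss(t, X, D))
print("classification slope along proximity direction:", cls_slope)
print("reconstruction slope along proximity direction:", rec_slope)
```

The flat classification slope here comes from the mean-pooling and linear-head assumptions, so it says nothing about models with nonlinear blocks between attention and pooling; it only shows that such a blind direction can exist while the reconstruction objective still has a nonzero slope along it.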

Figures

Figures reproduced from arXiv: 2605.21070 by Antonio Orvieto, Loredana Zollo, Omar Coser, Paolo Soda.

Figure 1
Figure 1: Summary of training pipelines for classification. SPT is the method we study in this work.
Figure 2
Figure 2: SPT duration ablation. We vary self-pretraining epochs before 100 epochs of finetuning. CIFAR10 and ListOps benefit after only a few epochs, PathFinder needs longer pretraining.
[Plot residue: panels "CIFAR 10 Performance" and "Pathfinder Performance"; x-axis: Model Depth; y-axis: Accuracy; legend: From scratch (Train), From scratch (…]
Figure 4
Figure 4: Performance on the toy task (1-layer model) described in § 4. Train and test accuracies are averaged over 10 random seeds (max-min interval is shown) for different learning rates. SPT consistently outperforms the no-SPT baseline, achieving higher peak accuracy at intermediate learning rates.
Figure 5
Figure 5: Toy Attention evolution. Left: training loss over iterations for self-pretraining (SPT) and From-Scratch training (SC). Right: Attention matrices (L × L) at initialization and after training. SPT rapidly develops structure during pretraining, exhibiting a proximity bias. Compared to random initialization, this structure can develop into a richer sequence mixer after finetuning (“trained”).
Figure 6
Figure 6: Visualization of Attention components with random initialization (top) and after SPT training (bottom). While random weights fail to recover positional structure, SPT learns weights that effectively undo positional encoding, producing coherent, position-aligned Attention after softmax. Here we set the input content X = 0 and feed only positional embeddings pos through Q/K to isolate the effect of positiona…
Figure 7
Figure 7: Visualization of raw QK⊤ (top) and softmax-normalized attention weights (bottom) for a single-layer self-pretrained Transformer. QK⊤ matrices show clear diagonal structure, with noisier patterns for CIFAR10 and more coherent structure for PathFinder. Softmax normalization yields sparse, predominantly diagonal attention. Weights are taken after SPT (before finetuning) for the 1-layer setting described in Tb…
Figure 8
Figure 8: Layer-wise parameter displacement across training trajectories. R→SC displacement is consistently smaller than R→SPT and SPT→FT, showing that supervised training from random initialization induces limited movement. MLP layers move substantially more than Attention projections, while Attention layers move little unless initialized from SPT.
Figure 9
Figure 9: Delta between ∥Qrandom∥ and ∥Qscratch∥ across layers and components. Norm displacement is largest for MLP All blocks and grows with depth, while normalization layers stay near zero (and even turn slightly negative in deeper layers). Attention projections move noticeably less than MLP blocks, consistent with the smaller-table observations.
Figure 10
Figure 10: Illustration of the toy task for a random seed.
Figure 11
Figure 11: Complementing …
Figure 12
Figure 12: Weight distributions after SPT remain close to random initialization, except for mild broadening toward a Gaussian-like shape. Thus, the main structure is not visible in the marginal distributions of WQ or WK, but in their product and its interaction with positional encodings.
[Plot residue: panel "Train"; x-axis: Learning Rate Index (0.0003 to 0.1); y-axis: Accuracy; legend: SPT, no SPT; …]
Figure 13
Figure 13: Same setting as …
Figure 14
Figure 14: Verification of Proposition 1 on 2 example patterns under randomly sampled tokens.
Original abstract

Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines why self-pretraining (SPT) via masked token prediction improves Transformer accuracy on sequence classification tasks from the Long-Range Arena. Replicating Amos et al. (2024), the authors perform systematic ablations and identify that label supervision struggles to learn useful query-key attention patterns from random initialization. Using a minimal setup, they isolate learning proximity interactions (converting absolute positional encodings into proximity-biased attention scores) as a central source of SPT gains. In a simplified theoretical setup, they show that label supervision is locally blind to certain attention-score directions that masked reconstruction can detect.

Significance. If the claims hold, the work supplies a mechanistic account of why supervised training can fail to discover useful attention patterns while a self-supervised objective succeeds, even without external data. The replication, ablations, and simplified theoretical analysis are explicit strengths that could inform initialization strategies and training curricula for attention-based models on long sequences.

major comments (2)
  1. [Simplified theoretical setup] The simplified theoretical setup demonstrates local blindness of label supervision to certain attention directions, but the manuscript does not show that this blindness persists once the model is placed in the full non-convex loss landscape with multiple layers, heads, and the AdamW + cross-entropy dynamics used in the LRA experiments. If informative gradients from labels appear only after proximity-biased attention has already formed, the local-blindness result may be an artifact of the linearised or single-layer analysis rather than a property of the training procedure studied.
  2. [Ablations and minimal setup] The ablations attribute SPT gains primarily to learning proximity interactions, yet the minimal setup does not fully isolate this factor from confounding optimization or generalization effects that appear in the complete multi-head attention model. A direct comparison of gradient norms or attention-score trajectories with and without SPT after the first few epochs would strengthen the causal link.
minor comments (2)
  1. [Notation and setup] Notation for attention scores and positional encodings could be introduced earlier and used consistently across the theoretical and experimental sections to improve readability.
  2. [Abstract] The abstract states the central findings but does not mention the specific LRA tasks or model sizes used in the replication; adding one sentence would help readers assess scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the scope of our contributions. We address each major comment below with clarifications and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Simplified theoretical setup] The simplified theoretical setup demonstrates local blindness of label supervision to certain attention directions, but the manuscript does not show that this blindness persists once the model is placed in the full non-convex loss landscape with multiple layers, heads, and the AdamW + cross-entropy dynamics used in the LRA experiments. If informative gradients from labels appear only after proximity-biased attention has already formed, the local-blindness result may be an artifact of the linearised or single-layer analysis rather than a property of the training procedure studied.

    Authors: We agree that the theoretical analysis is intentionally simplified to a linearized single-layer setting to analytically isolate the local blindness of the supervised loss to certain attention-score directions. This does not constitute a direct simulation of the full non-convex, multi-layer, multi-head dynamics under AdamW. However, the result is presented as a mechanistic illustration of why label supervision may fail to discover proximity-biased patterns from random initialization, which is consistent with the empirical ablations on the full LRA models. We will revise the manuscript to add an explicit limitations paragraph discussing the gap between the simplified analysis and the full training procedure, while emphasizing that the theoretical finding motivates the observed empirical benefits of SPT.

    Revision: partial

  2. Referee: [Ablations and minimal setup] The ablations attribute SPT gains primarily to learning proximity interactions, yet the minimal setup does not fully isolate this factor from confounding optimization or generalization effects that appear in the complete multi-head attention model. A direct comparison of gradient norms or attention-score trajectories with and without SPT after the first few epochs would strengthen the causal link.

    Authors: The minimal setup was constructed precisely to remove multi-head and other confounding factors so that proximity interaction learning could be studied in isolation. We acknowledge that direct evidence on early-epoch dynamics in the full model would strengthen causality. We will add new figures in the revised manuscript showing attention-score trajectories (and, where feasible, gradient norm comparisons) over the first few epochs for SPT versus supervised-only training on the LRA tasks. This will provide additional support for the claim that proximity-biased patterns emerge more readily under the self-supervised objective.

    Revision: yes
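
One simple way to track the attention-score trajectories mentioned in the second response is a scalar proximity summary per checkpoint. The sketch below is an assumption for illustration, not a metric taken from the paper: `mean_attention_distance` is a hypothetical helper that returns the attention-weighted average of |i - j|, so lower values indicate more proximity-biased attention.

```python
import numpy as np

def mean_attention_distance(A):
    """Attention-weighted mean of |i - j| for a row-stochastic (L, L) matrix A."""
    L = A.shape[-1]
    idx = np.arange(L)
    dist = np.abs(idx[:, None] - idx[None, :])
    return float((A * dist).sum(axis=-1).mean())

def distance_trajectory(attention_checkpoints):
    """One value per checkpoint; the checkpoints are assumed to be saved (L, L) maps."""
    return [mean_attention_distance(A) for A in attention_checkpoints]

# Two reference points for reading the numbers (L = 4):
uniform = np.full((4, 4), 0.25)   # no proximity bias
diagonal = np.eye(4)              # each token attends only to itself
print(distance_trajectory([uniform, diagonal]))
```

Comparing such a trajectory for SPT and supervised-only runs over the first few epochs would be one way to make the proposed figures quantitative; how attention maps are extracted from a given model is left open here.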

Circularity Check

0 steps flagged

No circularity: claims rest on independent ablations and separate theoretical construction

full rationale

The paper replicates Amos et al. (2024) and performs systematic ablations to isolate the role of learning proximity interactions in attention patterns, then presents a distinct simplified theoretical model showing local blindness of label supervision to certain attention directions. Neither the ablations nor the theoretical analysis reduce by the paper's own equations to quantities fitted on the same data or to self-referential definitions. External citations provide background but carry no load-bearing uniqueness or ansatz that collapses the new results. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claims rest on the validity of the ablation design and the assumption that the simplified theoretical model captures the essential dynamics of full Transformer training; no new free parameters or invented entities are introduced.

axioms (1)
  • [standard math] Standard Transformer attention and absolute positional encoding mechanics
    The analysis presupposes the usual query-key attention formulation and sinusoidal or learned absolute position encodings.


