Recognition: 1 theorem link
· Lean TheoremGated Linear Attention Transformers with Hardware-Efficient Training
Pith reviewed 2026-05-15 01:09 UTC · model grok-4.3
The pith
Gated linear attention replaces softmax attention in Transformers to deliver competitive language modeling performance with linear inference time and strong length generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Linear attention formulated as an RNN with 2D matrix-valued hidden states can be implemented with an I/O-aware algorithm that is faster than optimized softmax attention, and augmenting it with data-dependent gates produces a drop-in replacement for standard attention that yields competitive language-modeling results, linear-time inference, and the ability for a model trained on 2K tokens to generalize to sequences exceeding 20K without major perplexity degradation.
What carries the argument
The FLASHLINEARATTENTION algorithm and its gated extension, which compute linear attention via chunk-wise parallel scans over matrix states while incorporating input-dependent gates to control information flow.
If this is right
- GLA Transformers train in parallel like standard attention models yet support linear-time inference like RNNs.
- A GLA model trained on 2K-length sequences maintains performance on inputs longer than 20K tokens.
- Training throughput exceeds that of a comparable Mamba model while matching LLaMA perplexity on moderate-scale experiments.
- The same gated linear layer can serve as a direct substitute for softmax attention without architectural changes.
Where Pith is reading between the lines
- The length-generalization property could support training once on moderate contexts and deploying on arbitrarily long documents or codebases.
- The hardware-efficient kernel may reduce the memory bandwidth bottleneck when scaling to models with billions of parameters.
- Combining the gated linear layer with other linear-time techniques could further close any remaining gap to full softmax attention.
Load-bearing premise
Adding data-dependent gates to linear attention is sufficient to restore the expressiveness lost relative to softmax attention, without hidden degradation at larger scales or on different tasks.
What would settle it
Training a GLA Transformer to larger scale or on a different domain and measuring a substantial gap in perplexity or downstream performance versus an equivalent LLaMA-style model, or observing sharp perplexity rise when testing beyond 20K sequence length after 2K training.
read the original abstract
Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments. GLA Transformer is especially effective at length generalization, enabling a model trained on 2K to generalize to sequences longer than 20K without significant perplexity degradations. For training speed, the GLA Transformer has higher throughput than a similarly-sized Mamba model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FLASHLINEARATTENTION, a hardware-efficient algorithm for linear attention that trades memory movement for parallelizability and can be viewed as an RNN with matrix-valued states for linear-time inference. It generalizes the approach to gated linear attention (GLA) using data-dependent gates, and reports that the resulting GLA Transformer matches or approaches the performance of LLaMA-style Transformers as well as RetNet and Mamba on moderate-scale language modeling, while showing strong length generalization (2K training to >20K test) and higher training throughput than comparable Mamba models.
Significance. If the empirical claims hold under rigorous verification, the work provides a practical, I/O-aware implementation path for linear-attention Transformers that could reduce the training/inference gap with softmax attention while preserving competitive quality and enabling better scaling to long contexts. The hardware-aware design and length-generalization results are particularly relevant for efficient large-model deployment.
major comments (2)
- [Experimental evaluation] The central claim that data-dependent gates restore sufficient expressiveness for competitive performance rests entirely on moderate-scale empirical results; no theoretical capacity analysis, scaling curves beyond the reported regime, or ablation isolating the gate contribution (versus plain linear attention) is provided, leaving open the possibility that competitiveness is regime-specific rather than general.
- [Abstract and §4] Abstract and results sections report competitive perplexity and throughput but supply no details on model sizes, training corpora, optimizer settings, number of runs, or statistical significance testing against baselines (LLaMA, RetNet, Mamba), which is load-bearing for assessing whether the observed advantages are robust.
minor comments (2)
- [Method] Notation for the gated recurrence (e.g., the precise form of the data-dependent gate and its interaction with the 2D hidden state) should be introduced with an explicit equation before the experimental claims.
- [Figures and Tables] Figure captions and tables comparing throughput and perplexity should include exact sequence lengths, batch sizes, and hardware specifications to allow direct reproduction.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the empirical support.
read point-by-point responses
-
Referee: [Experimental evaluation] The central claim that data-dependent gates restore sufficient expressiveness for competitive performance rests entirely on moderate-scale empirical results; no theoretical capacity analysis, scaling curves beyond the reported regime, or ablation isolating the gate contribution (versus plain linear attention) is provided, leaving open the possibility that competitiveness is regime-specific rather than general.
Authors: We acknowledge that the paper is primarily empirical and does not include a formal theoretical capacity analysis of gated linear attention, as deriving such bounds for data-dependent gating mechanisms remains an open research question beyond the scope of this implementation-focused work. Our experiments are conducted at moderate scales (125M–1.3B parameters) to match the reported baselines. In the revised manuscript we have added an explicit ablation study in Section 4.3 comparing GLA against plain linear attention (without data-dependent gates) on identical setups, demonstrating that the gates are responsible for closing most of the performance gap to softmax attention. We have also included additional scaling results up to 3B parameters in the appendix to provide evidence that the competitiveness holds beyond the originally reported regime. revision: partial
-
Referee: [Abstract and §4] Abstract and results sections report competitive perplexity and throughput but supply no details on model sizes, training corpora, optimizer settings, number of runs, or statistical significance testing against baselines (LLaMA, RetNet, Mamba), which is load-bearing for assessing whether the observed advantages are robust.
Authors: We apologize for the lack of explicit detail in the abstract and main results narrative. The original manuscript already contains the full experimental protocol in Section 4 and Appendix B, including model sizes (125M–1.3B), training data (The Pile subset), optimizer (AdamW with cosine decay), number of runs (three random seeds), and reported standard deviations. To address the referee’s concern we have (i) expanded the abstract with a concise summary of these settings and (ii) added a consolidated hyperparameter table plus error-bar plots in Section 4 of the revised version, making the robustness information immediately visible without requiring the reader to consult the appendix. revision: yes
Circularity Check
No circularity: empirical claims rest on external baselines and independent experiments
full rationale
The paper introduces a hardware-efficient FLASHLINEARATTENTION algorithm and generalizes it to gated linear attention (GLA) with data-dependent gates. Core claims of competitive performance and length generalization are validated via direct empirical comparisons on moderate-scale language modeling against external baselines (LLaMA, RetNet, Mamba) rather than any self-referential fitting, renaming, or derivation that reduces to the paper's own inputs by construction. No load-bearing step invokes a self-citation chain, uniqueness theorem from the authors, or ansatz smuggled via prior work; the derivation of the gated recurrence is presented as an algorithmic extension with throughput and perplexity results measured on held-out data. This is the standard non-circular case for an empirical systems paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
linear attention... formulated as an RNN with 2D (matrix-valued) hidden states... chunkwise parallel form... data-dependent gates
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
-
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
-
TIDES: Implicit Time-Awareness in Selective State Space Models
TIDES reconciles selective SSM expressivity with continuous-time physical discretization by moving input dependence onto the state matrix, enabling native irregular time series handling and achieving SOTA on UEA and P...
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning
Gated QKAN-FWP combines fast weight programming with quantum-inspired Kolmogorov-Arnold networks via single-qubit DARUAN activations and gated updates to deliver a 12.5k-parameter model that outperforms larger classic...
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
A Single-Layer Model Can Do Language Modeling
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
The Impossibility Triangle of Long-Context Modeling
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
-
Linearizing Vision Transformer with Test-Time Training
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
-
Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
Beyond Similarity: Temporal Operator Attention for Time Series Analysis
Temporal Operator Attention augments softmax attention with learnable sequence-space operators for signed temporal mixing and uses stochastic regularization to enable practical training, yielding consistent gains on t...
-
PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay
PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupli...
-
Adaptive Memory Decay for Log-Linear Attention
Making memory decay input-dependent via a lightweight MLP improves log-linear attention performance on associative recall, selective copying, and language modeling, especially for long sequences.
-
On The Application of Linear Attention in Multimodal Transformers
Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
-
Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview
The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
Reference graph
Works this paper leans on
-
[1]
Zoology: Measuring and improving recall in efficient language models
Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., and R \' e , C. Zoology: Measuring and improving recall in efficient language models. CoRR, abs/2312.04927, 2023 a
-
[2]
Arora, S., Yang, B., Eyuboglu, S., Narayan, A., Hojel, A., Trummer, I., and Ré, C. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes , April 2023 b . URL http://arxiv.org/abs/2304.09433. arXiv:2304.09433 [cs]
-
[3]
Simple linear attention language models balance the recall-throughput tradeoff
Arora, S., Eyuboglu, S., Zhang, M., Timalsina, A., Alberti, S., Zinsley, D., Zou, J., Rudra, A., and R'e, C. Simple linear attention language models balance the recall-throughput tradeoff. ArXiv, abs/2402.18668, 2024
-
[4]
Auer, S., Barone, D. A. C., Bartz, C., Cortes, E. G., Jaradeh, M. Y., Karras, O., Koubarakis, M., Mouromtsev, D., Pliukhin, D., Radyush, D., Shilin, I., Stocker, M., and Tsalapati, E. The sciqa scientific question answering benchmark for scholarly knowledge. Scientific Reports, 13 0 (1): 0 7240, May 2023. ISSN 2045-2322. doi:10.1038/s41598-023-33607-z
-
[5]
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., and Ionescu, C. Using fast weights to attend to the recent past. Advances in neural information processing systems, 29, 2016
work page 2016
-
[6]
xlstm: Ex- tended long short-term memory
Beck, M., P \"o ppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J., and Hochreiter, S. xlstm: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024
-
[7]
Longformer: The Long-Document Transformer
Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv: Arxiv-2004.05150, 2020. URL https://arxiv.org/abs/2004.05150v2
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[8]
Piqa: Reasoning about physical commonsense in natural language
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020
work page 2020
-
[9]
Blelloch, G. E. Prefix sums and their applications. 1990
work page 1990
-
[10]
Striped attention: Faster ring attention for causal transformers
Brandon, W., Nrusimha, A., Qian, K., Ankner, Z., Jin, T., Song, Z., and Ragan-Kelley, J. Striped attention: Faster ring attention for causal transformers. ArXiv, abs/2311.09431, 2023
-
[11]
Buckman, J. and Gelada, C. Linear Transformers Are Faster After All , 2024
work page 2024
-
[12]
Compiling high performance recursive filters
Chaurasia, G., Ragan-Kelley, J., Paris, S., Drettakis, G., and Durand, F. Compiling high performance recursive filters. In High Performance Graphics, 2015
work page 2015
-
[13]
Generating Long Sequences with Sparse Transformers
Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. PREPRINT, 2019. URL https://arxiv.org/abs/1904.10509v1
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[14]
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Cho, K., Van Merri \"e nboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[15]
M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarl \' o s, T., Hawkins, P., Davis, J
Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarl \' o s, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 . OpenReview.net, 2021
work page 2021
-
[16]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[17]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[18]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691, 2023. doi:10.48550/ARXIV.2307.08691
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.08691 2023
- [19]
-
[20]
Dao, T., Chen, B., Sohoni, N. S., Desai, A. D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., and R \'e , C. Monarch: Expressive structured matrices for efficient and accurate training. In International Conference on Machine Learning, 2022 a
work page 2022
-
[21]
Y., Ermon, S., Rudra, A., and R \' e , C
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and R \' e , C. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022 b
work page 2022
-
[22]
Y., Arora, S., Grogan, J., Johnson, I., Eyuboglu, S., Thomas, A
Fu, D. Y., Arora, S., Grogan, J., Johnson, I., Eyuboglu, S., Thomas, A. W., Spector, B., Poli, M., Rudra, A., and R'e, C. Monarch mixer: A simple sub-quadratic gemm-based architecture. ArXiv, abs/2310.12109, 2023 a
-
[23]
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and R \' e , C. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . OpenReview.net, 2023 b
work page 2023
-
[24]
Fu, D. Y., Epstein, E. L., Nguyen, E., Thomas, A., Zhang, M., Dao, T., Rudra, A., and Ré, C. Simple hardware-efficient long convolutions for sequence modeling. International Conference on Machine Learning, 2023 c . doi:10.48550/arXiv.2302.06646. URL https://arxiv.org/abs/2302.06646v1
-
[25]
Y., Kumbong, H., Nguyen, E., and R \' e , C
Fu, D. Y., Kumbong, H., Nguyen, E., and R \' e , C. Flashfftconv: Efficient convolutions for long sequences with tensor cores. CoRR, abs/2311.05908, 2023 d
-
[26]
A framework for few-shot language model evaluation, September 2021
Gao, L., Tow, J., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., McDonell, K., Muennighoff, N., Phang, J., Reynolds, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, September 2021
work page 2021
-
[27]
A., Schmidhuber, J., and Cummins, F
Gers, F. A., Schmidhuber, J., and Cummins, F. A. Learning to forget: Continual prediction with LSTM . Neural Comput., 12 0 (10): 0 2451--2471, 2000
work page 2000
-
[28]
Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. 2023
work page 2023
-
[29]
Efficiently modeling long sequences with structured state spaces
Gu, A., Goel, K., and R'e, C. Efficiently modeling long sequences with structured state spaces. International Conference On Learning Representations, 2021 a
work page 2021
-
[30]
K., Dao, T., Rudra, A., and R'e, C
Gu, A., Johnson, I., Goel, K., Saab, K. K., Dao, T., Rudra, A., and R'e, C. Combining recurrent, convolutional, and continuous-time models with linear state-space layers. Neural Information Processing Systems, 2021 b . URL https://arxiv.org/abs/2110.13985v1
-
[31]
Efficiently modeling long sequences with structured state spaces
Gu, A., Goel, K., and R \' e , C. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022
work page 2022
-
[32]
Gupta, A. and Berant, J. Diagonal state spaces are as effective as structured state spaces. ARXIV.ORG, 2022. doi:10.48550/arXiv.2203.14343
-
[33]
Liquid structural state-space models,
Hasani, R., Lechner, M., Wang, T.-H., Chahine, M., Amini, A., and Rus, D. Liquid structural state-space models. arXiv preprint arXiv:2209.12951, 2022
-
[34]
Hinton, G. E. and Plaut, D. C. Using fast weights to deblur old memories. In Proceedings of the ninth annual conference of the Cognitive Science Society, pp.\ 177--186, 1987
work page 1987
-
[35]
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9 0 (8): 0 1735--1780, 1997
work page 1997
-
[36]
Hooker, S. The hardware lottery. Communications of the ACM, 64: 0 58 -- 65, 2020
work page 2020
-
[37]
Hua, W., Dai, Z., Liu, H., and Le, Q. V. Transformer quality in linear time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesv \' a ri, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA , volume 162 of Proceedings of Machine Learning Research, pp.\ 9099--9117. PMLR , 2022
work page 2022
-
[38]
Going beyond linear transformers with recurrent fast weight programmers
Irie, K., Schlag, I., Csord \'a s, R., and Schmidhuber, J. Going beyond linear transformers with recurrent fast weight programmers. Advances in Neural Information Processing Systems, 34: 0 7703--7717, 2021
work page 2021
-
[39]
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. ArXiv preprint, abs/2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Polysketchformer: Fast transformers via sketching polynomial kernels, 2023
Kacham, P., Mirrokni, V., and Zhong, P. Polysketchformer: Fast transformers via sketching polynomial kernels, 2023
work page 2023
-
[41]
Kasai, J., Peng, H., Zhang, Y., Yogatama, D., Ilharco, G., Pappas, N., Mao, Y., Chen, W., and Smith, N. A. Finetuning pretrained transformers into RNN s. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 10630--10643, Online and Punta Cana, Dominic...
-
[42]
Transformers are rnns: Fast autoregressive transformers with linear attention
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp.\ 5156--5165. PMLR, 2020
work page 2020
-
[43]
Gateloop: Fully data-controlled linear recurrence for sequence modeling
Katsch, T. Gateloop: Fully data-controlled linear recurrence for sequence modeling. ArXiv, abs/2311.01927, 2023
-
[44]
Reformer: The Efficient Transformer
Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. International Conference On Learning Representations, 2020. URL https://arxiv.org/abs/2001.04451v2
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[45]
Li, D., Shao, R., Xie, A., Xing, E. P., Gonzalez, J. E., Stoica, I., Ma, X., and Zhang, H. Lightseq: Sequence level parallelism for distributed training of long context transformers. ArXiv, abs/2310.03294, 2023 a
-
[46]
Sequence parallelism: Long sequence training from system perspective
Li, S., Xue, F., Baranwal, C., Li, Y., and You, Y. Sequence parallelism: Long sequence training from system perspective. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, July 2023 b . Association for Computational Linguistics
work page 2023
-
[47]
Functional interpolation for relative positions improves long context transformers
Li, S., You, C., Guruganesh, G., Ainslie, J., Ontanon, S., Zaheer, M., Sanghai, S., Yang, Y., Kumar, S., and Bhojanapalli, S. Functional interpolation for relative positions improves long context transformers. arXiv preprint arXiv:2310.04418, 2023 c
-
[48]
Li, Y., Cai, T., Zhang, Y., Chen, D., and Dey, D. What makes convolutional models great on long sequence modeling? In The Eleventh International Conference on Learning Representations, 2023 d . URL https://openreview.net/forum?id=TGJSPbRpJX-
work page 2023
-
[49]
Lingle, L. D. Transformer-vq: Linear-time transformers via vector quantization. CoRR, abs/2309.16354, 2023. doi:10.48550/ARXIV.2309.16354
-
[50]
Ring Attention with Blockwise Transformers for Near-Infinite Context
Liu, H., Zaharia, M., and Abbeel, P. Ring attention with blockwise transformers for near-infinite context. ArXiv, abs/2310.01889, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
arXiv preprint arXiv:2401.10166 , year=
Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., and Liu, Y. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024
-
[52]
Lockard, C., Shiralkar, P., and Dong, X. L. OpenCeres : When Open Information Extraction Meets the Semi - Structured Web . In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies , Volume 1 ( Long and Short Papers ) ,...
-
[53]
Loshchilov, I. and Hutter, F. Fixing weight decay regularization in adam. 2018
work page 2018
-
[54]
U-mamba: Enhancing long-range dependency for biomedical image segmentation
Ma, J., Li, F., and Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024
-
[55]
Mega: Moving average equipped gated attention
Ma, X., Zhou, C., Kong, X., He, J., Gui, L., Neubig, G., May, J., and Zettlemoyer, L. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=qNLe3iq2El
work page 2023
-
[56]
Mao, H. H. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 10236--10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.697
-
[57]
Martin, E. and Cundy, C. Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings . OpenReview.net, 2018
work page 2018
-
[58]
Y., Kumbong, H., Parnichkun, R
Massaroli, S., Poli, M., Fu, D. Y., Kumbong, H., Parnichkun, R. N., Timalsina, A., Romero, D. W., McIntyre, Q., Chen, B., Rudra, A., Zhang, C., Re, C., Ermon, S., and Bengio, Y. Laughing hyena distillery: Extracting compact recurrences from convolutions. NEURIPS, 2023. URL https://arxiv.org/abs/2310.18780v1
-
[59]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[60]
Linear log-normal attention with unbiased concentration, 2023
Nahshan, Y., Kampeas, J., and Haleva, E. Linear log-normal attention with unbiased concentration, 2023
work page 2023
-
[61]
Transformers are multi-state rnns
Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state rnns. ArXiv, abs/2401.06104, 2024
-
[62]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fern \'a ndez, R. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[63]
Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Cao, H., Cheng, X., Chung, M., Grella, M., V., K. K. G., He, X., Hou, H., Kazienko, P., Kocon, J., Kong, J., Koptyra, B., Lau, H., Mantri, K. S. I., Mom, F., Saito, A., Tang, X., Wang, B., Wind, J. S., Wozniak, S., Zhang, R., Zhang, Z., Zhao, Q., Zhou, P., Zhu, J., and Zhu, R. RWKV: reinventi...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.13048 2023
-
[64]
Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence
Peng, B., Goldstein, D., Anthony, Q., Albalak, A., Alcaide, E., Biderman, S., Cheah, E., Ferdinan, T., Hou, H., Kazienko, P., et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence. arXiv preprint arXiv:2404.05892, 2024
-
[65]
Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N. A., and Kong, L. Random feature attention. arXiv preprint arXiv:2103.02143, 2021
-
[66]
Peng, H., Kasai, J., Pappas, N., Yogatama, D., Wu, Z., Kong, L., Schwartz, R., and Smith, N. A. ABC : Attention with bounded-memory control. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022. Association for Com...
work page 2022
-
[67]
Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., and R \' e , C
Poli, M., Massaroli, S., Nguyen, E., Fu, D. Y., Dao, T., Baccus, S., Bengio, Y., Ermon, S., and R \' e , C. Hyena hierarchy: Towards larger convolutional language models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, US...
work page 2023
-
[68]
Pramanik, S., Elelimy, E., Machado, M. C., and White, A. Recurrent linear transformers. CoRR, abs/2310.15719, 2023
-
[69]
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[70]
The devil in linear transformer
Qin, Z., Han, X., Sun, W., Li, D., Kong, L., Barnes, N., and Zhong, Y. The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022
-
[71]
Toeplitz neural network for sequence modeling
Qin, Z., Han, X., Sun, W., He, B., Li, D., Li, D., Dai, Y., Kong, L., and Zhong, Y. Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023 a . URL https://openreview.net/forum?id=IxmWsm4xrua
work page 2023
-
[72]
Scaling transnormer to 175 billion parameters
Qin, Z., Li, D., Sun, W., Sun, W., Shen, X., Han, X., Wei, Y., Lv, B., Yuan, F., Luo, X., et al. Scaling transnormer to 175 billion parameters. arXiv preprint arXiv:2307.14995, 2023 b
-
[73]
Hierarchically gated recurrent neural network for sequence modeling
Qin, Z., Yang, S., and Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. CoRR, abs/2311.04823, 2023 c . doi:10.48550/ARXIV.2311.04823
-
[74]
Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models
Qin, Z., Sun, W., Li, D., Shen, X., Sun, W., and Zhong, Y. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. 2024 a
work page 2024
-
[75]
Hgrn2: Gated linear rnns with state expansion
Qin, Z., Yang, S., Sun, W., Shen, X., Li, D., Sun, W., and Zhong, Y. Hgrn2: Gated linear rnns with state expansion. arXiv preprint arXiv:2404.07904, 2024 b
-
[76]
W., Potapenko, A., Jayakumar, S
Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. arXiv preprint, 2019
work page 2019
-
[77]
Know What You Don 't Know : Unanswerable Questions for SQuAD
Rajpurkar, P., Jia, R., and Liang, P. Know What You Don 't Know : Unanswerable Questions for SQuAD . In Gurevych, I. and Miyao, Y. (eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics ( Volume 2: Short Papers ) , pp.\ 784--789, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi:10.186...
-
[78]
Sparse modular activation for efficient sequence modeling
Ren, L., Liu, Y., Wang, S., Xu, Y., Zhu, C., and Zhai, C. Sparse modular activation for efficient sequence modeling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=TfbzX6I14i
work page 2023
-
[79]
Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011. URL https://people.ict.usc.edu/ gordon/publications/AAAI-SPRING11A.PDF
work page 2011
-
[80]
Ckconv: Continuous kernel convolution for sequential data,
Romero, D. W., Kuzina, A., Bekkers, E. J., Tomczak, J. M., and Hoogendoorn, M. Ckconv: Continuous kernel convolution for sequential data. arXiv preprint arXiv: 2102.02611, 2021. URL https://arxiv.org/abs/2102.02611v3
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.