Generating Long Sequences with Sparse Transformers

Rewon Child; Scott Gray; Alec Radford; Ilya Sutskever

arxiv: 1904.10509 · v1 · submitted 2019-04-23 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-10 19:45 UTC · model grok-4.3

keywords sparse transformers, attention mechanisms, long sequence modeling, density modeling, enwik8, cifar-10, imagenet-64, sequence generation

The pith

Sparse factorizations of the attention matrix let transformers model sequences tens of thousands of timesteps long.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces changes to the transformer architecture that address the quadratic growth in time and memory with sequence length. Sparse factorizations reduce attention complexity to O(n√n), while additional modifications allow deeper networks, save memory through recomputation, and speed up training with custom kernels. The resulting Sparse Transformers handle sequences of tens of thousands of steps across hundreds of layers and are applied to raw bytes of text, images, and audio. A reader would care because this makes it feasible to capture long-range structure in data that was previously too costly to model directly.

Core claim

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to O(n√n). We also introduce a variation on architecture and initialization to train deeper networks, the recomputation of attention matrices to save memory, and fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling.

What carries the argument

Sparse factorizations of the attention matrix that lower complexity from quadratic to O(n√n) while supporting long-range dependencies.
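
To make the gap between quadratic and O(n√n) cost concrete, the following sketch counts query-key pairs per attention layer. It assumes, for illustration only, that each position attends to about 2√n others under the factorized scheme; the function names are hypothetical, and the constants in a real implementation would differ.

```python
import math


def dense_attention_pairs(n: int) -> int:
    """Query-key pairs scored by one causal dense attention layer."""
    return n * (n + 1) // 2


def factorized_attention_pairs(n: int) -> int:
    """Rough pair count when each position attends to about 2*sqrt(n) others.

    The factor of 2 is an illustrative assumption (two index sets of
    roughly sqrt(n) positions each), not a figure taken from the paper.
    """
    return n * 2 * math.isqrt(n)


if __name__ == "__main__":
    for n in (1024, 16384, 1_000_000):
        dense = dense_attention_pairs(n)
        sparse = factorized_attention_pairs(n)
        print(f"n={n:>9,}  dense={dense:.3e}  sparse={sparse:.3e}  "
              f"ratio={dense / sparse:.1f}x")
```

The ratio between the two counts grows roughly like √n, which is why the saving matters most at the longest sequence lengths.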

If this is right

Hundreds of layers become practical on sequences of tens of thousands of timesteps.
State-of-the-art density modeling results are reached on Enwik8, CIFAR-10, and ImageNet-64 from raw bytes.
Unconditional generation produces samples with global coherence and diversity.
Self-attention in principle extends to sequences of length one million or more.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Structured sparsity may suffice for many long-range dependencies instead of requiring full attention.
The approach could be tested on other high-dimensional sequence data such as video frames or full-length audio tracks.
Learned or data-adaptive sparsity patterns might further improve efficiency beyond the fixed factorizations used here.

Load-bearing premise

The chosen sparse factorizations of the attention matrix retain sufficient expressivity to capture the long-range dependencies needed for the reported density modeling tasks.

What would settle it

A head-to-head comparison on one of the long-sequence tasks where a full-attention transformer achieves clearly superior density estimates or sample coherence compared with the sparse version.

Original abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sparse factorizations cut transformer attention cost to O(n√n) and let the models train on sequences of 16k+ timesteps with new SOTA density numbers, but the absence of direct sparse-vs-dense ablations leaves the expressivity claim untested.

The core advance here is the pair of hand-designed sparse attention patterns (strided and fixed) that factor the attention matrix so cost drops from quadratic to O(n√n). They pair this with a deeper-network initialization, attention recomputation to save memory, and fast kernels, then train models with hundreds of layers on sequences up to tens of thousands of steps. The same architecture is applied to raw-byte modeling of text, images, and audio, beating prior numbers on Enwik8, CIFAR-10, and ImageNet-64 while producing samples that show global coherence over long ranges. They also sketch how the approach could reach a million steps in principle. That combination of scaling demonstration and concrete implementation tricks is what is actually new relative to the 2017-2018 transformer baseline.

The work is useful because it gives practitioners a workable recipe for longer contexts without needing to invent new hardware. The patterns are simple enough to implement and the memory tricks are immediately practical.

The main soft spot is the lack of a controlled comparison: there is no head-to-head run of the sparse version against a dense transformer on lengths like 1k or 2k where both are still feasible. Without that, it is hard to separate the benefit of sparsity from the deeper training and initialization changes. The SOTA claims would also land more solidly with error bars and fuller ablation tables on the individual components. The patterns are task-specific and hand-crafted, so it remains open whether they preserve enough expressivity for every long-range dependency that dense attention would capture.

This paper is aimed at people who need transformers on long sequences in language, vision, or audio and who are willing to trade some theoretical generality for practical scaling. A reader already working on efficient attention or scaling experiments will find the concrete numbers and code-level details worth examining.

It deserves a serious referee because the empirical results are substantial and the method is reproducible enough to test. I would send it to review, expecting the main questions to focus on the missing ablations and the precise contribution of each trick.

Referee Report

2 major / 2 minor

Summary. The paper introduces sparse factorizations of the self-attention matrix in Transformers that reduce complexity from quadratic to O(n√n). Combined with changes to architecture and initialization for training deeper networks, attention recomputation to reduce memory use, and optimized fast attention kernels, the resulting Sparse Transformers are shown to model sequences of tens of thousands of timesteps. The same architecture is applied to raw-byte modeling of text (Enwik8) and images (CIFAR-10 and ImageNet-64), achieving new state-of-the-art density modeling results and generating globally coherent unconditional samples; the work also indicates that self-attention can in principle handle sequences of length one million or more.

Significance. If the reported results hold under verification, the work is significant: it provides concrete, practical sparse attention patterns that enable self-attention to scale to sequence lengths far beyond the reach of dense Transformers, while retaining sufficient expressivity for high-quality density modeling on established benchmarks. The accompanying engineering contributions (recomputation, fast kernels) are immediately usable and lower the barrier to experimenting with longer contexts in language, vision, and audio.

major comments (2)

[§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.
[§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.

minor comments (2)

[§3] Notation for the two sparse patterns (strided vs. fixed) is introduced in §3 but the precise definition of the attention mask for each is only shown graphically in Figure 2; an explicit matrix-level equation would improve clarity.
[abstract] The claim that sequences of length one million are feasible “in principle” is stated in the abstract but is not supported by any timing or memory measurements at that scale.
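
The first minor comment asks for an explicit definition of the two masks. A minimal sketch of what such a definition could look like in code follows, assuming the commonly cited form of the two patterns: a stride `stride` (typically near √n) and, for the fixed pattern, a count `c` of summary positions per block. The function names, and the choice to merge both index sets into a single set per position, are illustrative assumptions rather than the paper's implementation.

```python
def strided_pattern(n: int, stride: int) -> list[set[int]]:
    """Allowed key positions per query under a strided causal pattern.

    Each position i sees the previous `stride` positions (plus itself)
    and every earlier position j with (i - j) divisible by `stride`.
    """
    allowed = []
    for i in range(n):
        local = set(range(max(0, i - stride), i + 1))
        column = {j for j in range(i + 1) if (i - j) % stride == 0}
        allowed.append(local | column)
    return allowed


def fixed_pattern(n: int, stride: int, c: int = 1) -> list[set[int]]:
    """Allowed key positions per query under a fixed causal pattern.

    Each position i sees earlier positions in its own block of width
    `stride`, plus the last `c` positions of every block up to i.
    """
    allowed = []
    for i in range(n):
        block = {j for j in range(i + 1) if j // stride == i // stride}
        summary = {j for j in range(i + 1) if j % stride >= stride - c}
        allowed.append(block | summary)
    return allowed


if __name__ == "__main__":
    print(sorted(strided_pattern(16, 4)[10]))
    print(sorted(fixed_pattern(16, 4, c=1)[10]))
```

With n = 16 and stride = 4, position 10 may attend to {2, 6, 7, 8, 9, 10} under the strided pattern and to {3, 7, 8, 9, 10} under the fixed pattern with c = 1. In both cases the number of allowed positions per query stays on the order of the stride rather than growing with n.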

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

Referee: [§3] §3, Eq. (3) and Figure 2: The central claim that the chosen strided and fixed sparse patterns retain sufficient expressivity for long-range dependencies rests on the SOTA density-modeling results, yet the manuscript contains no direct ablation of sparse versus dense attention on sequence lengths where dense attention remains tractable (e.g., n ≤ 2048). Without this comparison it is impossible to isolate whether the reported gains derive from the sparsity itself or from the deeper training and initialization changes.

Authors: We agree that a controlled ablation on shorter sequences would help isolate the contribution of the sparse patterns from the architectural and initialization changes. Although the primary motivation is scaling to lengths where dense attention is infeasible, we will add an ablation in the revised manuscript: we will train matched-depth dense and sparse models on sequences of length 512–2048 and report the resulting bits-per-byte (or bits-per-dim) to quantify any expressivity gap introduced by sparsity.
revision: yes

Referee: [§4–5] Experimental results (abstract and §4–5): The headline SOTA numbers on Enwik8, CIFAR-10, and ImageNet-64 are presented without error bars, without ablations that quantify the contribution of each proposed component (sparsity pattern, initialization, recomputation), and without an explicit statement of the exact training protocol and hyper-parameters. These omissions make the scaling claim difficult to reproduce or falsify.

Authors: We acknowledge that the current presentation lacks error bars, component-wise ablations, and a fully explicit training protocol, all of which are important for reproducibility. In the revised version we will (i) report standard deviations from at least three independent runs for the main Enwik8, CIFAR-10, and ImageNet-64 results, (ii) add ablation tables that isolate the effect of the sparse factorization, the deeper-network initialization, and attention recomputation, and (iii) include a detailed appendix listing all hyperparameters, optimizer settings, data preprocessing, and hardware used for each experiment.
revision: yes
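
The bits-per-byte and bits-per-dim metrics named in the first response are unit conversions of the model's negative log-likelihood. A minimal sketch, assuming the loss is measured in nats (natural logarithm) and using hypothetical helper names:

```python
import math


def nats_to_bits(loss_nats: float) -> float:
    """Convert a negative log-likelihood from nats to bits."""
    return loss_nats / math.log(2)


def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Bits per dimension for one example.

    `total_nll_nats` is the summed negative log-likelihood of the example;
    `num_dims` is the number of modeled values, e.g. 32 * 32 * 3 = 3072
    bytes for a CIFAR-10 image.
    """
    return nats_to_bits(total_nll_nats) / num_dims
```

For a byte-level text model each token is one byte, so the average per-token loss passed through `nats_to_bits` is already bits per byte.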

Circularity Check

0 steps flagged

No circularity: architectural proposal and empirical benchmarks are independent of fitted inputs or self-referential definitions.

The paper defines sparse attention factorizations (strided and fixed patterns) explicitly in §3 as a hand-designed reduction from dense O(n²) to O(n√n) attention, then evaluates the resulting model on standard external density-modeling benchmarks (Enwik8, CIFAR-10, ImageNet-64) whose test sets are disjoint from any training or hyperparameter choices. No equation equates a reported performance gain to a quantity defined by fitting the same data; no uniqueness theorem or ansatz is imported via self-citation to force the factorization choice; and the central claim (long-sequence modeling with hundreds of layers) rests on measured perplexity/BPD numbers rather than a renaming or self-definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that sparse attention patterns can approximate full attention for the density modeling tasks; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Sparse factorizations of the attention matrix preserve enough long-range modeling capacity for the target tasks.
Invoked to justify the O(n√n) reduction while still claiming SOTA performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · eight_tick_forces_D3; linking_requires_D3 · unclear

We introduce sparse factorizations of the attention matrix which reduce this to O(n√n)... We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers... setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel; Jcost · unclear

Sparse factorizations of the attention matrix... two 2d factorized attention schemes... strided... fixed

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Limits of Long-Context Transformers
cs.LG 2026-05 unverdicted novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Convergent Stochastic Training of Attention and Understanding LoRA
cs.LG 2026-05 unverdicted novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
cs.LG 2026-03 conditional novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
Rotation Equivariant Mamba for Vision Tasks
cs.CV 2026-03 unverdicted novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-e...
RULER: What's the Real Context Size of Your Long-Context Language Models?
cs.CL 2024-04 accept novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
Efficiently Modeling Long Sequences with Structured State Spaces
cs.LG 2021-10 unverdicted novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
Denoising Diffusion Probabilistic Models
cs.LG 2020-06 accept novelty 8.0

Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
Scaling Laws for Neural Language Models
cs.LG 2020-01 unverdicted novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
cs.LG 2026-05 unverdicted novelty 7.0

Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
Beyond Detection: A Structure-Aware Framework for Scene Text Tracking
cs.CV 2026-05 unverdicted novelty 7.0

SymTrack is the first systematic detection-free framework for scene text tracking that constructs benchmarks from video text spotting datasets and reports up to 11.97% AUC gains over prior trackers.
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
cs.GR 2026-05 unverdicted novelty 7.0

A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
cs.CL 2026-05 conditional novelty 7.0

EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
cs.LG 2026-05 unverdicted novelty 7.0

QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
gr-qc 2026-05 unverdicted novelty 7.0

Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
VORT: Adaptive Power-Law Memory for NLP Transformers
cs.LG 2026-05 unverdicted novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
SpecEdit: Training-Free Acceleration for Diffusion based Image Editing via Semantic Locking
cs.CV 2026-05 unverdicted novelty 7.0

SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 conditional novelty 7.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by introducing a second temporal operator in LTL, with global and local attention being expressively complementary.
Adaptive Head Budgeting for Efficient Multi-Head Attention
cs.LG 2026-04 unverdicted novelty 7.0

BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
Improving Sparse Autoencoder with Dynamic Attention
cs.LG 2026-04 unverdicted novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size
cs.CL 2026-04 unverdicted novelty 7.0

Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 7.0

LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
A Hormone-inspired Emotion Layer for Transformer language models (HELT)
cs.NE 2026-04 unverdicted novelty 7.0

HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction
cs.CV 2026-04 unverdicted novelty 7.0

Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
cs.LG 2026-04 unverdicted novelty 7.0

Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
cs.CV 2026-03 unverdicted novelty 7.0

Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
cs.AI 2025-11 unverdicted novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
cs.CL 2025-10 conditional novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
IAFormer: Interaction-Aware Transformer network for collider data analysis
hep-ph 2025-05 unverdicted novelty 7.0

IAFormer uses boost-invariant pairwise quantities and differential attention to create a sparse Transformer that achieves state-of-the-art classification on top-quark and quark-gluon jet datasets while using over an o...
Transformer Neural Processes - Kernel Regression
cs.LG 2024-11 unverdicted novelty 7.0

TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regressio...
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
cs.LG 2024-07 accept novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
cs.CL 2024-04 conditional novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
cs.LG 2024-02 unverdicted novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
cs.LG 2024-02 unverdicted novelty 7.0

HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, a...
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
Scalable Diffusion Models with Transformers
cs.CV 2022-12 unverdicted novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
cs.LG 2022-05 accept novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
High-Resolution Image Synthesis with Latent Diffusion Models
cs.CV 2021-12 conditional novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Scaling Laws for Autoregressive Generative Modeling
cs.LG 2020-10 accept novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
cs.CL 2020-06 unverdicted novelty 7.0

DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...
Longformer: The Long-Document Transformer
cs.CL 2020-04 accept novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Augmenting Self-attention with Persistent Memory
cs.LG 2019-07 unverdicted novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.
Approaching I/O-optimality for Approximate Attention
cs.LG 2026-05 unverdicted novelty 6.0

Presents I/O-efficient algorithms for approximate attention with almost-linear cost in n, approaching lower bounds in most parameter regimes.
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models
cs.CV 2026-05 unverdicted novelty 6.0

Polynomial replacements for activations in MLPs, convolutions, and attention within MetaFormer yield PolyNeXt models that match or exceed standard performance on ImageNet, ADE20K, and robustness benchmarks while beati...
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer
cs.GR 2026-05 unverdicted novelty 6.0

A transformer with prediction-correction and hierarchical super-token encoding unifies simulation across six physical dynamics categories on shared Lagrangian particles.
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
Phasor Memory Networks: Stable Backpropagation Through Time for Scalable Explicit Memory
cs.LG 2026-05 unverdicted novelty 6.0

PMNet uses unitary phasor dynamics and hierarchical anchors to make explicit memory stable for long sequences, matching a 3x larger Mamba model on long-context robustness with a 119M parameter network.
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
cs.LG 2026-05 conditional novelty 6.0

KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
cs.CL 2026-05 unverdicted novelty 6.0

FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 141 Pith papers · 5 internal anchors

[1]

Character-Level Language Modeling with Deeper Self-Attention

Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444,

work page Pith review arXiv
[2]

Layer Normalization

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Efficient Attention using a Fixed-Size Memory Representation

Britz, D., Guan, M. Y., and Luong, M.-T. Efficient attention using a fixed-size memory representation. arXiv preprint arXiv:1707.00110,

work page arXiv
[4]

Training Deep Nets with Sublinear Memory Cost

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174,

work page internal anchor Pith review arXiv
[5]

Pixelsnail: An improved autoregressive generative model

Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763,

work page arXiv
[6]

Monotonic Chunkwise Attention

Chiu, C.-C. and Raffel, C. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382,

work page arXiv
[7]

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122,

work page Pith review arXiv
[8]

Identity Mappings in Deep Residual Networks

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027,

work page Pith review arXiv
[9]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. arXiv preprint arXiv:1606.08415,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Music Transformer

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., Hawthorne, C., Dai, A. M., Hoffman, M. D., and Eck, D. An improved relative self-attention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281,

work page Pith review arXiv
[11]

Exploring the Limits of Language Modeling

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

work page Pith review arXiv
[12]

A Clockwork RNN

Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. A clockwork rnn. arXiv preprint arXiv:1402.3511,

work page arXiv
[13]

Generating Wikipedia by Summarizing Long Sequences

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

work page Pith review arXiv
[14]

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A., and Bengio, Y. Samplernn: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837,

work page Pith review arXiv
[15]

Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling

Menick, J. and Kalchbrenner, N. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608,

work page Pith review arXiv
[16]

Mixed Precision Training

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

work page internal anchor Pith review arXiv
[17]

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759,

work page Pith review arXiv
[18]

Image Transformer

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., and Ku, A. Image transformer. arXiv preprint arXiv:1802.05751,

work page Pith review arXiv
[19]

Reed, S., Oord, A. v. d., Kalchbrenner, N., Colmenarejo, S. G., Wang, Z., Belov, D., and de Freitas, N. Parallel multiscale autoregressive density estimation. arXiv preprint arXiv:1703.03664,

work page arXiv
[20]

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517,

work page Pith review arXiv
[21]

WaveNet: A Generative Model for Raw Audio

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. CoRR abs/1609.03499,

work page internal anchor Pith review arXiv
[22]

Attention Is All You Need

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017

work page 2017
