pith. sign in

super hub Canonical reference

Generating Long Sequences with Sparse Transformers

Canonical reference. 82% of citing Pith papers cite this work as background.

145 Pith papers citing it
Background 82% of classified citations
abstract

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.

hub tools

citation-role summary

background 26 method 6 baseline 1

citation-polarity summary

claims ledger

  • abstract Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same a

authors

co-cited works

clear filters

representative citing papers

Scaling Limits of Long-Context Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.

Rotation Equivariant Mamba for Vision Tasks

cs.CV · 2026-03-10 · unverdicted · novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Efficiently Modeling Long Sequences with Structured State Spaces

cs.LG · 2021-10-31 · unverdicted · novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.

Denoising Diffusion Probabilistic Models

cs.LG · 2020-06-19 · accept · novelty 8.0

Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.

Scaling Laws for Neural Language Models

cs.LG · 2020-01-23 · unverdicted · novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.

VORT: Adaptive Power-Law Memory for NLP Transformers

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

citing papers explorer

Showing 10 of 10 citing papers after filters.

  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces cs.LG · 2023-12-01 · unverdicted · none · ref 14 · internal anchor

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  • LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding cs.CL · 2023-08-28 · unverdicted · none · ref 77 · internal anchor

    LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

  • Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 13 · internal anchor

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  • MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 4 · internal anchor

    MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.

  • Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs cs.CL · 2023-10-03 · conditional · none · ref 71 · internal anchor

    FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.

  • Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 154 · internal anchor

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models cs.LG · 2023-09-25 · accept · none · ref 27 · internal anchor

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  • H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cs.LG · 2023-06-24 · unverdicted · none · ref 9 · internal anchor

    H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

  • Mistral 7B cs.CL · 2023-10-10 · accept · none · ref 6 · internal anchor

    Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.

  • A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 67 · internal anchor

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.