For uniform keys on the d-dimensional sphere, softmax attention becomes selective at inverse temperature scaling β_n* ≍ n^{2/(d-1)}, with explicit limiting laws for attention weights and outputs in each regime.
Generating Long Sequences with Sparse Transformers
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
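The abstract names sparse factorizations of the attention matrix but does not describe the pattern. As a rough illustration only, the sketch below assumes a strided factorization: one mask covers a local window of the previous `stride` positions, a second covers every `stride`-th earlier position, and `stride` is set near √n. The function name `strided_sparse_masks` and the sizes are hypothetical choices for this example, not taken from the paper's code.

```python
import numpy as np


def strided_sparse_masks(n, stride):
    """Build two causal boolean masks of shape (n, n): local and strided.

    Assumed pattern (illustrative): the local mask lets query i see keys
    i-stride+1 .. i; the strided mask lets it see keys i, i-stride, i-2*stride, ...
    """
    i = np.arange(n)[:, None]  # query positions (rows)
    j = np.arange(n)[None, :]  # key positions (columns)
    causal = j <= i
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided


n = 256
stride = int(np.sqrt(n))  # 16, so each row keeps on the order of sqrt(n) keys
local, strided = strided_sparse_masks(n, stride)

# Number of (query, key) pairs evaluated by dense causal attention
# versus by the union of the two sparse masks.
dense_pairs = n * (n + 1) // 2
sparse_pairs = int((local | strided).sum())

# Two-hop connectivity: query i reaches key j if it attends (strided) to some
# position k that in turn attends (local) to j.
two_hop = (strided.astype(np.int64) @ local.astype(np.int64)) > 0
causal = np.tril(np.ones((n, n), dtype=bool))

print(f"dense pairs:  {dense_pairs}")
print(f"sparse pairs: {sparse_pairs} (bound 2*n*stride = {2 * n * stride})")
print(f"every causal pair reachable in two hops: {bool((two_hop == causal).all())}")
```

Under these assumptions each row keeps at most about 2√n keys, so the total work scales as n√n rather than n², while stacking the two masks still connects every earlier position to every later one.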
representative citing papers
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.
EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.
Denoising diffusion probabilistic models generate high-quality images by learning to reverse a fixed forward diffusion process, achieving FID 3.17 on CIFAR10.
Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
Meta-Attention introduces per-token Bayesian routing among attention mechanisms via amortised variational inference with a Dirichlet prior, yielding lower projected FLOP cost than prior-free routing on a Tiny LM benchmark.
Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
SymTrack is the first systematic detection-free framework for scene text tracking that constructs benchmarks from video text spotting datasets and reports up to 11.97% AUC gains over prior trackers.
A transformer with prediction-correction and hierarchical super-token merging unifies simulation of six physical dynamics categories on Lagrangian particles and generalizes to unseen conditions.
QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
SpecEdit accelerates diffusion-based image editing up to 10x by using a low-resolution draft to identify edit-relevant tokens via semantic discrepancies for selective high-resolution denoising.
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
LoSA caches prefix attention for stable tokens in block-wise DLMs and applies sparse attention only to active tokens, preserving near-dense accuracy while achieving 1.54x lower attention density and up to 4.14x speedup.
HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
citing papers explorer
- A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Evaluating Large Language Models Trained on Code
Codex achieves 28.8% pass@1 on HumanEval, rising to 70.2% with 100 samples per problem via repeated sampling.
- VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
TransUNet is a hybrid CNN-Transformer architecture that outperforms prior U-Net and Transformer baselines on multi-organ and cardiac medical image segmentation tasks.
- Scaling Laws for Transfer
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
- Deformable DETR: Deformable Transformers for End-to-End Object Detection
Deformable DETR achieves higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs by using sparse deformable attention on image features.
- Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
- Jukebox: A Generative Model for Music
Jukebox generates high-fidelity and diverse songs with singing and coherence up to multiple minutes by compressing raw audio via multi-scale VQ-VAE and modeling the codes with large autoregressive Transformers conditioned on artist, genre, and unaligned lyrics.
- Compressive Transformers for Long-Range Sequence Modelling
Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.
- CTRL: A Conditional Transformer Language Model for Controllable Generation
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
- Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning
Feature alignment quality determines whether concatenation or cross-attention excels for multimodal fusion, with concatenation winning on pre-aligned features due to lower sample complexity O(dv+dt) versus O(dv*dt).
- Dynamic Video Generation: Shaping Video Generation Across Time and Space
DVG dynamically selects content-aware spatio-temporal acceleration strategies for diffusion-based video generation, delivering up to 7x speedup with near-lossless quality on models like HunyuanVideo.
- CALMem: Application-Layer Dual Memory for Conversational AI
CALMem delivers virtually unbounded effective context for LLM conversations via an application-layer dual memory architecture with intra-session retrieval and token-adaptive injection.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- Kwai Summary Attention Technical Report
Kwai Summary Attention compresses historical contexts into learnable summary tokens to reduce sequence modeling cost to O(n/k) while preserving linear KV cache and long-range dependencies.
- Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
Tool Attention cuts tool-related tokens by 95% and raises context utilization from 24% to 91% in a 120-tool simulation via dynamic gating and lazy loading.
- Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
- Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.
- Sessa: Selective State Space Attention
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
- Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models
Fine-tuned recurrent models like Mamba2 produce competitive text embeddings with linear-time constant-memory inference via vertical chunking, outperforming transformers in memory use.
- Heterogeneous Connectivity in Sparse Networks: Fan-in Profiles, Gradient Hierarchy, and Topological Equilibria
Arbitrary heterogeneous fan-in profiles in sparse networks match uniform random accuracy at high sparsity, but initializing RigL dynamic sparse training with equilibrium-matched lognormal profiles improves performance by up to 0.49% on classification tasks.
- Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation
SSR uses static random filters and iterative competitive sparse mechanisms to explicitly enforce sparsity in recommendation models, outperforming dense baselines on public and billion-scale industrial datasets.
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.
- Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
Block-sparse global attention accelerates multi-view reconstruction transformers by over 3x by exploiting concentrated attention on cross-view correspondences.
- gpt-oss-120b & gpt-oss-20b Model Card
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
- ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.
- World Model on Million-Length Video And Language With Blockwise RingAttention
Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
- Mistral 7B
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
- Agglomerative Attention
Presents agglomerative attention, a linear-complexity attention model that achieves comparable performance to full attention on language modeling tasks.
- Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.
- BFLA: Block-Filtered Long-Context Attention Mechanism
BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside retained tiles, delivering speedups on models like Llama 3.1 with minimal accuracy 0
- Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
- Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models
Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.
- LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
- A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
- A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- Transformer-Enhanced Reinforcement Learning: Fundamentals and Applications in Communication Networks
A survey of Transformer-enhanced reinforcement learning fundamentals and applications in communication networks covering resource allocation, computation offloading, routing, trajectory control, and security.
- EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
- AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
- Stochastic Sparse Attention for Memory-Bound Inference
- Characterizing the Expressivity of Local Attention in Transformers
- A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
- Adaptive Head Budgeting for Efficient Multi-Head Attention
- Simplified Sparse Attention via Gist Tokens
- Accelerating Sparse Transformer Inference on GPU