Linformer: Self-Attention with Linear Complexity

Belinda Z. Li; Han Fang; Hao Ma; Madian Khabsa; Sinong Wang

arXiv: 2006.04768 · v3 · submitted 2020-06-08 · cs.LG · stat.ML

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Pith reviewed 2026-05-12 00:29 UTC · model grok-4.3

keywords: self-attention, transformer, linear complexity, low-rank approximation, efficient NLP, sequence length, memory efficiency

The pith

Self-attention in transformers can be approximated by a low-rank matrix to reduce complexity to linear in sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large transformer models achieve strong results on language tasks but face high costs from the quadratic scaling of standard self-attention. The paper establishes that the attention matrix can be closely approximated by a low-rank form. Projecting the key and value sequences to a much smaller fixed dimension before the dot-product step turns the full computation linear in sequence length. The resulting Linformer model matches the accuracy of the original transformer on typical benchmarks while using substantially less memory and time.

Core claim

The self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

What carries the argument

Low-rank projection matrices applied to the key and value vectors before attention, which replace the full n-by-n matrix with a much smaller n-by-k matrix where k is fixed and far smaller than n.
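The shape change described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names, the random (untrained) projections E and F, their 1/sqrt(k) scaling, and the sizes n, d, k are all assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Full attention: the score matrix has shape (n, n).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

def linformer_attention(Q, K, V, E, F):
    # Projected attention: E and F (each k x n) compress the sequence
    # axis of K and V, so the score matrix has shape (n, k).
    d = Q.shape[-1]
    K_proj = E @ K                      # (k, d)
    V_proj = F @ V                      # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)  # (n, k)
    return softmax(scores) @ V_proj     # (n, d)

# Illustrative sizes and random projections (assumptions, not trained weights).
rng = np.random.default_rng(0)
n, d, k = 512, 64, 32
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(k)
F = rng.standard_normal((k, n)) / np.sqrt(k)

out = linformer_attention(Q, K, V, E, F)
score_shape = (Q @ (E @ K).T).shape
print(out.shape, score_shape)
```

As a sanity check on the sketch, setting E and F to the n-by-n identity (so k = n) recovers standard attention exactly; the saving comes only from choosing k smaller than n.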

Load-bearing premise

The low-rank projections, whether learned or fixed, retain enough information from the original attention scores for the model to succeed on the tasks and sequence lengths it will see.

What would settle it

If the Linformer shows a clear accuracy gap compared with the standard transformer on a task that uses sequences several times longer than those seen during training, the low-rank approximation would be shown insufficient.

Original abstract

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Linformer shows a clean low-rank projection on keys and values that turns self-attention linear and keeps performance close on the tested NLP tasks, but offers no error bounds or length-extrapolation checks.

The main point is that projecting keys and values down to a fixed small dimension k before attention lets the authors rewrite the computation to run in linear time and space. The algebra is straightforward and correct, and they demonstrate it on GLUE, WikiText, and machine translation with results that stay competitive against the full quadratic version while using far less memory for longer sequences. That practical payoff is what makes the paper worth reading if you care about scaling context length without blowing up compute.

Referee Report

3 major / 2 minor

Summary. The paper claims that self-attention in Transformers can be approximated via low-rank projections on the key and value matrices (using fixed or learned E, F matrices of size k x n with k << n), reducing attention complexity from O(n²) to O(n) in time and space. The resulting Linformer model is shown to achieve competitive performance with standard Transformers on GLUE, WikiText-103, and machine translation benchmarks while being more memory- and time-efficient.

Significance. If the empirical claims hold under broader validation, this is a significant contribution to efficient sequence modeling. It offers a practical architectural change that preserves the core attention mechanism while delivering linear scaling, which is valuable for long-context applications. The algebraic correctness of the low-rank rewriting and the competitive numbers on public NLP benchmarks are strengths; the work provides a clear efficiency gain without requiring entirely new attention formulations.

major comments (3)

[§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
[§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.
[§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.

minor comments (2)

[Figure 1] Figure 1 and the surrounding text could include a small diagram explicitly showing the shapes of E and F and how they are applied to K and V.
[§3.2] The complexity analysis in §3.2 would benefit from an explicit step-by-step derivation of the O(n) claim including the cost of the projections themselves.
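One way to spell out the step-by-step accounting requested in the last comment is sketched below. It assumes a single head with Q, K, V of shape n-by-d and projections E, F of shape k-by-n, and it counts dense matrix products naively; the notation is introduced here for illustration and is not taken from the paper's §3.2.

```latex
% Assumed notation: one head, Q, K, V \in \mathbb{R}^{n \times d}; E, F \in \mathbb{R}^{k \times n}.
\begin{align*}
\text{project keys and values: } & EK,\; FV \in \mathbb{R}^{k \times d}
  && \text{cost } O(nkd) \\
\text{scores: } & S = \frac{Q\,(EK)^{\top}}{\sqrt{d}} \in \mathbb{R}^{n \times k}
  && \text{cost } O(nkd) \\
\text{row-wise softmax: } & P = \operatorname{softmax}(S) \in \mathbb{R}^{n \times k}
  && \text{cost } O(nk) \\
\text{output: } & P\,(FV) \in \mathbb{R}^{n \times d}
  && \text{cost } O(nkd) \\[4pt]
\text{total: } & O(nkd) \text{ time}, \quad O(nk + nd) \text{ memory}
  && \text{vs. } O(n^{2}d) \text{ time},\; O(n^{2} + nd) \text{ memory}
\end{align*}
```

Under these assumptions the projections cost the same order as the attention products themselves, so the total stays linear in n only as long as k and d are held fixed as n grows. The O(nk) memory term also covers the entries of E and F.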

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating the revisions we intend to make.

Referee: [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.

Authors: We appreciate this observation. While the manuscript does not include a formal error bound, we provide empirical analysis demonstrating that attention matrices exhibit low effective rank, justifying the projection (see the singular value plots in the paper). The performance retention is validated across multiple tasks. In revision, we will add further discussion on how the approximation error scales with k and n based on these observations, though a complete theoretical bound remains an open question for future work.
revision: partial

Referee: [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.

Authors: We agree that multiple random seeds and ablations would enhance the rigor. Our reported results follow the single-run convention common for such large-scale experiments due to resource constraints. We will rerun key experiments with multiple seeds to report means and standard deviations, and include ablations on the projection dimension k for different n and tasks in the revised manuscript.
revision: yes

Referee: [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.

Authors: This point highlights an important aspect of generalization. The current experiments adhere to the standard fixed-length settings of the benchmarks. We will extend the evaluation in the revision to include tests with longer sequences and some domain shifts to verify that the learned projections maintain effectiveness when the attention rank increases with n.
revision: yes

standing simulated objections not resolved

Providing a formal error bound or complete theoretical analysis of the approximation error's dependence on n, k, and effective rank

Circularity Check

0 steps flagged

No significant circularity; architectural proposal with independent empirical validation

The Linformer derivation proposes an explicit architectural change: projecting the key and value matrices via learned low-rank matrices E and F of size k x n (k << n), so that the O(n²) attention matrix is approximated at O(n) complexity. This is not obtained by fitting parameters to a target quantity and then renaming the fit as a prediction, nor by self-referential definitions or load-bearing self-citations. The low-rank property is motivated by empirical observation of attention matrices, but the method itself is a constructive proposal whose performance is measured on held-out public benchmarks (e.g., GLUE, SQuAD) with standard training protocols. No equation reduces to its own input by construction, and the central claim retains independent content beyond any cited prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that attention matrices are approximately low-rank and on the assumption that a fixed or learned projection dimension k suffices for downstream tasks. No new physical entities or unproven mathematical axioms are introduced.

free parameters (1)

projection dimension k
Chosen by the authors (typically 128 or 256) and controls the quality-efficiency trade-off; its value is not derived from first principles.
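The efficiency side of this trade-off can be made concrete with a little arithmetic. The sketch below counts entries in one head's score matrix for full attention (n x n) and for projected attention (n x k); the helper name and the sequence lengths are illustrative assumptions, and only k = 128 and k = 256 come from the ledger entry above.

```python
def score_matrix_entries(n, k=None):
    """Entries in one head's score matrix: n*n for full attention, n*k when projected."""
    return n * n if k is None else n * k

# Hypothetical sequence lengths, chosen only to show how the gap widens with n.
for n in (512, 4096, 65536):
    full = score_matrix_entries(n)
    for k in (128, 256):
        proj = score_matrix_entries(n, k)
        print(f"n={n:>6} k={k}: full={full:,} projected={proj:,} ratio={full / proj:.1f}x")
```

The ratio is simply n / k, so the saving grows with sequence length for a fixed k. The sketch says nothing about the quality side of the trade-off, which has to be measured empirically.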

axioms (1)

domain assumption: The attention matrix admits a useful low-rank approximation for the tasks considered.
Invoked in Section 3 to justify the projection; no proof is given that this holds for arbitrary sequences or domains.
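A reader who wants to probe this assumption on their own attention matrices could inspect the singular value spectrum, as in the sketch below. It is an illustration under stated assumptions: the helper names are hypothetical, the 90% threshold is arbitrary, and the random Gaussian Q and K are stand-ins that say nothing about the spectra of trained models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spectrum_profile(P):
    # Singular values (descending) and their normalized cumulative sum.
    s = np.linalg.svd(P, compute_uv=False)
    cum = np.cumsum(s) / s.sum()
    return s, cum

def rank_for_mass(cum, threshold=0.9):
    # Smallest number of singular values whose cumulative share reaches the threshold.
    return int(np.searchsorted(cum, threshold) + 1)

# Synthetic stand-in data (assumption): replace P with a real attention matrix.
rng = np.random.default_rng(0)
n, d = 256, 64
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
P = softmax(Q @ K.T / np.sqrt(d))

s, cum = spectrum_profile(P)
r90 = rank_for_mass(cum, 0.9)
print(f"{r90} of {n} singular values carry 90% of the spectral mass")
```

If the count stays small relative to n, and stays roughly flat as n grows, that supports the assumption for the data at hand; if it grows with n, a fixed k becomes harder to justify.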

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convergent Stochastic Training of Attention and Understanding LoRA
cs.LG 2026-05 unverdicted novelty 8.0

Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
Nearly Optimal Attention Coresets
cs.DS 2026-05 unverdicted novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
VMamba: Visual State Space Model
cs.CV 2024-01 conditional novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
cs.CL 2023-08 unverdicted novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
cs.LG 2026-05 unverdicted novelty 7.0

Derives a blockwise resolvent-style attention operator that exploits structured sparsity for subquadratic O(n^{4/3}d) entity tracking while matching dense accuracy.
Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging
hep-ex 2026-05 unverdicted novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection
cs.LG 2026-05 conditional novelty 7.0

ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inf...
VORT: Adaptive Power-Law Memory for NLP Transformers
cs.LG 2026-05 unverdicted novelty 7.0

VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
Projection-Free Transformers via Gaussian Kernel Attention
cs.LG 2026-05 unverdicted novelty 7.0

Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG 2026-04 unverdicted novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
cs.LG 2026-04 unverdicted novelty 7.0

Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
cs.LG 2026-04 unverdicted novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
cs.LG 2026-04 unverdicted novelty 7.0

HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
cs.LG 2026-04 unverdicted novelty 7.0

Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
Collapse-Free Prototype Readout Layer for Transformer Encoders
cs.LG 2026-04 unverdicted novelty 7.0

DDCL-Attention introduces a collapse-free prototype readout for transformers that decomposes the training loss exactly into reconstruction and diversity terms while providing stability guarantees via singular perturba...
The Volterra signature
stat.ML 2026-03 unverdicted novelty 7.0

The Volterra signature is a kernel-weighted tensor feature map for paths that is injective, universally approximating, and computable via linear ODEs or a two-parameter integral equation.
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights
cs.LG 2026-02 unverdicted novelty 7.0

MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
CoFrGeNet: Continued Fraction Architectures for Language Generation
cs.CL 2026-01 unverdicted novelty 7.0

CoFrGeNets implement a continued-fraction function class as plug-in replacements for transformer blocks, delivering competitive or superior downstream performance on GPT2-xl and Llama3-scale models with one-half to tw...
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
cs.LG 2025-10 unverdicted novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts
cs.LG 2025-10 unverdicted novelty 7.0

RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.
Transformer Neural Processes - Kernel Regression
cs.LG 2024-11 unverdicted novelty 7.0

TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regressio...
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
cs.LG 2022-05 accept novelty 7.0

FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
Perceiver IO: A General Architecture for Structured Inputs & Outputs
cs.LG 2021-07 unverdicted novelty 7.0

Perceiver IO is a general architecture that processes arbitrary structured inputs and outputs with linear scaling and achieves strong results on GLUE, Sintel optical flow, multi-task reasoning, and StarCraft II withou...
Rethinking Attention with Performers
cs.LG 2020-09 unverdicted novelty 7.0

Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
cs.LG 2026-05 unverdicted novelty 6.0

ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
Towards Understanding Self-Pretraining for Sequence Classification
cs.LG 2026-05 unverdicted novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.
COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
cs.AI 2026-05 unverdicted novelty 6.0

COAgents introduces a cooperative multi-agent system with a partial search graph to guide intensification and diversification in vehicle routing problems, achieving new state-of-the-art results among learning-based me...
Spectral Progressive Diffusion for Efficient Image and Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Spectral Progressive Diffusion accelerates image and video generation in pretrained diffusion models by progressively growing resolution along the denoising trajectory using spectral noise expansion and a power spectr...
ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices
cs.CV 2026-05 unverdicted novelty 6.0

ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
cs.CV 2026-05 unverdicted novelty 6.0

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
Elastic Attention Cores for Scalable Vision Transformers
cs.CV 2026-05 unverdicted novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Nectar: Neural Estimation of Cached-Token Attention via Regression
cs.LG 2026-05 unverdicted novelty 6.0

Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
cs.LG 2026-05 unverdicted novelty 6.0

A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
Gated Subspace Inference for Transformer Acceleration
cs.LG 2026-05 unverdicted novelty 6.0

Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
Stochastic Sparse Attention for Memory-Bound Inference
cs.LG 2026-05 accept novelty 6.0

SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...
Linear-Time Global Visual Modeling without Explicit Attention
cs.CV 2026-05 unverdicted novelty 6.0

Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
GateMOT: Q-Gated Attention for Dense Object Tracking
cs.CV 2026-04 unverdicted novelty 6.0

GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
cs.LG 2026-04 unverdicted novelty 6.0

ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
cs.LG 2026-04 unverdicted novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
cs.CV 2026-04 unverdicted novelty 6.0

DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
cs.IR 2026-04 unverdicted novelty 6.0

RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
cs.SE 2026-04 unverdicted novelty 6.0

Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection
cs.CR 2026-04 unverdicted novelty 6.0

ESPRESSO achieves over 0.99 true positive rate at 10^{-3} false positive rate for stepping-stone intrusion detection on synthetic data for SSH, SOCAT, ICMP, DNS and mixed protocols, outperforming DeepCoFFEA while also...
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
cs.CV 2026-04 unverdicted novelty 6.0

PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
Why Attend to Everything? Focus is the Key
cs.CL 2026-03 conditional novelty 6.0

Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
CoFrGeNet: Continued Fraction Architectures for Language Generation
cs.CL 2026-01 unverdicted novelty 6.0

CoFrGeNet uses continued-fraction function classes to build transformer replacements that match or beat GPT-2 and Llama performance with half to two-thirds the parameters.
When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models
cs.AI 2026-01 unverdicted novelty 6.0

AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
cs.LG 2025-12 unverdicted novelty 6.0

BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
SURF: Signature-Retained Fast Video Generation
cs.GR 2025-11 unverdicted novelty 6.0

SURF accelerates high-resolution video generation up to 12.5x by using noise reshifting for low-res previews from pretrained models and a shifting-window Refiner for efficient upscaling that retains original signatures.
Cambrian-S: Towards Spatial Supersensing in Video
cs.CV 2025-11 unverdicted novelty 6.0

Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise o...
Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live
cs.OS 2025-11 unverdicted novelty 6.0

Continuum applies a time-to-live mechanism to KV cache retention during tool calls in multi-turn LLM agents, reporting over 8x faster average job completion times on benchmarks including SWE-Bench with models up to 35...
Higher-order Linear Attention
cs.LG 2025-10 unverdicted novelty 6.0

Higher-order Linear Attention realizes second-order and higher interactions in linear-time causal attention via constant-size state and associative scans.
Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL 2025-10 unverdicted novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
SpikingBrain: Spiking Brain-inspired Large Models
cs.LG 2025-09 unverdicted novelty 6.0

SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
Lizard: An Efficient Linearization Framework for Large Language Models
cs.CL 2025-07 unverdicted novelty 6.0

Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 85 Pith papers · 11 internal anchors

[1]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[2]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[3]

Training Deep Nets with Sublinear Memory Cost

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174,

work page internal anchor Pith review arXiv
[4]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[5]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186,

work page 2019
[6]

Training with quantization noise for extreme fixed-point compression

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and Armand Joulin. Training with quantization noise for extreme fixed-point compression. arXiv preprint arXiv:2004.07320,

work page arXiv 2004
[7]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692,

work page internal anchor Pith review Pith/arXiv arXiv 1907
[9]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Mixed Precision Training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

work page internal anchor Pith review arXiv
[11]

Transformers with convolutional context for ASR

Abdelrahman Mohamed, Dmytro Okhonko, and Luke Zettlemoyer. Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660,

work page arXiv 1904
[12]

fairseq: A fast, extensible toolkit for sequence modeling

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48–53,

work page 2019
[13]

Blockwise self-attention for long document understanding

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972,

work page arXiv 1911
[14]

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683,

work page internal anchor Pith review arXiv 1910
[15]

SQuAD: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392,

work page 2016
[16]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[17]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642,

work page 2013
[19]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

URL http://arxiv.org/abs/1804.07461.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27,

work page internal anchor Pith review arXiv
[20]

(JL, for short), the following version is from (Arriaga & Vempala, 2006). Lemma

work page 2006
