pith. machine review for the scientific record.

arxiv: 2604.03260 · v2 · submitted 2026-03-12 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Why Attend to Everything? Focus is the Key

Pith reviewed 2026-05-15 11:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Focus attention · efficient transformers · sparse attention · long context modeling · pretrained model composition · centroid gating · attention sparsity

The pith

Focus learns centroids to selectively gate long-range attention, allowing it to be added to any pretrained model with no performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that full attention is often unnecessary because a small number of learned centroids can determine which token pairs should attend to each other. By training only these centroids while freezing the rest of a pretrained transformer, Focus can be composed onto existing models of various sizes and architectures. This approach maintains or even improves perplexity and benchmark performance, while also providing substantial speedups for long sequences. The insight is that attention can be made sparse in a structured way without losing the critical dependencies that models rely on. If this holds, it means efficient long-context modeling becomes much more accessible without massive retraining costs.

Core claim

Focus adds a small set of learnable centroids per layer that partition tokens into groups, restricting long-range attention to pairs within the same group. When composed onto pretrained models by training only the centroids, this yields zero degradation on downstream tasks from 124M to 70B parameters across five attention architectures, and can outperform full attention in some cases.

What carries the argument

Learnable centroids that serve as gates: tokens attend long-range only if assigned to the same centroid group.
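
The gating rule is simple enough to sketch. The following is a hypothetical reconstruction, not the paper's implementation: it assumes hard nearest-centroid assignment in key space, an always-open local window, and a causal mask, none of which the review pins down.

```python
import numpy as np

def focus_mask(keys, centroids, local_window=4):
    """Sketch of centroid-gated sparse attention (assumed mechanics).

    keys      -- (n, d) token key vectors from the frozen model
    centroids -- (k, d) the only trainable parameters Focus adds
    Returns an (n, n) boolean mask; True means the pair may attend.
    """
    # Hard-assign each token to its nearest centroid (one plausible rule).
    d2 = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    group = d2.argmin(axis=1)                                       # (n,)
    same_group = group[:, None] == group[None, :]
    # Keep a local window open regardless of group, as sparse-attention
    # designs typically do; long-range pairs pass only within a group.
    idx = np.arange(keys.shape[0])
    local = np.abs(idx[:, None] - idx[None, :]) < local_window
    causal = idx[:, None] >= idx[None, :]
    return (same_group | local) & causal

rng = np.random.default_rng(0)
mask = focus_mask(rng.normal(size=(16, 8)), rng.normal(size=(4, 8)))
```

Under this reading, training would update only `centroids` (with a soft assignment in place of `argmin`); the frozen model's attention scores are simply multiplied by the resulting mask.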

If this is right

  • Pretrained models can gain efficient long-range attention by training a tiny number of additional parameters.
  • Sparse Focus attention can match or exceed dense attention quality at certain scales.
  • Significant inference speedups are possible, up to 8.6x at million-token lengths using FlashAttention decomposition.
  • Focus works across model sizes and attention variants without custom kernel development for basic use.
  • The method supports training from scratch at larger scales with performance parity to full attention.
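
The "top-k group membership" speedup can be read as multi-group assignment. A minimal sketch, assuming each token joins its k nearest centroids and a pair is gated open when the membership sets overlap (the review does not specify the actual rule):

```python
import numpy as np

def topk_gate(keys, centroids, k=2):
    """Hypothetical top-k membership gate: token i may attend token j
    iff their k-nearest-centroid sets share at least one centroid."""
    d2 = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, c)
    members = np.argsort(d2, axis=1)[:, :k]                         # (n, k)
    onehot = np.zeros((keys.shape[0], centroids.shape[0]), dtype=bool)
    np.put_along_axis(onehot, members, True, axis=1)
    # Pairwise overlap of membership sets.
    return (onehot[:, None, :] & onehot[None, :, :]).any(-1)

rng = np.random.default_rng(1)
gate = topk_gate(rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), k=2)
```

Larger k opens more pairs, trading sparsity (speed) for recall of long-range dependencies, which would be consistent with the reported 2x speedup at better-than-baseline quality.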

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach implies that language model attention patterns can be captured by low-dimensional group assignments rather than full pairwise computations.
  • Integrating Focus earlier in training might yield even stronger models optimized for sparsity from the start.
  • Similar centroid-based gating could extend to other components like feed-forward layers for further efficiency gains.
  • Models using Focus might scale to context lengths far beyond current practical limits with linear rather than quadratic costs.

Load-bearing premise

A small number of learned centroids can accurately identify all the important long-range token interactions without overlooking any that full attention would use.

What would settle it

Observing a drop in perplexity or accuracy on standard benchmarks after composing Focus onto a pretrained model would falsify the no-degradation claim.

Figures

Figures reproduced from arXiv: 2604.03260 by Ahmed Murtadha, Changling Liu, Guan Wang, Hengshuai Yao, Jin Li, Mingli Yuan, Sen Song, Shuai Shao, William Chen, Xing Chen, Yasin Abbasi Yadkori.

Figure 1. Quality–speed Pareto frontier of efficient attention retrofits on GPT-2 124M / PG-19 (figure not reproduced).
Original abstract

Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into pretrained models, they often degrade perplexity, downstream accuracy, or both. We introduce Focus, a method that learns which token pairs matter. Focus adds a small set of learnable centroids--as few as 148K parameters per layer--that act as gates: only token pairs belonging to the same centroid group attend to each other over long ranges. Focus is composable: it can be added to any pretrained model by training only the centroids while keeping all original weights frozen. Experiments show that composing Focus onto pretrained models yields zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures. Surprisingly, sparse Focus attention outperforms full attention at 124M scale (30.3 vs. 31.4 perplexity) and matches full attention when trained from scratch at 7B scale (13.82 vs. 13.89). Focus is also fast: top-k group membership gives a 2x speedup with better quality than the original pretrained model. Using our FlashAttention decomposition, Focus achieves an 8.6x speedup at 1M tokens without custom kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Focus, a sparse attention method that adds a small set of learnable centroids (148K parameters per layer) acting as gates so that only token pairs in the same centroid group attend over long ranges. Focus is presented as composable with pretrained models by training solely the centroids while freezing all original weights. Experiments claim zero degradation on downstream benchmarks across model sizes 124M–70B and five attention architectures, with outperformance at 124M scale (30.3 vs. 31.4 perplexity) and matching performance when trained from scratch at 7B, plus speedups of 2x (top-k) and 8.6x at 1M tokens via FlashAttention decomposition.

Significance. If the zero-degradation and composability results hold, Focus would be a notable contribution to efficient attention by enabling sparsity with minimal parameter overhead and no full-model retraining. The breadth of evaluation across scales and architectures, plus the reported outperformance at small scale and long-context speedups, would strengthen its practical impact. The approach of learning a partition via centroids is conceptually simple and could generalize if the grouping reliably preserves critical dependencies.

major comments (3)
  1. [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.
  2. [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.
  3. [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.
minor comments (2)
  1. [Abstract] 'top-k group membership' is invoked for the 2x speedup but the value of k and its selection criterion are not defined.
  2. The manuscript would benefit from pseudocode or a clear equation for the centroid-based attention mask to clarify how group membership is computed and applied during inference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to improve the empirical rigor and analysis as outlined.

Point-by-point responses
  1. Referee: [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.

    Authors: We agree that reporting error bars, multiple seeds, and statistical tests would strengthen verifiability of the zero-degradation results. In the revised manuscript we will rerun the primary experiments (including the 124M and 7B cases) with at least three random seeds, include standard deviations, and add paired statistical tests to confirm that differences are significant. This directly addresses the concern. revision: yes

  2. Referee: [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.

    Authors: We acknowledge that an explicit analysis of attention mass across boundaries would provide useful supporting evidence. While the lack of degradation in our broad empirical evaluation suggests the partitions preserve necessary dependencies, we will add a new subsection in the revised paper that quantifies the fraction of original-model attention mass crossing centroid boundaries on held-out data for representative models. revision: yes
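
The diagnostic promised here is cheap to prototype. A sketch, assuming access to a full-attention matrix and a hard group assignment (both hypothetical stand-ins for the paper's internals):

```python
import numpy as np

def cross_boundary_mass(attn, group):
    """Fraction of full-attention probability mass that a same-group
    gate would block. attn is (n, n) row-stochastic; group is (n,) ids.
    A value near 0 supports the load-bearing premise; a large value
    predicts degradation by construction."""
    blocked = group[:, None] != group[None, :]
    return float((attn * blocked).sum() / attn.sum())

# Toy check: softmax attention over random scores; with every token in
# a single group, nothing is blocked.
rng = np.random.default_rng(0)
scores = rng.normal(size=(32, 32))
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
frac_random = cross_boundary_mass(attn, rng.integers(0, 4, size=32))
frac_single = cross_boundary_mass(attn, np.zeros(32, dtype=int))
```

Run per layer on held-out data with the fitted centroids' assignments, this single number would directly test the referee's concern.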

  3. Referee: [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.

    Authors: We agree these ablations are important for supporting the parameter-efficiency claim. In the revised version we will add experiments varying centroid count (e.g., 4–32), comparing initialization strategies, using different centroid-training data subsets, and evaluating on additional downstream tasks beyond the current benchmarks. These results will be reported for multiple model scales. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical composability claim is externally validated

full rationale

The paper defines Focus via a new set of learnable centroids (148K parameters per layer) that induce a sparse attention mask over token groups. The load-bearing claim is the empirical observation that freezing the original pretrained weights and training only these centroids produces zero degradation on downstream benchmarks from 124M to 70B scale. No equation or derivation reduces the reported perplexity or accuracy numbers to a quantity defined in terms of the same fitted centroids; the result is measured against held-out external benchmarks and multiple attention architectures. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the central result. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the empirical effectiveness of a small number of learned centroids to select relevant token pairs; this introduces new free parameters whose values are determined by training rather than derived from prior theory.

free parameters (1)
  • centroids = 148K per layer
    148K learnable parameters per layer that define token grouping for attention gating; their values are fit during the Focus training stage.
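
To put the figure in perspective, a back-of-envelope calculation, taking the per-layer count from the abstract and assuming GPT-2 small's 12 layers for the 124M model:

```python
per_layer = 148_000   # learnable centroid parameters per layer (abstract)
n_layers = 12         # GPT-2 124M has 12 transformer layers
base = 124_000_000
added = per_layer * n_layers
overhead = 100 * added / base
print(f"{added:,} added params, {overhead:.2f}% of the base model")
# → 1,776,000 added params, 1.43% of the base model
```

Even at this assumed depth, the trainable footprint stays under 2% of the base model, consistent with the composability framing.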
axioms (1)
  • domain assumption: Base model uses standard quadratic attention that can be sparsified by group membership without loss of expressivity
    Invoked when claiming that restricting attention to same-centroid pairs preserves full model capability.
invented entities (1)
  • centroids (no independent evidence)
    purpose: Learnable gates that partition tokens into groups for selective long-range attention
    New structures introduced by the paper; no independent evidence outside the training process is provided.

pith-pipeline@v0.9.0 · 5552 in / 1267 out tokens · 60540 ms · 2026-05-15T11:51:24.869712+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  2. [2]

    LoRA learns less and forgets less

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Havens, Robert Jennings, Daniel King, Sam Havens, Nick Blankenship, et al. LoRA learns less and forgets less. Transactions on Machine Learning Research, 2024

  3. [3]

    Class-based n-gram models of natural language

    Peter F Brown, Vincent J Della Pietra, Peter V deSouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4): 467--480, 1992

  4. [4]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, 2020

  5. [5]

    Rethinking attention with performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  6. [6]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024

  7. [7]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  9. [9]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024b

  11. [11]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022

  12. [12]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  13. [13]

    OLMo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Joshi, Valentina Pyatkin, et al. OLMo: Accelerating the science of language models. In Annual Meeting of the Association for Computational Linguistics, 2024

  14. [14]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  15. [15]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

  16. [16]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a

  17. [17]

    MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024b

  18. [18]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020

  19. [19]

    Scaling laws for fine-grained mixture of experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Piotrowski, Piotr Sankowski, Michał Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Jaszczur, et al. Scaling laws for fine-grained mixture of experts. In International Conference on Machine Learning, 2024

  20. [20]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Horace Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Amnon Shashua, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  21. [21]

    Ring attention with blockwise transformers for near-infinite context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, 2024a

  22. [22]

    DoRA: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024b

  23. [23]

    The Llama 3 Herd of Models

    Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    MoBA: Mixture of block attention for long-context LLMs

    Shuming Lu et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025

  25. [25]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109--165. Elsevier, 1989

  26. [26]

    From sparse to soft mixtures of experts

    Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In International Conference on Learning Representations, 2024

  27. [27]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9: 53--68, 2021

  28. [28]

    FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems, 2024

  29. [29]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  30. [30]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  31. [31]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  32. [32]

    Differential transformer

    Tianzhu Ye, Li Li, Gao Huang, et al. Differential transformer. In International Conference on Learning Representations, 2025

  33. [33]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Liu, Zhaozhuo Zhang, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Annual Meeting of the Association for Computational Linguistics, 2025

  34. [34]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020

  35. [35]

    The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry

    Michael Zhang, Kush Bhatia, Jonathan Ragan-Kelley, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In International Conference on Learning Representations, 2024

  36. [36]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Zhangyang Wang, Beidi Chen, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2024

  37. [37]

    Loki: Low-rank keys for efficient sparse attention

    Prajwal Singhania, Siddharth Nrusimha, Chih-Ping Park, and Joo-Young Kim. Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542, 2024

  38. [38]

    SparQ Attention: Bandwidth-efficient LLM inference

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Sheridan, Thang Bui, and Walterio Mayol-Cuevas. SparQ Attention: Bandwidth-efficient LLM inference. In International Conference on Machine Learning, 2024

  39. [39]

    MagicPIG: LSH sampling for efficient LLM generation

    Zhuoming Chen, Ranajoy Sadhukhan, Ying Ye, Yang Chen, Baris Kasikci, and Hao Zheng. MagicPIG: LSH sampling for efficient LLM generation. In International Conference on Machine Learning, 2024