Recognition: no theorem link
Why Attend to Everything? Focus is the Key
Pith reviewed 2026-05-15 11:51 UTC · model grok-4.3
The pith
Focus learns centroids that selectively gate long-range attention, so it can be added to any pretrained model with no performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Focus adds a small set of learnable centroids per layer that partition tokens into groups, restricting long-range attention to pairs within the same group. When composed onto pretrained models by training only the centroids, this yields zero degradation on downstream tasks from 124M to 70B parameters across five attention architectures, and can outperform full attention in some cases.
What carries the argument
Learnable centroids that serve as gates: tokens attend long-range only if assigned to the same centroid group.
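To make the gate concrete, here is a minimal sketch of centroid-gated masking, assuming hard nearest-centroid assignment, a causal model, and an always-attended local window; the paper's actual assignment rule, window size, and training objective for the centroids are not specified here.

```python
import torch

def focus_attention_mask(x, centroids, local_window=128):
    """Sketch of a centroid-gated sparse attention mask (not the paper's exact rule).

    x:         (seq_len, d_model) token representations
    centroids: (num_groups, d_model) learnable centroids for this layer
    Returns a boolean (seq_len, seq_len) mask: True = pair may attend.
    """
    seq_len = x.shape[0]
    # Hard assignment: each token joins its nearest centroid's group.
    group = torch.cdist(x, centroids).argmin(dim=-1)              # (seq_len,)
    same_group = group[:, None] == group[None, :]                 # (seq_len, seq_len)

    # Local pairs are always allowed; long-range pairs only within a group.
    idx = torch.arange(seq_len)
    is_local = (idx[:, None] - idx[None, :]).abs() < local_window
    causal = idx[:, None] >= idx[None, :]

    return causal & (is_local | same_group)
```

In the composable setting described above, only `centroids` would receive gradients while the pretrained weights stay frozen.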
If this is right
- Pretrained models can gain efficient long-range attention by training a tiny number of additional parameters.
- Sparse Focus attention can match or exceed dense attention quality at certain scales.
- Significant inference speedups are possible, up to 8.6x at million-token lengths using the FlashAttention decomposition (a combine step is sketched after this list).
- Focus works across model sizes and attention variants without custom kernel development for basic use.
- The method supports training from scratch at larger scales with performance parity to full attention.
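If the long-range part is restricted to same-group pairs, the full softmax can be recovered by computing local and same-group attention separately and merging them with their log-sum-exp statistics, as FlashAttention-style kernels do. The sketch below shows only that merge step; how Focus actually splits the key set is an assumption, not the paper's exact decomposition.

```python
import torch

def merge_partial_attention(o1, lse1, o2, lse2):
    """Combine attention outputs computed over two disjoint key sets.

    o1, o2:     (num_queries, d) partial attention outputs (e.g., a local window
                and the same-centroid-group keys)
    lse1, lse2: (num_queries,) log-sum-exp of the attention logits for each part
    Returns the output that full softmax attention over the union would give.
    """
    m = torch.maximum(lse1, lse2)              # shared max for numerical stability
    w1 = torch.exp(lse1 - m)[:, None]          # unnormalized softmax mass of part 1
    w2 = torch.exp(lse2 - m)[:, None]          # unnormalized softmax mass of part 2
    return (w1 * o1 + w2 * o2) / (w1 + w2)
```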
Where Pith is reading between the lines
- This approach implies that language model attention patterns can be captured by low-dimensional group assignments rather than full pairwise computations.
- Integrating Focus earlier in training might yield even stronger models optimized for sparsity from the start.
- Similar centroid-based gating could extend to other components like feed-forward layers for further efficiency gains.
- Models using Focus might scale to context lengths far beyond current practical limits with linear rather than quadratic costs.
Load-bearing premise
A small number of learned centroids can accurately identify all the important long-range token interactions without overlooking any that full attention would use.
What would settle it
Observing a drop in perplexity or accuracy on standard benchmarks after composing Focus onto a pretrained model would falsify the no-degradation claim.
read the original abstract
Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into pretrained models, they often degrade perplexity, downstream accuracy, or both. We introduce Focus, a method that learns which token pairs matter. Focus adds a small set of learnable centroids--as few as 148K parameters per layer--that act as gates: only token pairs belonging to the same centroid group attend to each other over long ranges. Focus is composable: it can be added to any pretrained model by training only the centroids while keeping all original weights frozen. Experiments show that composing Focus onto pretrained models yields zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures. Surprisingly, sparse Focus attention outperforms full attention at 124M scale (30.3 vs. 31.4 perplexity) and matches full attention when trained from scratch at 7B scale (13.82 vs. 13.89). Focus is also fast: top-k group membership gives a 2x speedup with better quality than the original pretrained model. Using our FlashAttention decomposition, Focus achieves an 8.6x speedup at 1M tokens without custom kernels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Focus, a sparse attention method that adds a small set of learnable centroids (148K parameters per layer) acting as gates so that only token pairs in the same centroid group attend over long ranges. Focus is presented as composable with pretrained models by training solely the centroids while freezing all original weights. Experiments claim zero degradation on downstream benchmarks across model sizes 124M–70B and five attention architectures, with outperformance at 124M scale (30.3 vs. 31.4 perplexity) and matching performance when trained from scratch at 7B, plus speedups of 2x (top-k) and 8.6x at 1M tokens via FlashAttention decomposition.
Significance. If the zero-degradation and composability results hold, Focus would be a notable contribution to efficient attention by enabling sparsity with minimal parameter overhead and no full-model retraining. The breadth of evaluation across scales and architectures, plus the reported outperformance at small scale and long-context speedups, would strengthen its practical impact. The approach of learning a partition via centroids is conceptually simple and could generalize if the grouping reliably preserves critical dependencies.
major comments (3)
- [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.
- [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.
- [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.
minor comments (2)
- [Abstract] 'Top-k group membership' is invoked for the 2x speedup, but the value of k and its selection criterion are not defined (one possible reading is sketched after this list).
- The manuscript would benefit from pseudocode or a clear equation for the centroid-based attention mask to clarify how group membership is computed and applied during inference.
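As a possible reading of 'top-k group membership' (an assumption, since the paper's definition of k and its scoring rule are not given here), each token could be assigned to its k highest-scoring centroids, with a long-range pair allowed whenever the membership sets overlap:

```python
import torch

def topk_shared_group(x, centroids, k=2):
    """Hypothetical top-k membership rule (illustrative, not the paper's definition).

    x:         (seq_len, d_model) token representations
    centroids: (num_groups, d_model) learnable centroids
    Returns a boolean (seq_len, seq_len) matrix: True where tokens share a group.
    """
    scores = x @ centroids.T                        # (seq_len, num_groups)
    topk = scores.topk(k, dim=-1).indices           # (seq_len, k)
    membership = torch.zeros(x.shape[0], centroids.shape[0])
    membership.scatter_(1, topk, 1.0)               # mark the k chosen groups per token
    return (membership @ membership.T) > 0          # pairs with overlapping membership
```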
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to improve the empirical rigor and analysis as outlined.
read point-by-point responses
- Referee: [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.
Authors: We agree that reporting error bars, multiple seeds, and statistical tests would strengthen verifiability of the zero-degradation results. In the revised manuscript we will rerun the primary experiments (including the 124M and 7B cases) with at least three random seeds, include standard deviations, and add paired statistical tests to confirm that differences are significant. This directly addresses the concern. revision: yes
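A minimal sketch of the seed-level comparison the response commits to, assuming per-seed perplexities are collected for both settings; the numbers and the choice of a paired t-test are illustrative, not results from the paper.

```python
from scipy import stats

# Illustrative placeholder perplexities over three seeds (not reported values).
ppl_focus    = [30.3, 30.5, 30.2]   # Focus composed onto the frozen pretrained model
ppl_baseline = [31.4, 31.3, 31.6]   # original full-attention model, same seeds

# Paired test on per-seed differences: is Focus at least as good as the baseline?
t_stat, p_value = stats.ttest_rel(ppl_focus, ppl_baseline)
mean_diff = sum(f - b for f, b in zip(ppl_focus, ppl_baseline)) / len(ppl_focus)
print(f"mean diff = {mean_diff:+.2f} ppl, t = {t_stat:.2f}, p = {p_value:.3f}")
```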
- Referee: [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.
Authors: We acknowledge that an explicit analysis of attention mass across boundaries would provide useful supporting evidence. While the lack of degradation in our broad empirical evaluation suggests the partitions preserve necessary dependencies, we will add a new subsection in the revised paper that quantifies the fraction of original-model attention mass crossing centroid boundaries on held-out data for representative models. revision: yes
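A minimal sketch of the proposed boundary-mass diagnostic, assuming access to an original-model attention matrix and a hard group assignment per token; the local-window exemption is an assumption.

```python
import torch

def cross_boundary_mass(attn, group, local_window=128):
    """Fraction of long-range attention mass that a centroid partition would block.

    attn:  (seq_len, seq_len) attention weights from the original (dense) model
    group: (seq_len,) centroid-group index per token
    A value near zero supports the claim that the partition preserves the
    dependencies the dense model actually uses.
    """
    seq_len = attn.shape[0]
    idx = torch.arange(seq_len)
    long_range = (idx[:, None] - idx[None, :]).abs() >= local_window
    cross_group = group[:, None] != group[None, :]
    blocked = (attn * (long_range & cross_group)).sum()
    return (blocked / attn.sum()).item()
```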
- Referee: [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.
Authors: We agree these ablations are important for supporting the parameter-efficiency claim. In the revised version we will add experiments varying centroid count (e.g., 4–32), comparing initialization strategies, using different centroid-training data subsets, and evaluating on additional downstream tasks beyond the current benchmarks. These results will be reported for multiple model scales. revision: yes
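A minimal sketch of the ablation grid described above; the stub functions and grid values are hypothetical placeholders for the paper's centroid-training and evaluation pipeline, not real APIs.

```python
import itertools

def train_centroids(num_centroids, init):
    """Placeholder for the centroid-training step (hypothetical, not the paper's API)."""
    return {"num_centroids": num_centroids, "init": init}

def evaluate_perplexity(model):
    """Placeholder evaluation; in practice this would run the downstream benchmarks."""
    return float("nan")

centroid_counts = [4, 8, 16, 32]
init_strategies = ["random", "kmeans_on_heldout_activations"]

results = {
    (k, init): evaluate_perplexity(train_centroids(k, init))
    for k, init in itertools.product(centroid_counts, init_strategies)
}
```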
Circularity Check
No circularity: empirical composability claim is externally validated
full rationale
The paper defines Focus via a new set of learnable centroids (148K parameters per layer) that induce a sparse attention mask over token groups. The load-bearing claim is the empirical observation that freezing the original pretrained weights and training only these centroids produces zero degradation on downstream benchmarks from 124M to 70B scale. No equation or derivation reduces the reported perplexity or accuracy numbers to a quantity defined in terms of the same fitted centroids; the result is measured against held-out external benchmarks and multiple attention architectures. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the central result. The central claim is therefore externally grounded rather than circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- centroids: 148K parameters per layer
axioms (1)
- Domain assumption: the base model uses standard quadratic attention that can be sparsified by group membership without loss of expressivity.
invented entities (1)
- centroids (no independent evidence)
Reference graph
Works this paper leans on
- [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [2] Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Havens, Robert Jennings, Daniel King, Sam Havens, Nick Blankenship, et al. LoRA learns less and forgets less. Transactions on Machine Learning Research, 2024.
- [3] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–480, 1992.
- [4] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, 2020.
- [5] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021.
- [6] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024.
- [7] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024.
- [8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- [9] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
- [10] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
- [11] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
- [12] Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- [13] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Joshi, Valentina Pyatkin, et al. OLMo: Accelerating the science of language models. In Annual Meeting of the Association for Computational Linguistics, 2024.
- [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [15] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [16] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- [17] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024.
- [18] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020.
- [19] Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Piotrowski, Piotr Sankowski, Michał Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Jaszczur, et al. Scaling laws for fine-grained mixture of experts. In International Conference on Machine Learning, 2024.
- [20] Opher Lieber, Barak Lenz, Horace Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Amnon Shashua, and Yoav Shoham. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [21] Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, 2024.
- [22] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024.
- [23] Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [24] Shuming Lu et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.
- [25] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
- [26] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In International Conference on Learning Representations, 2024.
- [27] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- [28] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems, 2024.
- [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
- [30] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- [31] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [32] Tianzhu Ye, Li Li, Gao Huang, et al. Differential transformer. In International Conference on Learning Representations, 2025.
- [33] Jingyang Yuan, Huazuo Liu, Zhaozhuo Zhang, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Annual Meeting of the Association for Computational Linguistics, 2025.
- [34] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020.
- [35] Michael Zhang, Kush Bhatia, Jonathan Ragan-Kelley, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In International Conference on Learning Representations, 2024.
- [36] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Zhangyang Wang, Beidi Chen, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2024.
- [37] Prajwal Singhania, Siddharth Nrusimha, Chih-Ping Park, and Joo-Young Kim. Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542, 2024.
- [38] Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Sheridan, Thang Bui, and Walterio Mayol-Cuevas. SparQ Attention: Bandwidth-efficient LLM inference. In International Conference on Machine Learning, 2024.
- [39] Zhuoming Chen, Ranajoy Sadhukhan, Ying Ye, Yang Chen, Baris Kasikci, and Hao Zheng. MagicPIG: LSH sampling for efficient LLM generation. In International Conference on Machine Learning, 2024.