pith. machine review for the scientific record.

arxiv: 2604.03260 · v2 · submitted 2026-03-12 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Why Attend to Everything? Focus is the Key

Pith reviewed 2026-05-15 11:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Focus attention · efficient transformers · sparse attention · long context modeling · pretrained model composition · centroid gating · attention sparsity

The pith

Focus learns centroids to selectively gate long-range attention, allowing it to be added to any pretrained model with no performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that full attention is often unnecessary because a small number of learned centroids can determine which token pairs should attend to each other. By training only these centroids while freezing the rest of a pretrained transformer, Focus can be composed onto existing models of various sizes and architectures. This approach maintains or even improves perplexity and benchmark performance, while also providing substantial speedups for long sequences. The insight is that attention can be made sparse in a structured way without losing the critical dependencies that models rely on. If this holds, it means efficient long-context modeling becomes much more accessible without massive retraining costs.

Core claim

Focus adds a small set of learnable centroids per layer that partition tokens into groups, restricting long-range attention to pairs within the same group. When composed onto pretrained models by training only the centroids, this yields zero degradation on downstream tasks from 124M to 70B parameters across five attention architectures, and can outperform full attention in some cases.

What carries the argument

Learnable centroids that serve as gates: tokens attend long-range only if assigned to the same centroid group.
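
The gating rule is simple enough to sketch. The following is a hypothetical reconstruction, not the paper's implementation: it assumes hard nearest-centroid assignment in key space, an always-open local window, and a causal mask, none of which the review pins down.

```python
import numpy as np

def focus_mask(keys, centroids, local_window=4):
    """Sketch of centroid-gated sparse attention (assumed mechanics).

    keys      -- (n, d) token key vectors from the frozen model
    centroids -- (k, d) the only trainable parameters Focus adds
    Returns an (n, n) boolean mask; True means the pair may attend.
    """
    # Hard-assign each token to its nearest centroid (one plausible rule).
    d2 = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    group = d2.argmin(axis=1)                                       # (n,)
    same_group = group[:, None] == group[None, :]
    # Keep a local window open regardless of group, as sparse-attention
    # designs typically do; long-range pairs pass only within a group.
    idx = np.arange(keys.shape[0])
    local = np.abs(idx[:, None] - idx[None, :]) < local_window
    causal = idx[:, None] >= idx[None, :]
    return (same_group | local) & causal

rng = np.random.default_rng(0)
mask = focus_mask(rng.normal(size=(16, 8)), rng.normal(size=(4, 8)))
```

Under this reading, training would update only `centroids` (with a soft assignment in place of `argmin`); the frozen model's attention scores are simply multiplied by the resulting mask.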

If this is right

  • Pretrained models can gain efficient long-range attention by training a tiny number of additional parameters.
  • Sparse Focus attention can match or exceed dense attention quality at certain scales.
  • Significant inference speedups are possible, up to 8.6x at million-token lengths using FlashAttention decomposition.
  • Focus works across model sizes and attention variants without custom kernel development for basic use.
  • The method supports training from scratch at larger scales with performance parity to full attention.
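
The "top-k group membership" speedup can be read as multi-group assignment. A minimal sketch, assuming each token joins its k nearest centroids and a pair is gated open when the membership sets overlap (the review does not specify the actual rule):

```python
import numpy as np

def topk_gate(keys, centroids, k=2):
    """Hypothetical top-k membership gate: token i may attend token j
    iff their k-nearest-centroid sets share at least one centroid."""
    d2 = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, c)
    members = np.argsort(d2, axis=1)[:, :k]                         # (n, k)
    onehot = np.zeros((keys.shape[0], centroids.shape[0]), dtype=bool)
    np.put_along_axis(onehot, members, True, axis=1)
    # Pairwise overlap of membership sets.
    return (onehot[:, None, :] & onehot[None, :, :]).any(-1)

rng = np.random.default_rng(1)
gate = topk_gate(rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), k=2)
```

Larger k opens more pairs, trading sparsity (speed) for recall of long-range dependencies, which would be consistent with the reported 2x speedup at better-than-baseline quality.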

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This approach implies that language model attention patterns can be captured by low-dimensional group assignments rather than full pairwise computations.
  • Integrating Focus earlier in training might yield even stronger models optimized for sparsity from the start.
  • Similar centroid-based gating could extend to other components like feed-forward layers for further efficiency gains.
  • Models using Focus might scale to context lengths far beyond current practical limits with linear rather than quadratic costs.

Load-bearing premise

A small number of learned centroids can accurately identify all the important long-range token interactions without overlooking any that full attention would use.

What would settle it

Observing a drop in perplexity or accuracy on standard benchmarks after composing Focus onto a pretrained model would falsify the no-degradation claim.

Figures

Figures reproduced from arXiv: 2604.03260 by Ahmed Murtadha, Changling Liu, Guan Wang, Hengshuai Yao, Jin Li, Mingli Yuan, Sen Song, Shuai Shao, William Chen, Xing Chen, Yasin Abbasi Yadkori.

Figure 1. Quality–speed Pareto frontier of efficient attention retrofits on GPT-2 124M / PG-19 (figure not reproduced).
Original abstract

Standard attention scales quadratically with sequence length. Efficient attention methods reduce this O(n^2) cost, but when retrofitted into pretrained models, they often degrade perplexity, downstream accuracy, or both. We introduce Focus, a method that learns which token pairs matter. Focus adds a small set of learnable centroids--as few as 148K parameters per layer--that act as gates: only token pairs belonging to the same centroid group attend to each other over long ranges. Focus is composable: it can be added to any pretrained model by training only the centroids while keeping all original weights frozen. Experiments show that composing Focus onto pretrained models yields zero degradation on downstream benchmarks across model sizes from 124M to 70B parameters and five attention architectures. Surprisingly, sparse Focus attention outperforms full attention at 124M scale (30.3 vs. 31.4 perplexity) and matches full attention when trained from scratch at 7B scale (13.82 vs. 13.89). Focus is also fast: top-k group membership gives a 2x speedup with better quality than the original pretrained model. Using our FlashAttention decomposition, Focus achieves an 8.6x speedup at 1M tokens without custom kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Focus, a sparse attention method that adds a small set of learnable centroids (148K parameters per layer) acting as gates so that only token pairs in the same centroid group attend over long ranges. Focus is presented as composable with pretrained models by training solely the centroids while freezing all original weights. Experiments claim zero degradation on downstream benchmarks across model sizes 124M–70B and five attention architectures, with outperformance at 124M scale (30.3 vs. 31.4 perplexity) and matching performance when trained from scratch at 7B, plus speedups of 2x (top-k) and 8.6x at 1M tokens via FlashAttention decomposition.

Significance. If the zero-degradation and composability results hold, Focus would be a notable contribution to efficient attention by enabling sparsity with minimal parameter overhead and no full-model retraining. The breadth of evaluation across scales and architectures, plus the reported outperformance at small scale and long-context speedups, would strengthen its practical impact. The approach of learning a partition via centroids is conceptually simple and could generalize if the grouping reliably preserves critical dependencies.

major comments (3)
  1. [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.
  2. [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.
  3. [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.
minor comments (2)
  1. [Abstract] 'top-k group membership' is invoked for the 2x speedup but the value of k and its selection criterion are not defined.
  2. The manuscript would benefit from pseudocode or a clear equation for the centroid-based attention mask to clarify how group membership is computed and applied during inference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to improve the empirical rigor and analysis as outlined.

Point-by-point responses
  1. Referee: [Experiments] The zero-degradation claim (e.g., 30.3 vs. 31.4 perplexity at 124M and matching results up to 70B) is presented without error bars, multiple random seeds, or statistical significance tests. This directly affects verifiability of the central composability result, as small variance could mask degradation.

    Authors: We agree that reporting error bars, multiple seeds, and statistical tests would strengthen verifiability of the zero-degradation results. In the revised manuscript we will rerun the primary experiments (including the 124M and 7B cases) with at least three random seeds, include standard deviations, and add paired statistical tests to confirm that differences are significant. This directly addresses the concern. revision: yes

  2. Referee: [Method] The gating mechanism partitions tokens via learned centroids so that cross-group long-range pairs are blocked. No analysis is provided (e.g., measuring original-model attention mass across centroid boundaries on held-out data) to confirm that all high-attention long-range dependencies fall inside groups; if any critical cross-centroid pairs exist, the sparse mask must cause degradation by construction.

    Authors: We acknowledge that an explicit analysis of attention mass across boundaries would provide useful supporting evidence. While the lack of degradation in our broad empirical evaluation suggests the partitions preserve necessary dependencies, we will add a new subsection in the revised paper that quantifies the fraction of original-model attention mass crossing centroid boundaries on held-out data for representative models. revision: yes
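
The diagnostic promised here is cheap to prototype. A sketch, assuming access to a full-attention matrix and a hard group assignment (both hypothetical stand-ins for the paper's internals):

```python
import numpy as np

def cross_boundary_mass(attn, group):
    """Fraction of full-attention probability mass that a same-group
    gate would block. attn is (n, n) row-stochastic; group is (n,) ids.
    A value near 0 supports the load-bearing premise; a large value
    predicts degradation by construction."""
    blocked = group[:, None] != group[None, :]
    return float((attn * blocked).sum() / attn.sum())

# Toy check: softmax attention over random scores; with every token in
# a single group, nothing is blocked.
rng = np.random.default_rng(0)
scores = rng.normal(size=(32, 32))
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
frac_random = cross_boundary_mass(attn, rng.integers(0, 4, size=32))
frac_single = cross_boundary_mass(attn, np.zeros(32, dtype=int))
```

Run per layer on held-out data with the fitted centroids' assignments, this single number would directly test the referee's concern.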

  3. Referee: [Experiments] Ablations on centroid count, initialization, training data for the centroids, and sensitivity to downstream tasks are absent. These are load-bearing for the claim that 148K parameters per layer suffice to recover the necessary partition across 124M–70B scales without degradation.

    Authors: We agree these ablations are important for supporting the parameter-efficiency claim. In the revised version we will add experiments varying centroid count (e.g., 4–32), comparing initialization strategies, using different centroid-training data subsets, and evaluating on additional downstream tasks beyond the current benchmarks. These results will be reported for multiple model scales. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical composability claim is externally validated

full rationale

The paper defines Focus via a new set of learnable centroids (148K parameters per layer) that induce a sparse attention mask over token groups. The load-bearing claim is the empirical observation that freezing the original pretrained weights and training only these centroids produces zero degradation on downstream benchmarks from 124M to 70B scale. No equation or derivation reduces the reported perplexity or accuracy numbers to a quantity defined in terms of the same fitted centroids; the result is measured against held-out external benchmarks and multiple attention architectures. No self-citation chain, uniqueness theorem, or ansatz smuggling is invoked to justify the central result. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the empirical effectiveness of a small number of learned centroids to select relevant token pairs; this introduces new free parameters whose values are determined by training rather than derived from prior theory.

free parameters (1)
  • centroids = 148K per layer
    148K learnable parameters per layer that define token grouping for attention gating; their values are fit during the Focus training stage.
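
To put the figure in perspective, a back-of-envelope calculation, taking the per-layer count from the abstract and assuming GPT-2 small's 12 layers for the 124M model:

```python
per_layer = 148_000   # learnable centroid parameters per layer (abstract)
n_layers = 12         # GPT-2 124M has 12 transformer layers
base = 124_000_000
added = per_layer * n_layers
overhead = 100 * added / base
print(f"{added:,} added params, {overhead:.2f}% of the base model")
# → 1,776,000 added params, 1.43% of the base model
```

Even at this assumed depth, the trainable footprint stays under 2% of the base model, consistent with the composability framing.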
axioms (1)
  • domain assumption: Base model uses standard quadratic attention that can be sparsified by group membership without loss of expressivity
    Invoked when claiming that restricting attention to same-centroid pairs preserves full model capability.
invented entities (1)
  • centroids (no independent evidence)
    purpose: Learnable gates that partition tokens into groups for selective long-range attention
    New structures introduced by the paper; no independent evidence outside the training process is provided.

pith-pipeline@v0.9.0 · 5552 in / 1267 out tokens · 60540 ms · 2026-05-15T11:51:24.869712+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  2. [2]

    LoRA learns less and forgets less

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Havens, Robert Jennings, Daniel King, Sam Havens, Nick Blankenship, et al. LoRA learns less and forgets less. Transactions on Machine Learning Research, 2024

  3. [3]

    Class-based n-gram models of natural language

    Peter F Brown, Vincent J Della Pietra, Peter V deSouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4): 467--480, 1992

  4. [4]

    Unsupervised learning of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, 2020

  5. [5]

    Rethinking attention with performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  6. [6]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024

  7. [7]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, 2024

  8. [8]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022

  9. [9]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024b

  11. [11]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022

  12. [12]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  13. [13]

    OLMo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Joshi, Valentina Pyatkin, et al. OLMo: Accelerating the science of language models. In Annual Meeting of the Association for Computational Linguistics, 2024

  14. [14]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  15. [15]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

  16. [16]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024a

  17. [17]

    MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In Advances in Neural Information Processing Systems, 2024b

  18. [18]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020

  19. [19]

    Scaling laws for fine-grained mixture of experts

    Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Piotrowski, Piotr Sankowski, Michał Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Jaszczur, et al. Scaling laws for fine-grained mixture of experts. In International Conference on Machine Learning, 2024

  20. [20]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Horace Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Amnon Shashua, and Yoav Shoham. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024

  21. [21]

    Ring attention with blockwise transformers for near-infinite context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, 2024a

  22. [22]

    DoRA: Weight-decomposed low-rank adaptation

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024b

  23. [23]

    The Llama 3 Herd of Models

    Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    MoBA: Mixture of block attention for long-context LLMs

    Shuming Lu et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025

  25. [25]

    Catastrophic interference in connectionist networks: The sequential learning problem

    Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109--165. Elsevier, 1989

  26. [26]

    From sparse to soft mixtures of experts

    Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. From sparse to soft mixtures of experts. In International Conference on Learning Representations, 2024

  27. [27]

    Efficient content-based sparse attention with routing transformers

    Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9: 53--68, 2021

  28. [28]

    FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems, 2024

  29. [29]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  30. [30]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  31. [31]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  32. [32]

    Differential transformer

    Tianzhu Ye, Li Li, Gao Huang, et al. Differential transformer. In International Conference on Learning Representations, 2025

  33. [33]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Liu, Zhaozhuo Zhang, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Annual Meeting of the Association for Computational Linguistics, 2025

  34. [34]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020

  35. [35]

    The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry

    Michael Zhang, Kush Bhatia, Jonathan Ragan-Kelley, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In International Conference on Learning Representations, 2024

  36. [36]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Zhangyang Wang, Beidi Chen, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2024

  37. [37]

    Loki: Low-rank keys for efficient sparse attention

    Prajwal Singhania, Siddharth Nrusimha, Chih-Ping Park, and Joo-Young Kim. Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542, 2024

  38. [38]

    SparQ Attention: Bandwidth-efficient LLM inference

    Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Sheridan, Thang Bui, and Walterio Mayol-Cuevas. SparQ Attention: Bandwidth-efficient LLM inference. In International Conference on Machine Learning, 2024

  39. [39]

    MagicPIG: LSH sampling for efficient LLM generation

    Zhuoming Chen, Ranajoy Sadhukhan, Ying Ye, Yang Chen, Baris Kasikci, and Hao Zheng. MagicPIG: LSH sampling for efficient LLM generation. In International Conference on Machine Learning, 2024