
arxiv: 2605.24168 · v1 · pith:MBUVUZT4new · submitted 2026-05-22 · 💻 cs.AI · cs.LG

Inference Time Context Sparsity: Illusion or Opportunity?

Pith reviewed 2026-06-30 15:55 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords context sparsity · LLM inference · decode sparsity · attention efficiency · model robustness · long context processing · sparse kernels

The pith

LLMs remain accurate when most context tokens are ignored during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the heavy compute of dense attention for long contexts is not a fundamental limit but an artificial constraint that can be removed. It supports this view with an empirical study showing that models handle extreme inference-time context sparsity without any prior training for it. This holds across 20 models from five families, different context lengths, and tasks from retrieval to mathematical reasoning and agentic coding. The work also shows that current hardware already delivers large speedups from such sparsity. If the claim holds, future LLM systems can be built around principled context sparsity rather than dense processing.

Core claim

Current LLMs are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. This robustness exists even though the models were never trained for sparsity, and the process of projecting a long context into a much smaller hidden dimension already makes dense attention inherently lossy. Combined with the fact that sparse kernels on existing hardware such as the H100 deliver up to 10x acceleration at 50x sparsity, the results position extreme context sparsity as a feasible and beneficial foundation for LLM inference.

What carries the argument

Inference-time decode sparsity: the selective use of only a small fraction of context tokens when computing attention during decoding.
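
A minimal sketch of the idea, assuming an oracle top-k selector that sees the full attention scores (the page mentions top-k selection and an OracleTopK baseline, but the function names, shapes, and random data below are illustrative assumptions, not the authors' kernels):

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dense_decode_attention(q, K, V):
    """Single-query attention over all N cached tokens."""
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def topk_decode_attention(q, K, V, k):
    """Attend only to the k highest-scoring cached tokens.

    Assumption: an oracle selector that computes every score first.
    A practical system would pick the rows with a cheaper indexer.
    """
    scores = K @ q / np.sqrt(q.shape[0])
    keep = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    return softmax(scores[keep]) @ V[keep]

rng = np.random.default_rng(0)
N, d = 4096, 64                               # hypothetical context length and head dim
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
q = rng.standard_normal(d)

dense = dense_decode_attention(q, K, V)
sparse = topk_decode_attention(q, K, V, k=N // 50)   # 50x sparsity
rel_err = np.linalg.norm(dense - sparse) / np.linalg.norm(dense)
print(f"relative error at 50x sparsity: {rel_err:.3f}")
```

Random Gaussian keys give unusually diffuse attention scores, so the error printed here says nothing about real models; the sketch only shows where the selection step sits in a decode step.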

If this is right

  • Hardware such as the H100 can already deliver up to 10x speedups over dense kernels at 50x sparsity using sparse decode implementations.
  • No retraining or model modification is required to obtain the robustness and efficiency gains.
  • The same sparsity approach works across retrieval, multi-hop QA, mathematical reasoning, and agentic coding tasks.
  • Context sparsity can serve as a starting point for redesigning inference systems, training procedures, and model architectures.
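
The bandwidth side of the first point can be made concrete with back-of-the-envelope arithmetic. The sketch below counts the key and value bytes read in one decode step, following the O(N·d) versus O(k·d) contrast in the Figure 1 caption; the model shape and the 2-byte element size are hypothetical values chosen for illustration, not taken from the paper.

```python
def kv_bytes_per_step(tokens_read, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V read for one decode step (leading 2 = keys + values)."""
    return 2 * tokens_read * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)
N = 128 * 1024          # context length in tokens
k = N // 50             # tokens kept at 50x sparsity

dense_bytes = kv_bytes_per_step(N, **cfg)
sparse_bytes = kv_bytes_per_step(k, **cfg)
print(f"dense : {dense_bytes / 2**30:.2f} GiB per step")
print(f"sparse: {sparse_bytes / 2**30:.3f} GiB per step")
print(f"reduction in KV bytes read: {dense_bytes / sparse_bytes:.1f}x")
```

The byte ratio is only an upper bound on the achievable speedup, since it ignores the cost of selecting rows and other kernel overheads; the figure reported by the paper is up to 10x at 50x sparsity.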

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that explicitly encourage context sparsity could produce models that tolerate even higher sparsity ratios.
  • New attention mechanisms could be designed from the start to operate only on sparse context subsets rather than approximating dense attention.
  • Similar sparsity opportunities may exist in other layers or components beyond the attention mechanism.

Load-bearing premise

The observed robustness to sparsity will continue to hold for arbitrary models, tasks, context lengths, and sparsity ratios without retraining or architectural changes.

What would settle it

Performance on retrieval or reasoning tasks drops sharply for a new model family once sparsity exceeds 20x on contexts longer than those tested.

Figures

Figures reproduced from arXiv: 2605.24168 by Aditya Desai, Agniva Chowdhury, Anshumali Shrivastava, Ion Stoica, Joseph E. Gonzalez, Kumar Krishna Agrawal, Prithvi Dixit, Sahil Joshi.

Figure 1: Inference-time 50× context sparsity is bandwidth-friendly by construction (a) and retains near-dense quality across diverse workloads on a single model (b). (a) Three decode regimes share an HBM band but read from it differently. Dense reads the full KV cache every step (O(N·d) bytes); Sparse routes through a lightweight indexer that selects k rows (O(k·d) bytes, k≪N); Linear (Gated DeltaNet) reads a fixe…
Figure 2: RULER-HARD-32K score across families, scales, and sparsity.
Figure 3: vAttention vs OracleTopK retention at 50× sparsity on RULER-HARD-32K. Bars report relative score (sparse / dense); the horizontal line at 1.0 marks dense parity. Values ≥ 1.0 are bold green; values < 0.90 are muted red. Panels separate model family ((a) Qwen2.5, (b) Ministral 3). Traditionally, the failure of top-k sparse attention has been attributed to the diffusion of attention scores across the context.…
Figure 4: LOFT subspan-EM retention under 5× and 50× inference-time sparsity. Bars report relative score (sparse Subspan-EM / dense Subspan-EM); the horizontal line at 1.0 marks dense parity. Values ≥ 1.0 (sparse meets or exceeds dense) are bold green; values < 0.90 (over a 10% relative drop) are muted red. Layout: rows separate context length (top: 32K, bottom: 128K); columns separate model family (Qwen2.5: 3B–32B,…
Figure 5: We use AIME 2025 for evaluating long-form generation. Although we allow generations up to 65K tokens, models generate approximately 25K tokens on average across samples. Since sparse attention introduces approximation errors at each embedding update, an important concern is whether these errors compound across layers and, more critically, across autoregressively generated tokens. The results on AIME are th…
Figure 6: SWE-Bench Django head-to-head, Qwen3.5-27B under dense,
Figure 7: Empty-patch root cause changes with attention configuration.
Figure 8: Per-LLM-call prompt size is unaffected by sparsity on productive outcomes.
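
The captions for Figures 3 and 4 describe a retention metric: the sparse score divided by the dense score, with 1.0 marking dense parity and values below 0.90 flagged as more than a 10% relative drop. A small sketch of that bookkeeping follows; the function names and the example scores are made up for illustration and are not results from the paper.

```python
def retention(sparse_score, dense_score):
    """Relative score: sparse / dense. A value of 1.0 means dense parity."""
    if dense_score <= 0:
        raise ValueError("dense score must be positive")
    return sparse_score / dense_score

def label(r):
    """Bucket a retention value using the thresholds named in the captions."""
    if r >= 1.0:
        return "meets or exceeds dense"
    if r < 0.90:
        return "over a 10% relative drop"
    return "within 10% of dense"

# Made-up (sparse, dense) scores for illustration only.
examples = {"model-a": (0.82, 0.80), "model-b": (0.70, 0.80), "model-c": (0.76, 0.80)}
for name, (sparse, dense) in examples.items():
    r = retention(sparse, dense)
    print(f"{name}: retention={r:.3f} ({label(r)})")
```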
Original abstract

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript argues that dense attention during LLM inference is an artificial constraint rather than a fundamental requirement. It advances a theoretical claim that a query projects O(N) context information into a d-dimensional hidden space (d << N), rendering dense attention inherently lossy, and supports this with an empirical study across 20 models from five families. The study reports that current LLMs remain robust to inference-time decode sparsity on retrieval, multi-hop QA, mathematical reasoning, and agentic coding tasks, with sparse kernels delivering up to 10x speedup over FlashInfer at 50x sparsity on H100 hardware. The position is that extreme context sparsity should become a foundational principle for future inference, training, and architecture design.

Significance. If the reported robustness generalizes, the work would have substantial practical significance for scaling long-context and agentic LLM workloads by reducing attention compute and memory without retraining. The breadth of the multi-family empirical evaluation and the concrete hardware kernel results are strengths that could influence systems research. The theoretical dimensionality argument provides an intuitive motivation but does not itself predict downstream task accuracy under token dropping.

major comments (2)
  1. [Abstract] The central empirical claim of 'remarkable robustness' across arbitrary context lengths, sparsity ratios, model families, and task distributions rests on extrapolation from a finite study; the dimensionality argument explains potential lossiness but supplies no bound on when task performance will degrade, leaving the generalization assumption load-bearing and untested.
  2. [Abstract] The statement that 'current hardware is already sufficient' and the 10x speedup claim over FlashInfer at 50x sparsity require explicit reporting of the sparsity pattern selection method, prompt formatting controls, and statistical significance across the 20 models to rule out confounds in the experimental setup.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the exact sparsity ratios, context lengths, and task metrics used in the 'extensive study' to allow readers to assess the scope of the robustness claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the scope of our empirical claims and the need for clearer experimental details. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central empirical claim of 'remarkable robustness' across arbitrary context lengths, sparsity ratios, model families, and task distributions rests on extrapolation from a finite study; the dimensionality argument explains potential lossiness but supplies no bound on when task performance will degrade, leaving the generalization assumption load-bearing and untested.

    Authors: We agree that the study is finite and does not furnish a theoretical bound on degradation points. The dimensionality argument is offered only as intuition for why dense attention may be lossy, not as a predictive guarantee. Our central claim is the observed empirical trend of robustness across the tested regimes (20 models, 5 families, varied lengths/ratios/tasks). In revision we will add an explicit limitations paragraph qualifying that generalization beyond the studied distributions remains an open question and is not proven by the current results. revision: partial

  2. Referee: [Abstract] The statement that 'current hardware is already sufficient' and the 10x speedup claim over FlashInfer at 50x sparsity require explicit reporting of the sparsity pattern selection method, prompt formatting controls, and statistical significance across the 20 models to rule out confounds in the experimental setup.

    Authors: We will revise the abstract and methods section to state explicitly: (1) sparsity pattern is top-k selection on per-query attention scores, (2) prompt formatting follows the standard templates released with each benchmark, and (3) all reported speedups and accuracy numbers are means with standard deviation across the 20 models, with the same trend holding in every family. These details already appear in the full experimental appendix; we will surface them in the main text and abstract for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical study and qualitative dimensionality argument

full rationale

The paper supports its position via an extensive empirical evaluation on 20 models across tasks and a qualitative argument that attention is lossy because a query projects O(N) information into dimension d << N. Neither element reduces to a self-definition, fitted parameter renamed as prediction, or self-citation chain. The robustness observation is reported as measured data, not derived from quantities defined in terms of the target result. No equations or ansatzes are smuggled via prior self-work in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The position rests on the untested assumption that inference-time sparsity patterns can be applied without retraining and that the observed robustness generalizes beyond the tested regimes.

axioms (2)
  • domain assumption: A query token projects O(N) context information into a hidden space of dimension d << N, rendering dense attention inherently lossy.
    Presented as the first strand of evidence in the abstract.
  • domain assumption: Robustness to inference-time sparsity does not require any model retraining or architectural change.
    Stated explicitly when describing the study of current LLMs.
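
The first assumption can be illustrated numerically. The attention output is V^T a for a weight vector a of length N, and a map from N dimensions to d < N dimensions cannot be injective, so distinct weight vectors can give the same output. The sketch below constructs such a pair; the sizes, the random V, and the extra ones column (added so that both vectors sum to one) are assumptions made for this illustration and are not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 16                              # context length N, head dimension d << N
V = rng.standard_normal((N, d))             # value matrix, one row per context token

# Augment V with a ones column so that null-space directions also sum to zero.
# This keeps the perturbed weights a valid distribution (illustrative choice).
A = np.hstack([V, np.ones((N, 1))]).T       # shape (d + 1, N)
_, _, Vh = np.linalg.svd(A, full_matrices=True)
z = Vh[-1]                                  # unit vector with A @ z ~ 0, since rank(A) <= d + 1 < N

a = np.full(N, 1.0 / N)                     # uniform attention weights
eps = 0.5 / (N * np.abs(z).max())           # small enough that every weight stays positive
b = a + eps * z                             # a different weight vector, still summing to one

print("weights differ by:", np.abs(a - b).max())
print("outputs differ by:", np.abs(V.T @ a - V.T @ b).max())
```

The two weight vectors differ, yet their attention outputs agree up to floating-point error. This shows only that the map from weights to output loses information; as the referee report notes, it does not by itself predict task accuracy under token dropping.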

pith-pipeline@v0.9.1-grok · 5842 in / 1312 out tokens · 38273 ms · 2026-06-30T15:55:46.672656+00:00 · methodology


Appendix A Proofs

A.1 Proof of Theorem 1

Proof. Consider the linear map $T : \mathbb{R}^N \to \mathbb{R}^d$ defined by $T(a) := V^\top a$. Since $V^\top \in \mathbb{R}^{d \times N}$, we have $\operatorname{rank}(T) \le d$. Let $\mathcal{N}(T) := \{ z \in \mathbb{R}^N : V^\top z = 0 \}$ be the null space of $T$. By ...