Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Pith reviewed 2026-05-19 20:47 UTC · model grok-4.3
pith:5BWLUGJF
The pith
Full-attention LLMs already contain the structure to become highly sparse models after only a few hundred training steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. The approach keeps the full KV cache solely for a small set of retrieval heads, adds a 16-dimensional indexer to locate relevant tokens, and applies dynamic top-p selection because the useful token count depends on the query. These changes allow the model to reach high sparsity after just a few hundred training steps while matching full-attention accuracy on long-context and reasoning benchmarks.
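To illustrate why a query-dependent budget differs from a fixed top-k, here is a minimal sketch of top-p token selection over one query's attention scores. The function names, the threshold p=0.9, and the toy score vectors are assumptions for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p=0.9):
    """Return indices of the smallest token set whose attention mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)

# Hypothetical scores: a peaked query puts almost all mass on one token,
# a diffuse query spreads it evenly over all eight.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
diffuse = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

peaked_kept = top_p_select(peaked, p=0.9)
diffuse_kept = top_p_select(diffuse, p=0.9)
print(len(peaked_kept), len(diffuse_kept))  # prints "1 8"
```

With the same threshold, the peaked query keeps one token and the diffuse query keeps all eight, whereas a fixed top-k would spend the same budget on both.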
What carries the argument
RTPurbo, which preserves the complete KV cache only for retrieval heads and adds a lightweight 16-dimensional token indexer together with query-dependent top-p selection.
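As a rough picture of what a low-dimensional token indexer does, the toy below scores cached tokens using only a 16-dimensional projection of the query and keys. It assumes the premise it illustrates: the retrieval signal is planted in the first 16 coordinates and the rest is noise. The signature code, the read-out projection, and all sizes other than 16 are hypothetical stand-ins, not the paper's learned indexer.

```python
import random

HEAD_DIM, INDEX_DIM, NUM_TOKENS = 64, 16, 32  # 16 follows the paper; the rest are illustrative

def signature(token_id):
    """Toy +/-1 code built from the token id's five low bits, repeated to fill 16 dims."""
    return [1.0 if (token_id >> (j % 5)) & 1 else -1.0 for j in range(INDEX_DIM)]

def noise():
    """Gaussian filler for the coordinates outside the indexer subspace."""
    return [random.gauss(0.0, 1.0) for _ in range(HEAD_DIM - INDEX_DIM)]

def project(vec, proj):
    """Map a full-width vector into the indexer space (one dot product per row)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in proj]

def indexer_scores(query, keys, proj):
    """Score each cached token by the dot product of the projected query and key."""
    q_low = project(query, proj)
    return [sum(a * b for a, b in zip(q_low, project(key, proj))) for key in keys]

random.seed(0)

# Toy premise: the retrieval signal occupies the first 16 coordinates.
keys = [signature(t) + noise() for t in range(NUM_TOKENS)]
query = signature(7) + noise()  # the query is looking for token 7

# Hypothetical stand-in for a learned indexer: read out the first 16 coordinates.
proj = [[1.0 if c == r else 0.0 for c in range(HEAD_DIM)] for r in range(INDEX_DIM)]

scores = indexer_scores(query, keys, proj)
best = max(range(NUM_TOKENS), key=lambda i: scores[i])
print(best, scores[best])  # prints "7 16.0"
```

The indexer scores each token with 16 multiply-adds instead of 64. Whether real attention heads concentrate their retrieval signal this way is exactly the premise the paper has to establish.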
If this is right
- Up to 9.36× faster prefill at a 1M-token context length
- About 2.01× faster decoding with near-lossless accuracy
- Sparsification completed in only a few hundred training steps instead of full native sparse pretraining
- No need for heuristic token eviction or expensive sparse-from-scratch training
- Strong sparse inference obtained directly from any standard full-attention checkpoint
Where Pith is reading between the lines
- The same head specialization and low-dimensional retrieval pattern could appear in other transformer variants, allowing similar quick sparsification outside language models.
- Hardware designs that optimize for 16-dimensional indexing might become practical if the pattern holds across model scales.
- Extending the dynamic top-p rule to multimodal or retrieval-augmented settings could further cut memory use without extra training.
Load-bearing premise
Only a small subset of attention heads needs full long-context processing and long-range retrieval lives mostly inside a low-dimensional subspace.
What would settle it
Apply the same few-hundred-step adaptation to a model in which every attention head has been forced to participate equally in long-range retrieval, then measure whether accuracy on 1M-context benchmarks falls by more than a few percent.
Original abstract
Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
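The per-head routing described in the abstract can be sketched as follows: heads classified as retrieval heads attend to the whole causal prefix, while the remaining heads attend only to positions chosen by the indexer. The head set, the selected positions, and the choice to share one selection across all non-retrieval heads are assumptions made for this sketch.

```python
def tokens_for_head(head_id, retrieval_heads, query_pos, selected):
    """Return the token positions one head attends to for a single query position.

    Retrieval heads see the whole causal prefix; other heads see only the
    indexer-selected positions that fall inside that prefix.
    """
    if head_id in retrieval_heads:
        return list(range(query_pos + 1))
    return sorted(t for t in selected if t <= query_pos)

NUM_HEADS = 4
retrieval_heads = {2}        # hypothetical: 1 of 4 heads classified as retrieval
selected = {0, 5, 9, 12}     # hypothetical indexer output for this query
query_pos = 10

plan = {h: tokens_for_head(h, retrieval_heads, query_pos, selected)
        for h in range(NUM_HEADS)}

attended = sum(len(tokens) for tokens in plan.values())
full = NUM_HEADS * (query_pos + 1)
print(attended, full)  # prints "20 44"
```

In this toy case the query touches 20 query-key pairs instead of 44, and position 12 is dropped because it lies after the query.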
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that full-attention LLMs are intrinsically sparse and can be converted into highly sparse models via RTPurbo with only a few hundred training steps. RTPurbo retains full KV cache solely for a small subset of retrieval heads, uses a lightweight 16-dimensional indexer to retrieve relevant tokens for sparse attention on the remaining heads, and applies dynamic top-p token selection. Experiments reportedly show near-lossless accuracy on long-context benchmarks and reasoning tasks, with speedups up to 9.36× prefill at 1M context and 2.01× decode.
Significance. If the empirical results and the low-dimensional retrieval assumption hold, the work would be significant for enabling efficient long-context inference directly from standard full-attention models without costly native sparse pretraining. The minimal adaptation budget and the reported speedups on 1M-context settings represent a practical advance. The grounding in three concrete observations about head specialization, subspace structure, and query-dependent budgets is a strength, as is the focus on preserving accuracy rather than heuristic eviction.
major comments (2)
- §4.2 (Token Indexer): the central claim that long-range retrieval is governed primarily by a low-dimensional subspace (enabling a 16-dim indexer to support near-lossless dynamic top-p attention) lacks supporting ablations. No quantitative comparison is provided between 16 dimensions and higher-dimensional alternatives (32 or 64) or against full attention on multi-hop or precise retrieval tasks at 1M context. Because RTPurbo routes non-retrieval heads exclusively through this indexer while keeping full KV only for classified heads, insufficient subspace capacity would produce retrieval misses that directly undermine the near-lossless accuracy claim.
- §3.1 and §5.1 (Retrieval Head Identification): the procedure for classifying the small subset of heads that require full long-context processing is load-bearing for the efficiency-accuracy trade-off, yet the manuscript provides no sensitivity analysis on the classification threshold or on how misclassification of even a few heads would affect the reported speedups and accuracy. If the classification is unstable across tasks, the claimed intrinsic sparsity would not generalize.
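One way the dimensionality ablation requested in the first comment could be framed is as top-k recall of a reduced-dimension scorer against full-dimension scores. The sketch below uses random Gaussian vectors and a simple coordinate-prefix projection, both assumptions. Random data has no low-dimensional structure, so it only shows the shape of the measurement. Real query and key activations, and the paper's learned indexer, would have to be substituted to say anything about RTPurbo.

```python
import random

def dot_prefix(a, b, dims):
    """Dot product restricted to the first `dims` coordinates."""
    return sum(x * y for x, y in zip(a[:dims], b[:dims]))

def top_k(scores, k):
    """Indices of the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def recall_at_k(reference, approx, k):
    """Fraction of the reference top-k that the approximate scores also rank in their top-k."""
    return len(top_k(reference, k) & top_k(approx, k)) / k

random.seed(0)
HEAD_DIM, NUM_TOKENS, K = 128, 256, 8  # illustrative sizes, not from the paper

keys = [[random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)] for _ in range(NUM_TOKENS)]
query = [random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)]

# Full-dimension scores act as the ground truth for "relevant" tokens.
reference = [dot_prefix(query, key, HEAD_DIM) for key in keys]

recalls = {}
for dims in (16, 32, 64, HEAD_DIM):
    approx = [dot_prefix(query, key, dims) for key in keys]
    recalls[dims] = recall_at_k(reference, approx, K)

print(recalls)
```

Recall at the full dimension is 1.0 by construction. The informative quantity is how quickly recall approaches that value as the indexer dimension grows on real activations.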
minor comments (2)
- Table 2: the speedup numbers at 1M context would benefit from explicit reporting of the effective sparsity ratio achieved by the dynamic top-p mechanism rather than only the final wall-clock figures.
- Figure 4: axis labels and legend entries for the baseline methods are difficult to distinguish at the printed size; consider adding a supplementary table with exact values.
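The effective sparsity ratio mentioned in the Table 2 comment could be reported with a calculation along these lines. The definition used here (one minus attended query-key pairs over the pairs full attention would compute) and the example counts are assumptions, not figures from the paper.

```python
def effective_sparsity(selected_counts, context_lens):
    """Return 1 - attended pairs / pairs that full attention would compute.

    selected_counts[i] is the number of tokens kept for query i,
    context_lens[i] is the number of tokens visible to query i.
    """
    if len(selected_counts) != len(context_lens):
        raise ValueError("one selected count is required per query")
    attended = sum(selected_counts)
    full = sum(context_lens)
    return 1.0 - attended / full

# Hypothetical: three decode steps at a 1,000-token context keeping 50, 10, and 240 tokens.
ratio = effective_sparsity([50, 10, 240], [1000, 1000, 1000])
print(round(ratio, 3))  # prints "0.9"
```

Reporting this ratio next to wall-clock speedups would separate how much work was skipped from how efficiently the kernel exploits the skipped work.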
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our empirical claims. We address each major comment below and commit to revisions that strengthen the supporting evidence without altering the core contributions.
Point-by-point responses
- Referee: §4.2 (Token Indexer): the central claim that long-range retrieval is governed primarily by a low-dimensional subspace (enabling a 16-dim indexer to support near-lossless dynamic top-p attention) lacks supporting ablations. No quantitative comparison is provided between 16 dimensions and higher-dimensional alternatives (32 or 64) or against full attention on multi-hop or precise retrieval tasks at 1M context. Because RTPurbo routes non-retrieval heads exclusively through this indexer while keeping full KV only for classified heads, insufficient subspace capacity would produce retrieval misses that directly undermine the near-lossless accuracy claim.
  Authors: We agree that explicit ablations on indexer dimensionality would provide stronger quantitative support for the low-dimensional subspace observation. The manuscript's main results already show that the 16-dimensional indexer, when paired with retrieval-head classification and dynamic top-p selection, yields near-lossless accuracy on the evaluated long-context benchmarks at 1M context. To directly address concerns about higher-dimensional alternatives and multi-hop/precise retrieval, we will add new experiments in the revised version comparing 16-, 32-, and 64-dimensional indexers, including targeted evaluations on multi-hop reasoning tasks. These additions will quantify any retrieval misses and confirm the sufficiency of the chosen subspace dimension.
  Revision: yes
- Referee: §3.1 and §5.1 (Retrieval Head Identification): the procedure for classifying the small subset of heads that require full long-context processing is load-bearing for the efficiency-accuracy trade-off, yet the manuscript provides no sensitivity analysis on the classification threshold or on how misclassification of even a few heads would affect the reported speedups and accuracy. If the classification is unstable across tasks, the claimed intrinsic sparsity would not generalize.
  Authors: The referee is correct that a sensitivity analysis on the classification threshold would better demonstrate robustness. Section 3.1 grounds the classification in the observed head specialization from the three core observations, and the reported results reflect a fixed but effective threshold that preserves accuracy while enabling the claimed speedups. In the revision we will include a sensitivity study varying the threshold, measuring its effects on both accuracy and speedup across tasks, and assessing stability of the identified retrieval heads. This will provide direct evidence that the intrinsic sparsity generalizes and that misclassification of a small number of heads does not materially degrade the efficiency-accuracy trade-off.
  Revision: yes
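A threshold sensitivity study of the kind promised above might start from a sweep like the one below, which counts the heads classified as retrieval heads at each threshold and measures how well the sets agree across two tasks. The per-head "retrieval scores", the thresholds, and the Jaccard stability measure are all hypothetical. The paper's actual classification procedure is not described on this page.

```python
def classify(head_scores, threshold):
    """Heads whose retrieval score meets the threshold are kept as retrieval heads."""
    return {head for head, score in head_scores.items() if score >= threshold}

def jaccard(a, b):
    """Overlap of two head sets; 1.0 means identical (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical per-head retrieval scores measured on two different tasks.
task_a = {0: 0.91, 1: 0.12, 2: 0.55, 3: 0.08, 4: 0.47, 5: 0.03}
task_b = {0: 0.88, 1: 0.15, 2: 0.41, 3: 0.05, 4: 0.52, 5: 0.02}

sweep = {}
for threshold in (0.3, 0.5, 0.7):
    heads_a = classify(task_a, threshold)
    heads_b = classify(task_b, threshold)
    sweep[threshold] = (len(heads_a), len(heads_b), jaccard(heads_a, heads_b))

for threshold, (count_a, count_b, overlap) in sweep.items():
    print(threshold, count_a, count_b, round(overlap, 3))
```

In this made-up data the two tasks agree at thresholds 0.3 and 0.7 but pick different second heads at 0.5, which is the kind of instability the referee asks the authors to rule out.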
Circularity Check
No significant circularity; derivation grounded in independent empirical observations
Full rationale
The paper's central claim, that full-attention models are intrinsically sparse and can be sparsified via minimal adaptation, rests on three explicitly stated observations about head specialization, low-dimensional retrieval subspaces, and query-dependent token budgets. These observations are presented as empirical findings derived from analysis of existing full-attention models rather than fitted parameters or self-referential definitions that encode the target result. The RTPurbo method is then constructed on top of these observations, with experimental validation on benchmarks serving as an external check rather than tautological confirmation. No self-citation chains, ansatz smuggling, or renaming of known results appear as load-bearing steps in the provided derivation outline. The approach remains self-contained against external benchmarks without reducing the claimed efficiency gains to a direct fit or definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: only a small subset of attention heads truly requires full long-context processing
- domain assumption: long-range retrieval is governed primarily by a low-dimensional subspace
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.