Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arxiv: 2605.16928 · v1 · pith:5BWLUGJF · submitted 2026-05-16 · cs.CL · cs.AI

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Lan Tao, Lin Qu, Yuan Yao, Xiaoxing Ma

Pith reviewed 2026-05-19 20:47 UTC · model grok-4.3

keywords: sparse attention, long-context inference, large language models, efficient inference, attention heads, KV cache, token retrieval, model adaptation

The pith

Full-attention LLMs already contain the structure to become highly sparse models after only a few hundred training steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard full-attention large language models contain built-in sparsity that can be activated with very little extra training. It finds that only certain attention heads need the full long context, that relevant tokens can be located through a low-dimensional subspace, and that the number of useful tokens varies with each query. From these patterns the authors build a method that keeps the full key-value cache only for the retrieval heads and uses a lightweight indexer to select the relevant tokens for sparse attention. If correct, this removes the need for costly native sparse training or heuristic token eviction while still delivering large speed gains on long inputs.

Core claim

Full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. The approach keeps the full KV cache solely for a small set of retrieval heads, adds a 16-dimensional indexer to locate relevant tokens, and applies dynamic top-p selection because the useful token count depends on the query. These changes allow the model to reach high sparsity after just a few hundred training steps while matching full-attention accuracy on long-context and reasoning benchmarks.
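The difference between a query-dependent top-p budget and a fixed top-k budget can be illustrated with a minimal sketch. It assumes selection operates on softmax-normalized relevance scores; the function names and the toy scores are illustrative and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p):
    """Return indices of the smallest token set whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)

# Hypothetical scores: one query with a single dominant token, one with diffuse relevance.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 1.0]

peaked_kept = top_p_select(peaked, 0.9)
flat_kept = top_p_select(flat, 0.9)
print(len(peaked_kept), len(flat_kept))
```

With the same threshold p, the peaked query keeps a single token while the diffuse query keeps most of the context. A fixed top-k would either waste budget on the first query or truncate the second.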

What carries the argument

RTPurbo, which preserves the complete KV cache only for retrieval heads and adds a lightweight 16-dimensional token indexer together with query-dependent top-p selection.

If this is right

Up to 9.36 times faster prefill at one million token context length
Roughly 2 times faster decoding while keeping near-lossless accuracy
Sparsification completed in only a few hundred training steps instead of full native sparse pretraining
No need for heuristic token eviction or expensive sparse-from-scratch training
Strong sparse inference obtained directly from a standard full-attention checkpoint

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same head specialization and low-dimensional retrieval pattern could appear in other transformer variants, allowing similar quick sparsification outside language models.
Hardware designs that optimize for 16-dimensional indexing might become practical if the pattern holds across model scales.
Extending the dynamic top-p rule to multimodal or retrieval-augmented settings could further cut memory use without extra training.

Load-bearing premise

Only a small subset of attention heads needs full long-context processing and long-range retrieval lives mostly inside a low-dimensional subspace.

What would settle it

Apply the same hundred-step adaptation to a model in which every attention head has been forced to participate equally in long-range retrieval and measure whether accuracy on 1M-context benchmarks falls by more than a few percent.

Original abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
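Observation (2) can be pictured as scoring tokens in a small projected space instead of the full head dimension. The sketch below is a toy under stated assumptions: the projection matrix `proj` is hand-written here, whereas the abstract does not say how the 16-dimensional indexer is parameterized or trained, and the dimensions (4 down to 2) are chosen only to keep the example readable.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(vec, proj):
    """Map a d-dim vector to r dims using an r x d projection matrix."""
    return [dot(row, vec) for row in proj]

def indexer_ranking(query, keys, proj):
    """Rank token positions by similarity in the projected (low-dim) space."""
    q_low = project(query, proj)
    scores = [dot(q_low, project(k, proj)) for k in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)

def full_ranking(query, keys):
    """Rank token positions by the full-dimension dot product."""
    scores = [dot(query, k) for k in keys]
    return sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy setup: d = 4, r = 2, projection keeps the first two axes.
proj = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0]]
keys = [[1.0, 0.0, 0.1, 0.0],
        [0.0, 1.0, 0.0, 0.1],
        [0.9, 0.1, 0.0, 0.0]]
query = [1.0, 0.0, 0.2, 0.0]

low_dim_order = indexer_ranking(query, keys, proj)
full_order = full_ranking(query, keys)
print(low_dim_order, full_order)
```

In this toy case most of the query-key similarity lives in the first two axes, so the low-dimensional ranking agrees with the full one. A practical system would presumably cache the projected keys rather than recompute them per query; that detail is an assumption, not something the abstract states.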

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RTPurbo shows full-attention models can be made sparse in a few hundred steps by keeping full KV only on retrieval heads and routing the rest through a 16-dim indexer plus dynamic top-p.

The main point is that this paper gives a concrete way to take an existing full-attention checkpoint and turn it into a sparse inference model with very little extra training. The authors keep the full KV cache only for a small set of retrieval heads, then use a lightweight 16-dimensional indexer to fetch relevant tokens for the other heads via query-dependent top-p selection. The reported speedups are roughly 9x on prefill at 1M context and roughly 2x on decode, with accuracy staying close to the original on the benchmarks they ran.

Referee Report

2 major / 2 minor

Summary. The paper claims that full-attention LLMs are intrinsically sparse and can be converted into highly sparse models via RTPurbo with only a few hundred training steps. RTPurbo retains full KV cache solely for a small subset of retrieval heads, uses a lightweight 16-dimensional indexer to retrieve relevant tokens for sparse attention on the remaining heads, and applies dynamic top-p token selection. Experiments reportedly show near-lossless accuracy on long-context benchmarks and reasoning tasks, with speedups up to 9.36× prefill at 1M context and 2.01× decode.

Significance. If the empirical results and the low-dimensional retrieval assumption hold, the work would be significant for enabling efficient long-context inference directly from standard full-attention models without costly native sparse pretraining. The minimal adaptation budget and the reported speedups on 1M-context settings represent a practical advance. The grounding in three concrete observations about head specialization, subspace structure, and query-dependent budgets is a strength, as is the focus on preserving accuracy rather than heuristic eviction.

major comments (2)

§4.2 (Token Indexer): the central claim that long-range retrieval is governed primarily by a low-dimensional subspace (enabling a 16-dim indexer to support near-lossless dynamic top-p attention) lacks supporting ablations. No quantitative comparison is provided between 16 dimensions and higher-dimensional alternatives (32 or 64) or against full attention on multi-hop or precise retrieval tasks at 1M context. Because RTPurbo routes non-retrieval heads exclusively through this indexer while keeping full KV only for classified heads, insufficient subspace capacity would produce retrieval misses that directly undermine the near-lossless accuracy claim.
§3.1 and §5.1 (Retrieval Head Identification): the procedure for classifying the small subset of heads that require full long-context processing is load-bearing for the efficiency-accuracy trade-off, yet the manuscript provides no sensitivity analysis on the classification threshold or on how misclassification of even a few heads would affect the reported speedups and accuracy. If the classification is unstable across tasks, the claimed intrinsic sparsity would not generalize.

minor comments (2)

Table 2: the speedup numbers at 1M context would benefit from explicit reporting of the effective sparsity ratio achieved by the dynamic top-p mechanism rather than only the final wall-clock figures.
Figure 4: axis labels and legend entries for the baseline methods are difficult to distinguish at the printed size; consider adding a supplementary table with exact values.
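The effective sparsity ratio requested in the Table 2 comment could be reported along the following lines. This is a sketch under an assumed definition (the fraction of visible key positions that sparse attention skips, aggregated over queries); the paper may define the quantity differently, and the counts below are made-up illustration values.

```python
def effective_sparsity(kept_counts, context_lengths):
    """Fraction of visible key positions skipped, aggregated over all queries.

    kept_counts[i]     -- tokens actually attended by query i after top-p selection
    context_lengths[i] -- tokens visible to query i under full causal attention
    """
    if len(kept_counts) != len(context_lengths):
        raise ValueError("one kept count is required per query")
    total_visible = sum(context_lengths)
    if total_visible == 0:
        raise ValueError("no visible tokens")
    return 1.0 - sum(kept_counts) / total_visible

# Hypothetical per-query token budgets produced by a dynamic top-p rule.
kept = [10, 50, 200, 40]
visible = [1000, 1000, 1000, 1000]

ratio = effective_sparsity(kept, visible)
print(f"effective sparsity: {ratio:.3f}")
```

Reporting this number next to the wall-clock speedup separates how much work was skipped from how efficiently the kernels exploit the skipped work.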

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our empirical claims. We address each major comment below and commit to revisions that strengthen the supporting evidence without altering the core contributions.

Referee: §4.2 (Token Indexer): the central claim that long-range retrieval is governed primarily by a low-dimensional subspace (enabling a 16-dim indexer to support near-lossless dynamic top-p attention) lacks supporting ablations. No quantitative comparison is provided between 16 dimensions and higher-dimensional alternatives (32 or 64) or against full attention on multi-hop or precise retrieval tasks at 1M context. Because RTPurbo routes non-retrieval heads exclusively through this indexer while keeping full KV only for classified heads, insufficient subspace capacity would produce retrieval misses that directly undermine the near-lossless accuracy claim.

Authors: We agree that explicit ablations on indexer dimensionality would provide stronger quantitative support for the low-dimensional subspace observation. The manuscript's main results already show that the 16-dimensional indexer, when paired with retrieval-head classification and dynamic top-p selection, yields near-lossless accuracy on the evaluated long-context benchmarks at 1M context. To directly address concerns about higher-dimensional alternatives and multi-hop/precise retrieval, we will add new experiments in the revised version comparing 16-, 32-, and 64-dimensional indexers, including targeted evaluations on multi-hop reasoning tasks. These additions will quantify any retrieval misses and confirm the sufficiency of the chosen subspace dimension.
Revision: yes
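A dimensionality ablation of the kind promised here needs a retrieval-miss metric. One plausible choice is recall of the full-attention top-k under a reduced-dimension scorer, sketched below. The truncation-based scorer is a stand-in assumption for the learned indexer, and the data is random Gaussian noise with no low-dimensional structure, so the printed numbers only exercise the harness and say nothing about the paper's claim.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(scores, k):
    """Indices of the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def recall_at_k(query, keys, r, k):
    """Share of the full-dimension top-k that an r-dim truncated scorer also selects."""
    full = top_k([dot(query, key) for key in keys], k)
    low = top_k([dot(query[:r], key[:r]) for key in keys], k)
    return len(full & low) / k

# Hypothetical sizes; a real ablation would use model queries and keys at long context.
rng = random.Random(0)
d, n_tokens, k = 64, 256, 8
keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tokens)]
query = [rng.gauss(0.0, 1.0) for _ in range(d)]

recalls = {r: recall_at_k(query, keys, r, k) for r in (16, 32, 64)}
for r, value in recalls.items():
    print(f"dim={r:2d} recall@{k}={value:.2f}")
```

On real attention heads, a 16-dimensional curve that already sits near 1.0 would support the subspace observation, while a large gap to 32 or 64 dimensions would confirm the referee's concern.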
Referee: §3.1 and §5.1 (Retrieval Head Identification): the procedure for classifying the small subset of heads that require full long-context processing is load-bearing for the efficiency-accuracy trade-off, yet the manuscript provides no sensitivity analysis on the classification threshold or on how misclassification of even a few heads would affect the reported speedups and accuracy. If the classification is unstable across tasks, the claimed intrinsic sparsity would not generalize.

Authors: The referee is correct that a sensitivity analysis on the classification threshold would better demonstrate robustness. Section 3.1 grounds the classification in the observed head specialization from the three core observations, and the reported results reflect a fixed but effective threshold that preserves accuracy while enabling the claimed speedups. In the revision we will include a sensitivity study varying the threshold, measuring its effects on both accuracy and speedup across tasks, and assessing stability of the identified retrieval heads. This will provide direct evidence that the intrinsic sparsity generalizes and that misclassification of a small number of heads does not materially degrade the efficiency-accuracy trade-off.
Revision: yes
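The threshold sensitivity study described in this response could start from a sweep like the one below. It assumes each head has a scalar retrieval score and that heads at or above a threshold keep the full KV cache; the scoring procedure, the head ids, and every number here are hypothetical, since the page does not reproduce the paper's classification rule.

```python
def classify_heads(head_scores, threshold):
    """Return ids of heads whose score meets the threshold (treated as retrieval heads)."""
    return {head for head, score in head_scores.items() if score >= threshold}

def sweep(head_scores, thresholds):
    """For each threshold: (threshold, retrieval-head count, fraction of heads with full KV)."""
    rows = []
    for t in thresholds:
        heads = classify_heads(head_scores, t)
        rows.append((t, len(heads), len(heads) / len(head_scores)))
    return rows

def jaccard(a, b):
    """Overlap between two head sets; 1.0 means identical classification."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical per-head scores keyed by (layer, head); not values from the paper.
scores_task_a = {(0, 0): 0.05, (0, 1): 0.62, (1, 0): 0.11, (1, 1): 0.48}
scores_task_b = {(0, 0): 0.07, (0, 1): 0.58, (1, 0): 0.35, (1, 1): 0.51}

rows = sweep(scores_task_a, [0.1, 0.3, 0.5])
stability = jaccard(classify_heads(scores_task_a, 0.3),
                    classify_heads(scores_task_b, 0.3))
print(rows, round(stability, 3))
```

The full-KV fraction gives a proxy for the memory side of the trade-off at each threshold, and the Jaccard overlap across tasks is one way to quantify whether the identified head set is stable.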

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in independent empirical observations

The paper's central claim, that full-attention models are intrinsically sparse and can be sparsified via minimal adaptation, rests on three explicitly stated observations about head specialization, low-dimensional retrieval subspaces, and query-dependent token budgets. These observations are presented as empirical findings derived from analysis of existing full-attention models rather than fitted parameters or self-referential definitions that encode the target result. The RTPurbo method is then constructed on top of these observations, with experimental validation on benchmarks serving as an external check rather than tautological confirmation. No self-citation chains, ansatz smuggling, or renaming of known results appear as load-bearing steps in the provided derivation outline. The approach remains self-contained against external benchmarks without reducing the claimed efficiency gains to a direct fit or definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on three observations presented as empirical facts rather than derived quantities; no explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: only a small subset of attention heads truly requires full long-context processing
Stated as observation (1) in the abstract; if false, the selective KV retention strategy collapses.
domain assumption: long-range retrieval is governed primarily by a low-dimensional subspace
Observation (2); justifies the 16-dimensional indexer.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 9 internal anchors

[1] [1]

L ong B ench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for C...

work page doi:10.18653/v1/2024.acl-long.172 2024

[2] [2]

Matharena: Evaluating llms on uncontaminated math competitions, February 2025

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URLhttps://matharena.ai/

work page 2025

[3] [3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Flashattention-2: Faster attention with better parallelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. InThe Twelfth Inter- national Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=mZn2Xyh9Ec

work page 2024

[5] [5]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page arXiv 2024

[7] [7]

Patchscopes: A unifying framework for inspecting hidden representations of language models, 2024

Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models, 2024. URLhttps://arxiv.org/abs/2401. 06102

work page 2024

[8] [8]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong 11 Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie J...

work page doi:10.1038/s41586-025-09422-z 2025

[9] [9]

RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=kIoBbc76Sy

work page 2024

[10] [10]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://op...

[11]

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OfjIlbelrT

[12]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=poE54GOq2l

[13]

Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Yutao Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, and Jiezhong Qiu. MoBA: Mixture of block attention for long-contex...

[14]

Maxwell-Jia. Aime 2024 dataset, 2024. URL https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

[15]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, S...

[16]

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

URL https://arxiv.org/abs/2209.11895

[18]

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=...

[19]

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), February 2024. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.127063. URL https://doi.org/10.1016/j.neucom.2023.127063

[20]

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient KV cache compression through retrieval heads. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tkiZQlL04w

[21]

Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

[22]

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezh...

[23]

Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong Ga...

[24]

Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, and Xipeng Qiu. Prism: Spectral-aware block-sparse attention, 2026. URL https://arxiv.org/abs/2602.08426

[25]

Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, and Julian McAuley. FASA: Frequency-aware sparse attention. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FnSgecCEwg

[26]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. In Proceedings of the 38th International Conference on Neural ...

Curran Associates Inc. ISBN 9798331314385

[28] [29]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

[29] [30]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context LLM inference with retrieval and streaming heads. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=cFu7ze7xUm

[30] [31]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[31] [32]

An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, and Zipeng Zhang. Qwen2.5-1m technical re...

[32] [33]

Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, and Huizi Mao. Blasst: Dynamic blocked attention sparsity via softmax thresholding. ...

[33] [34]

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=74c3Wwk8Tc

[34] [35]

Nicolas Zucchet, Francesco D’Angelo, Andrew Kyle Lampinen, and Stephanie C.Y. Chan. The emergence of sparse attention: impact of data distribution and benefits of repetition. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=jMhRbV47pS

Appendix A Headwise Analysis of Local/Retrieval...