pith. machine review for the scientific record.

arxiv: 2605.12110 · v1 · submitted 2026-05-12 · 💻 cs.DC

Recognition: no theorem link

AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:38 UTC · model grok-4.3

classification 💻 cs.DC
keywords sparse attention · block sparsity · long context · KV cache · adaptive allocation · GPU kernels · quantization · inference optimization

The pith

Attention heads vary in block-size sensitivity, so assigning different block sizes per head raises sparse-attention accuracy by up to 5.43 percent while leaving throughput unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models face a memory bottleneck when loading the full KV cache for attention over long contexts. Prior block-sparse methods divide the cache into fixed-size blocks and skip low-importance ones, yet they apply the same block size to every attention head. AB-Sparse shows this uniform choice is wasteful because heads differ markedly in how much accuracy they lose when blocks are made coarser. The method measures each head's sensitivity with a lightweight rule, assigns smaller blocks only to sensitive heads, quantizes block centroids losslessly to control memory, and runs the variable blocks with custom GPU kernels. This produces up to 5.43 percent higher accuracy than uniform-block baselines on long-context tasks while preserving the original throughput.
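
To make the mechanism concrete, here is a minimal sketch of the sensitivity-then-allocate idea; it is an illustration, not the paper's actual rule. It assumes a per-head recall probe (`block_recall`) that checks how many of the true top-k keys survive when block importance is estimated from per-block mean-key centroids, and a greedy assignment (`assign_block_sizes`) that lets insensitive heads keep coarser blocks. The candidate sizes (16, 32, 64) mirror Figure 3, while the 0.95 tolerance is an assumed threshold.

```python
import torch

def block_recall(q, k, block_size, topk_tokens=64):
    """Fraction of the true top-k keys recovered when block importance is
    estimated from per-block mean-key centroids (illustrative metric)."""
    n = k.shape[0] - k.shape[0] % block_size        # drop the ragged tail for simplicity
    k = k[:n]
    true_top = set((q @ k.T).topk(topk_tokens).indices.tolist())

    centroids = k.view(-1, block_size, k.shape[-1]).mean(dim=1)  # one centroid per block
    keep = min(max(1, topk_tokens // block_size), centroids.shape[0])
    kept_blocks = (q @ centroids.T).topk(keep).indices.tolist()
    kept_tokens = {b * block_size + i for b in kept_blocks for i in range(block_size)}
    return len(true_top & kept_tokens) / topk_tokens

def assign_block_sizes(per_head_qk, sizes=(16, 32, 64), tol=0.95):
    """Give each head the largest block size whose recall stays within `tol`
    of its recall at the finest size (a stand-in for the paper's allocation rule)."""
    assignment = []
    for q, k in per_head_qk:                         # one (query, keys) pair per head
        base = block_recall(q, k, sizes[0])
        chosen = sizes[0]
        for size in sizes[1:]:
            if base > 0 and block_recall(q, k, size) >= tol * base:
                chosen = size                        # insensitive head: coarser blocks are fine
            else:
                break                                # sensitive head: stop coarsening
        assignment.append(chosen)
    return assignment
```

Calibration of this kind would run offline; per Figure 6, the paper's assignments are calibrated solely on Wikipedia and then reused across RULER tasks.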

Core claim

AB-Sparse is a training-free framework that allocates adaptive block sizes across attention heads according to their measured sensitivity to granularity. It pairs this allocation with lossless block-centroid quantization to offset the memory increase and supplies custom GPU kernels that execute variable block sizes efficiently. On long-context inference benchmarks the resulting system improves accuracy by as much as 5.43 percent over existing fixed-block sparse attention methods while incurring no throughput penalty.
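
For the decode path this claim describes, a hedged single-head, single-query sketch in plain PyTorch follows: score the query against per-block centroids, keep only the top-scoring blocks, and run exact attention over their tokens, with the block size free to differ per head. The paper fuses this into custom GPU kernels and stores centroids quantized; `keep_blocks` and the helper name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_block_attention(q, k_cache, v_cache, block_size, keep_blocks=8):
    """One head, one decode step. q: (d,), k_cache and v_cache: (n, d)."""
    d = q.shape[-1]
    n = k_cache.shape[0] - k_cache.shape[0] % block_size
    k_blocks = k_cache[:n].view(-1, block_size, d)
    v_blocks = v_cache[:n].view(-1, block_size, d)

    centroids = k_blocks.mean(dim=1)                  # block representation
    importance = centroids @ q                        # dot-product estimation per block
    idx = importance.topk(min(keep_blocks, importance.shape[0])).indices

    k_sel = k_blocks[idx].reshape(-1, d)              # gather only the selected blocks
    v_sel = v_blocks[idx].reshape(-1, d)
    attn = F.softmax((k_sel @ q) / d ** 0.5, dim=0)   # exact attention over kept tokens
    return attn @ v_sel

# Per-head call with per-head block sizes, e.g. from a sensitivity-based allocation:
# out = torch.stack([sparse_block_attention(q[h], K[h], V[h], sizes[h]) for h in range(num_heads)])
```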

What carries the argument

Per-head adaptive block-size allocation driven by a training-free sensitivity metric, together with block-centroid quantization and variable-block-size GPU kernels.

Load-bearing premise

Differences in attention-head sensitivity to block granularity are stable enough that a training-free measurement rule can identify useful allocations that hold across different inputs and tasks.

What would settle it

If a new long-context benchmark or model shows that the adaptive allocation yields less than one percent accuracy gain or reduces accuracy compared with the best fixed block size, the premise that per-head differences are reliably exploitable would be falsified.

Figures

Figures reproduced from arXiv: 2605.12110 by Chen Chen, Di Liu, Han Zhao, Mingliang Gong, Minyi Guo, Quan Chen, Ruitian Wang, Yongjie Yuan, Yu Feng.

Figure 1
Figure 1: Qualitative comparison of various sparse attention paradigms (block representation, query–centroid dot-product estimation, importance scoring, and top-K selection for sparse attention).
Figure 3
Figure 3: Normalized recall curves across block sizes, where normalization is performed with respect to the recall at block size 16. Insensitive heads maintain near-perfect normalized recall across all block sizes, while sensitive heads degrade sharply as block size increases. Panels: (a) Llama-3.1-8B, (b) Qwen3-8B; head-versus-layer heatmaps at block sizes 16, 32, and 64.
Figure 5
Figure 5: Architecture of AB-Sparse. Adaptive block size allocation entails design challenges in three aspects of the practical inference system. First, adaptivity requires a block size assignment for each attention head; dynamically adjusting assignments at runtime is prohibitively expensive, as it requires recomputing centroids over all key vectors. Second, assigning smaller blocks to sensitive heads significantly …
Figure 6
Figure 6: Recall comparison between adaptive and uniform block size. The adaptive assignments are calibrated solely on Wikipedia [27]. Despite this, they consistently outperform uniform block size across all RULER [28] tasks.
Figure 8
Figure 8: Top-K page recall across layers on Llama-3.1-8B under different quantization bit widths and strategies. INT4 asymmetric per-channel quantization consistently maintains recall above 0.9 across all layers.
Figure 10
Figure 10: Decoding attention latency (ms) across three models with varying context lengths on A100.
Figure 11
Figure 11: Throughput (tokens/s) with 64K context length and varying batch sizes on Llama-3.1-8B. (Accompanying table, accuracy %: AIME24 — Quest 20.0, AB-Sparse 23.3; AMC23 — 47.5, 60.0; MATH500 — 74.0, 76.0; Avg. — 47.2, 53.1.)
Figure 13
Figure 13: RULER accuracy (%) under different centroid precisions across two models. INT4 quantization achieves accuracy comparable to the unquantized BF16 baseline.
read the original abstract

As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods that partition the KV cache into fixed-size blocks and selectively compute attention over those blocks exhibiting high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads exhibit widely varying sensitivity to block granularity, and uniformity leads to suboptimal accuracy. We present AB-Sparse, a training-free algorithm-system co-designed framework that improves accuracy while preserving throughput. AB-Sparse introduces lightweight adaptive block size allocation across attention heads to improve accuracy. To compensate for the additional memory overhead, it further employs lossless block centroid quantization. In addition, custom GPU kernels are developed to support efficient execution with variable block sizes. Evaluation results demonstrate that AB-Sparse achieves an accuracy improvement of up to 5.43% over existing block sparse attention baselines without throughput overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AB-Sparse, a training-free algorithm-system co-design for block-sparse attention in long-context LLMs. It claims that attention heads exhibit varying sensitivity to block granularity (making uniform block sizes suboptimal), introduces per-head adaptive block-size allocation, lossless block-centroid quantization to offset memory costs, and custom GPU kernels for variable block sizes. The central empirical claim is an accuracy improvement of up to 5.43% over existing block-sparse baselines with no throughput overhead.

Significance. If the adaptive allocation rule generalizes, the approach could meaningfully improve the accuracy-throughput tradeoff in sparse attention methods by relaxing the homogeneity assumption. The training-free property and explicit system co-design (quantization plus kernels) are practical strengths that could aid deployment. However, the assessed significance is tempered by the absence of quantitative stability metrics or cross-task validation for the sensitivity-based rule.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (analysis of head sensitivity): the manuscript states that 'analysis revealed varying head sensitivity' and that uniformity is suboptimal, but provides no description of the measurement method, sensitivity metric, data used, or quantitative stability across inputs. This is load-bearing for the adaptive allocation rule and the claim that it delivers consistent gains without per-deployment retuning.
  2. [§4] §4 (evaluation): the reported 5.43% accuracy gain is presented without details on exact baselines, number of runs, statistical significance testing, or cross-task/model validation. This directly affects assessment of the central claim, especially given the skeptic concern that optimal block sizes may shift with task (e.g., retrieval vs. summarization) or model scale.
minor comments (1)
  1. [Abstract, §3.2] The abstract and method description could more explicitly state the precise definition of 'block centroid quantization' and how losslessness is verified.
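
One plausible reading, consistent with Figures 8 and 13, is asymmetric per-channel INT4 quantization of the block centroids, with "lossless" meaning no measurable recall or accuracy loss relative to BF16 rather than bit-exact reconstruction. The sketch below encodes that guess; it is not the paper's confirmed definition.

```python
import torch

def quantize_centroids_int4(centroids):
    """Asymmetric per-channel INT4 quantization of block centroids.
    centroids: (n_blocks, d); each channel (column) gets its own scale and zero point.
    Assumed scheme inferred from the figure captions, not the paper's spec."""
    qmin, qmax = 0, 15                                        # unsigned 4-bit range
    cmin = centroids.min(dim=0).values                        # per-channel minimum
    cmax = centroids.max(dim=0).values
    scale = (cmax - cmin).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (-cmin / scale).round().clamp(qmin, qmax)
    q = (centroids / scale + zero_point).round().clamp(qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize_centroids(q, scale, zero_point):
    return (q.float() - zero_point) * scale

# "Lossless" would then be verified downstream: top-K block recall and task accuracy
# computed with dequantized centroids should match the BF16 centroids (cf. Figures 8 and 13).
```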

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the sensitivity analysis and evaluation details require expansion for clarity and reproducibility. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (analysis of head sensitivity): the manuscript states that 'analysis revealed varying head sensitivity' and that uniformity is suboptimal, but provides no description of the measurement method, sensitivity metric, data used, or quantitative stability across inputs. This is load-bearing for the adaptive allocation rule and the claim that it delivers consistent gains without per-deployment retuning.

    Authors: We acknowledge that §3 currently lacks sufficient detail on the head sensitivity analysis. The analysis measures per-head accuracy sensitivity by comparing performance under uniform block sizes versus per-head adaptive sizes, using accuracy drop as the metric on long-context benchmarks. We will revise §3 to fully describe the sensitivity metric, the specific data and inputs used for the analysis, and add quantitative stability results (e.g., variance across multiple sequences) to show the allocation rule generalizes without retuning. revision: yes

  2. Referee: [§4] §4 (evaluation): the reported 5.43% accuracy gain is presented without details on exact baselines, number of runs, statistical significance testing, or cross-task/model validation. This directly affects assessment of the central claim, especially given the skeptic concern that optimal block sizes may shift with task (e.g., retrieval vs. summarization) or model scale.

    Authors: We agree that §4 requires additional experimental details. We will expand the section to specify the exact baselines and their configurations, report results over multiple runs with standard deviations and statistical significance tests, and include further cross-task validation on retrieval and summarization tasks along with results on additional model scales to address generalizability concerns. revision: yes
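
For concreteness, one shape that reporting could take is a paired bootstrap over per-example scores between AB-Sparse and a fixed-block baseline, reported alongside mean ± std over repeated runs; the function name, resample count, and score arrays below are illustrative, not the authors' protocol.

```python
import numpy as np

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap over per-example scores: estimates how often method A's
    mean advantage over method B vanishes under resampling (one-sided)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    return float((diffs[idx].mean(axis=1) <= 0.0).mean())

# e.g. p = paired_bootstrap_p(ab_sparse_per_example_acc, fixed_block_per_example_acc)
```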

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

full rationale

The paper performs an empirical analysis of per-head sensitivity to block granularity, then introduces a training-free adaptive allocation rule plus supporting quantization and kernels. The reported accuracy gains (up to 5.43%) are measured outcomes on external benchmarks rather than quantities defined by the allocation rule itself. No equations reduce the final result to its inputs by construction, no load-bearing self-citations close the chain, and the central claim rests on independently verifiable system-level improvements rather than renaming or fitting. The approach is therefore non-circular under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the standard assumption that attention is dominated by a small token subset plus the paper-specific claim that head sensitivities differ enough to justify adaptive sizing.

axioms (2)
  • domain assumption Attention computation is dominated by a small subset of tokens
    Cited from previous work on sparse attention
  • domain assumption Attention heads exhibit widely varying sensitivity to block granularity
    Stated as revealed by the paper's analysis

pith-pipeline@v0.9.0 · 5509 in / 1276 out tokens · 64827 ms · 2026-05-13T04:38:06.864398+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 6 internal anchors

  1. [1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  2. [2] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
  3. [3] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
  4. [4] Zukang Yang, Zixuan Zhu, and Jennifer Zhu. CuriousLLM: Elevating multi-document question answering with LLM-enhanced knowledge graph reasoning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 274–286, 2025.
  5. [5] Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, et al. RepoAgent: An LLM-powered open-source framework for repository-level code documentation generation. arXiv preprint arXiv:2402.16667, 2024.
  6. [6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  7. [7] Meta. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, 2024.
  8. [8] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  9. [9] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023.
  10. [10] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. InfiniGen: Efficient generative inference of large language models with dynamic KV cache management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 155–172, 2024.
  11. [11] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. ClusterKV: Manipulating LLM KV cache in semantic space for recallable compression. arXiv preprint arXiv:2412.03213, 2024.
  12. [12] Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, et al. RetroInfer: A vector-storage approach for scalable long-context LLM inference. arXiv preprint arXiv:2505.02922, 2025.
  13. [13] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.
  14. [14] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H. Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, et al. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.
  15. [15] Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. Proceedings of Machine Learning and Systems, 7, 2025.
  16. [16] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774, 2024.
  17. [17] Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. ArkVale: Efficient generative LLM inference with recallable key-value eviction. Advances in Neural Information Processing Systems, 37:113134–113155, 2024.
  18. [18] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024.
  19. [19] Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574, 2024.
  20. [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
  21. [21] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023.
  22. [22] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024.
  23. [23] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, et al. RetrievalAttention: Accelerating long-context LLM inference via vector retrieval. arXiv preprint arXiv:2409.10516, 2024.
  24. [24] Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, et al. MagicPIG: LSH sampling for efficient LLM generation. arXiv preprint arXiv:2410.16179, 2024.
  25. [25] Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. MoBA: Mixture of block attention for long-context LLMs. arXiv preprint arXiv:2502.13189, 2025.
  26. [26] Qwen. Qwen3-8B. https://huggingface.co/Qwen/Qwen3-8B, 2025.
  27. [27] Wikimedia. Wikipedia. https://huggingface.co/datasets/wikimedia/wikipedia, 2025.
  28. [28] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
  29. [29] Qwen. Qwen3-32B. https://huggingface.co/Qwen/Qwen3-32B, 2025.
  30. [30] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.
  31. [31] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
  32. [32] Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke, Minyi Guo, and Jieru Zhao. FreeKV: Boosting KV cache retrieval for efficient LLM inference. arXiv preprint arXiv:2505.13109, 2025.
  33. [33] Fang Wu, Congming Gao, Weixi Zhu, and Jiwu Shu. PRKV: Page restruct KV cache for high accuracy and efficiency LLM generation, 2026.
  34. [34] Art of Problem Solving. AIME problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2024.
  35. [35] Art of Problem Solving. AMC problems and solutions. https://artofproblemsolving.com/wiki/index.php?title=AMC_Problems_and_Solutions, 2023.
  36. [36] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023.
  37. [37] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.