ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

Jie Li; Jiong Lou; Junjie Li

arxiv: 2605.16360 · v1 · pith:RZYYJYXXnew · submitted 2026-05-09 · 💻 cs.LG · cs.AI

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

Junjie Li , Jiong Lou , Jie Li This is my paper

Pith reviewed 2026-05-20 22:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords KV cache pruninglong-context inferenceproxy modelsLLM efficiencyattention mechanismsmodel compressionprefilling acceleration

0 comments

The pith

A lightweight small-model proxy can generate KV cache pruning decisions for a larger LLM fast enough to cut prefilling time substantially while keeping nearly the same accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the forced choice between fast but imprecise KV cache pruning and accurate but slow scoring during long-context LLM inference. It establishes that a smaller model from the same family can compute which cache entries to discard, running in parallel with the main model so the large model never pays the full scoring cost. The design uses a mapper to align features across model sizes and a loss that trains for consistent ranking rather than exact score matching. If this holds, long documents and conversations become practical on hardware with limited memory without retraining the target model or accepting large accuracy losses.

Core claim

ProxyKV offloads importance scoring for KV cache pruning to a lightweight intra-family small-model proxy that runs asynchronously with the large-model target. A HybridAxialMapper disentangles temporal feature extraction from cross-head alignment to bridge architectural differences, while a Multi-Granularity Hybrid Loss trains the proxy to preserve relative ranking consistency instead of exact regression. On Llama-3.1, Qwen-2.5, and Qwen-3 families from 7B to 32B parameters, the method recovers approximately 98.7 percent of KVZip mean accuracy across LongBench, SCBench, and RULER while delivering up to 3.21 times prefilling speedup on Llama-3.1-8B and sustaining gains at 170k-token contexts.

What carries the argument

HybridAxialMapper paired with Multi-Granularity Hybrid Loss, which separates temporal features from head alignment and replaces exact score regression with relative ranking consistency to let small-proxy decisions transfer to the large target.

If this is right

Prefilling for contexts up to 170k tokens runs substantially faster on both single- and dual-GPU setups without retraining the main model.
Accuracy on standard long-context benchmarks remains within a small fraction of high-precision pruning baselines across multiple model families and sizes.
The pruning step can overlap with target-model computation because the proxy executes asynchronously.
The same proxy training recipe applies across 7B-to-32B targets from Llama and Qwen lineages without per-model redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment mapper proves stable, the same small proxy could serve multiple target sizes within a family, reducing the need for separate scoring models.
The ranking-focused loss might let the proxy be trained on shorter sequences and still work at much longer contexts than seen during training.
Hardware schedulers could choose proxy size on the fly according to available compute, trading a bit of accuracy for even lower latency when memory is tight.

Load-bearing premise

Importance scores from the lightweight small-model proxy transfer effectively to the large target once the mapper aligns their features and the loss enforces ranking consistency.

What would settle it

Measure accuracy on a 170k-token benchmark when the large target uses proxy-derived pruning masks versus masks computed directly on the target itself; a drop larger than a few percent would indicate the scores do not transfer.

Figures

Figures reproduced from arXiv: 2605.16360 by Jie Li, Jiong Lou, Junjie Li.

**Figure 1.** Figure 1: Three KV-cache pruning paradigms: SnapKV (a) heuristic, KVZip (b) reconstruction, ProxyKV (c) asynchronous proxy. Heuristic and Architectural Pruning. Rule-based methods identify non-essential tokens via local patterns: StreamingLLM [Xiao et al., 2023] retains attention sinks; H2O [Zhang et al., 2023], Scissorhands [Liu et al., 2023], and AhaKV [Gu et al., 2025] use accumulated scores or recent attention … view at source ↗

**Figure 2.** Figure 2: Overview of ProxyKV: an asynchronous Small-Model Proxy Ms feeds the HybridAxialMapper, which produces target-aligned importance scores Yˆ for the LargeModel Target Ml without a secondary prefilling pass. 4.1 HybridAxialMapper architecture Design rationale. Cross-model alignment must reconcile two coupled axes: a temporal axis along which token saliency evolves over the sequence, and a head axis along whic… view at source ↗

**Figure 3.** Figure 3: Aggregate accuracy on LongBench and SCBench for the Llama-3.1 and Qwen-2.5 families. ProxyKV tracks the KVZip oracle within ∼1.5 pp at ρ ≥ 0.5 (the gap widens to ∼5 pp at ρ ≤ 0.2, where pruning bites hardest) and outperforms heuristic SnapKV on SCBench. Competitive performance across model families. ProxyKV recovers ∼98.7% of the KVZip oracle across all benchmarks and sparsity levels. As shown in [PITH_FU… view at source ↗

**Figure 4.** Figure 4: Zero-shot transfer to held-out SCBench RepoQA. ProxyKV (blue) tracks the KVZip oracle (red) within 1–2 pp on both targets. Left: Qwen-2.5; right: Llama-3.1. Robust zero-shot generalization. ProxyKV transfers zero-shot to repositorylevel reasoning, matching KVZip on the held-out SCBench.RepoQA task. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Per-dataset performance on the 16 English LongBench tasks (Qwen-2.5); the remaining 5 Chinese subsets are reported in Section D. ProxyKV tracks the KVZip oracle and surpasses SnapKV on dense-synthesis tasks. on code-intensive LCC and RepoBench-P; SnapKV remains competitive only on simple structured tasks like TREC and fails on dense synthesis (SAMSum). The full 21-subset breakdown including the 5 Chinese t… view at source ↗

**Figure 6.** Figure 6: ProxyKV flattens the super-linear latency curve of KVZip while paying a modest one-time memory premium. (a–b): prefilling latency; (c–d): peak GPU memory, across context length. 1.5 2.0 2.5 3.0 Avg. Total Latency (s) 35 40 45 50 55 Avg. LongBench Score (5 tasks) 0.1 0.2 0.3 0.5 1.0 0.1 0.50.2 0.3 1.0 0.1 0.2 0.3 0.5 1.0 Llama-3.1-8B 1.00 1.25 1.50 1.75 2.00 2.25 2.50 Avg. Total Latency (s) 35 40 45 50 55 0… view at source ↗

**Figure 7.** Figure 7: Score–latency Pareto on LongBench (5 representative tasks, total time = prefill + generation). Numbers next to each marker indicate retention ratio ρ. ProxyKV (blue) dominates KVZip (red) on latency at every ρ and dominates SnapKV (orange) on score above ∼1.3 s. 1K 2K 6K 10K 13K 15K 16K 17K Context Length 0 1 2 3 4 5 6 7 ProxyKV Prefill Time (s) Mapper avg = 6.8% Llama-3.1-8B 1K 2K 7K 10K 12K 15K 16K 18K C… view at source ↗

**Figure 8.** Figure 8 [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: ProxyKV tracks the KVZip oracle on RULER across all three target scales (7B, 8B, 32B). RULER 13-task average score versus retention ratio ρ. 6 Ablation studies We isolate the five loss terms, the three HybridAxialMapper stages, and the dominant loss coefficients. Since Figures 3 and 9 cluster tightly at ρ ≥ 0.5, we focus on ρ ∈ {0.1, 0.2}. Ablations use Llama-3.1-8B / Llama-3.2-1B (each LOO variant retrai… view at source ↗

**Figure 10.** Figure 10: Loss LOO ablation, LongBench-21 average. Lbin is the single most critical term at low retention, and the five loss swing-supports are nearly disjoint. At ρ ∈ {0.1, 0.2} removing Lbin produces the largest drop; for ρ ≥ 0.4 all six curves collapse into a 1-point band. The per-task LOO on LongBench-21 ( [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

**Figure 11.** Figure 11: Single-GPU prefill latency vs. context length on real LongBench inputs at ρ=0.3 for Llama-3.1-8B and Qwen-2.5-7B. The ProxyKV–KVZip gap widens monotonically with context length. GPU-count-matched single-GPU context scan. ProxyKV remains 1.3×–1.6× faster than KVZip when every method is constrained to the same single GPU [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

**Figure 12.** Figure 12: Dual-GPU memory timeline for ProxyKV (Llama-3.1-8B target on GPU 1, Llama-3.2-1B proxy on GPU 2). The orange band marks the prefill phase, the green band marks decode. The proxy GPU jumps from 3.5 GB (weights only) to 26.7 GB at the prefill peak—driven almost entirely by the prefill-time activation working set (attention logits, intermediate projections, and short-lived hidden states), since the 1B proxy’… view at source ↗

**Figure 13.** Figure 13: Component leave-one-out on the six representative LongBench subsets, six-task average vs. retention ratio ρ. The discriminating regime is ρ ∈ {0.1, 0.2}, where “w/o Conv” incurs the largest drop (−3.67 on the six-task average), followed by “w/o Time” (−2.38) and “w/o Head” (−1.58); all four configurations re-converge once ρ ≥ 0.5. Per-task drops are quantified in [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

**Figure 14.** Figure 14: Hyperparameter sensitivity of the two dominant loss coefficients on the six representative LongBench subsets. The default (λbin, λmse) = (10, 20) leads at ρ ∈ {0.1, 0.2}, and all sweeps reconverge at high retention. residual robustness to the multi-ratio nature of Lbin: even at λbin=5 the binary signal still receives roughly a quarter of the gradient budget. D Complete LongBench results Figures 15 to 20 r… view at source ↗

**Figure 15.** Figure 15: Complete LongBench results, Qwen-2.5, Part I. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_15.png] view at source ↗

**Figure 16.** Figure 16: Complete LongBench results, Qwen-2.5, Part II. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗

**Figure 17.** Figure 17: Complete LongBench results, Llama-3.1, Part I. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗

**Figure 18.** Figure 18: Complete LongBench results, Llama-3.1, Part II. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗

**Figure 19.** Figure 19: Complete LongBench results, Qwen-3-32B, Part I. Qwen-3-32B target, Part I. The 11 single-document QA, multi-hop, and summarization panels test whether the HybridAxialMapper recipe holds when retrained on the Qwen-3-4B proxy paired with the much larger Qwen-3-32B target (∼8× target/proxy size ratio, the largest in our setup). The cross-method ordering observed on Qwen-2.5 and Llama-3.1 reproduces here with… view at source ↗

**Figure 20.** Figure 20: Complete LongBench results, Qwen-3-32B, Part II. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_20.png] view at source ↗

**Figure 21.** Figure 21: Complete per-task RULER results, Llama-3.1-8B target. 13 subsets × 4 methods × 9 retention ratios. Companion to the aggregate curve in [PITH_FULL_IMAGE:figures/full_fig_p026_21.png] view at source ↗

**Figure 22.** Figure 22: Complete per-task RULER results, Qwen-2.5-7B target with a Qwen-2.5-1.5B proxy. Same axes and method ordering as [PITH_FULL_IMAGE:figures/full_fig_p027_22.png] view at source ↗

**Figure 23.** Figure 23: Complete per-task RULER results, Qwen-3-32B target paired with a dedicated Qwen-3-4B proxy (∼8× target/proxy size ratio, the largest in our setup); ProxyKV stays within 1–2 points of KVZip on every subset. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_23.png] view at source ↗

**Figure 24.** Figure 24: Mass Reconstruction stabilizes near 0.95 within the first quarter of training, confirming that the Multi-Granularity Hybrid Loss converges smoothly without overfitting. Total loss (red) and Mass Reconstruction ratio (blue) over 100,000 training steps. 30 [PITH_FULL_IMAGE:figures/full_fig_p030_24.png] view at source ↗

read the original abstract

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost--accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering $\sim$$98.7\%$ of its mean accuracy) while delivering up to a $3.21\times$ prefilling speedup on Llama-3.1-8B (dual-GPU; $\sim$$1.5\times$ shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ProxyKV gets real speedups on long-context KV pruning by offloading scoring to a small intra-family proxy, but the bridging components need better isolation to show they are doing the heavy lifting.

read the letter

The core takeaway is that this paper gives a practical route around the KV-cache memory wall for long contexts by running importance scoring on a lightweight same-family small model in the background. It reports matching KVZip accuracy at roughly 98.7 percent while cutting prefilling time by up to 3.21 times on Llama-3.1-8B and keeping the gains out to 170k tokens on Qwen models. That combination of cross-model proxy plus async execution is the part worth paying attention to if you work on inference efficiency.

Referee Report

2 major / 2 minor

Summary. The paper introduces ProxyKV, a cross-model proxy pruning framework for efficient long-context LLM inference. It offloads KV importance scoring to a lightweight intra-family small proxy model run asynchronously, bridged to the large target via the HybridAxialMapper (disentangling temporal features from cross-head alignment) and trained with a Multi-Granularity Hybrid Loss emphasizing relative ranking consistency over rigid regression. Evaluations across Llama-3.1, Qwen-2.5, and Qwen-3 families (7B–32B targets) on LongBench, SCBench, and RULER report ~98.7% recovery of KVZip mean accuracy with up to 3.21× prefilling speedup (Llama-3.1-8B, dual-GPU) sustained to 170k tokens.

Significance. If the proxy-to-target score transfer proves robust, the work would be significant for practical long-context deployment: it decouples expensive scoring from target size, yielding measurable prefilling speedups with near-parity accuracy to reconstruction-based baselines like KVZip. The multi-family, multi-benchmark scope and explicit scaling to 170k contexts strengthen the empirical case for asynchronous proxy pruning in production settings.

major comments (2)

[Method (HybridAxialMapper and loss description)] The central claim that proxy importance scores transfer effectively after HybridAxialMapper alignment and Multi-Granularity Hybrid Loss training is load-bearing, yet the manuscript supplies no ablation replacing the mapper with a simpler linear projection or the loss with plain regression/MSE. Without these controls it is impossible to separate the contribution of the proposed bridging machinery from baseline intra-family similarity.
[Experiments and results] Results section: aggregate accuracy recovery (~98.7% of KVZip) is reported without per-benchmark breakdowns, error bars, dataset splits, or statistical tests. This weakens confidence that the speedup-accuracy tradeoff holds reliably across the claimed model sizes and contexts up to 170k tokens.

minor comments (2)

[Method] Notation for the HybridAxialMapper components (temporal vs. cross-head) could be formalized with a small diagram or equation to improve readability.
[Abstract] The abstract states 'up to 3.21×' and '∼1.5×' speedups; clarifying whether these are mean or best-case and on which exact hardware configuration would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our method and results. We address each major comment point by point below and indicate the revisions made.

read point-by-point responses

Referee: [Method (HybridAxialMapper and loss description)] The central claim that proxy importance scores transfer effectively after HybridAxialMapper alignment and Multi-Granularity Hybrid Loss training is load-bearing, yet the manuscript supplies no ablation replacing the mapper with a simpler linear projection or the loss with plain regression/MSE. Without these controls it is impossible to separate the contribution of the proposed bridging machinery from baseline intra-family similarity.

Authors: We agree that explicit ablations are necessary to isolate the contributions of the HybridAxialMapper and Multi-Granularity Hybrid Loss from simpler baselines. In the revised manuscript, we have added these controls: replacing the mapper with a linear projection and the loss with plain MSE regression. The results show that both proposed components improve ranking consistency and transfer performance beyond intra-family similarity alone, particularly for larger context lengths and cross-head misalignment cases. revision: yes
Referee: [Experiments and results] Results section: aggregate accuracy recovery (~98.7% of KVZip) is reported without per-benchmark breakdowns, error bars, dataset splits, or statistical tests. This weakens confidence that the speedup-accuracy tradeoff holds reliably across the claimed model sizes and contexts up to 170k tokens.

Authors: We acknowledge the value of more granular reporting. The revised manuscript now includes per-benchmark accuracy tables for LongBench, SCBench, and RULER, with error bars computed from multiple random seeds where feasible, and explicit dataset split details moved to the appendix. We have also added variance analysis across model sizes and context lengths up to 170k tokens to better substantiate the reliability of the observed tradeoffs. revision: partial

standing simulated objections not resolved

Formal statistical hypothesis testing (e.g., paired t-tests or ANOVA across all benchmarks and scales) was not included in the original experimental protocol and would require substantial additional compute and re-runs that are not feasible within the current revision timeline.

Circularity Check

0 steps flagged

No circularity: empirical proxy pruning framework is self-contained

full rationale

The paper introduces ProxyKV as an empirical cross-model framework that trains a lightweight intra-family proxy with HybridAxialMapper and Multi-Granularity Hybrid Loss to generate transferable KV importance scores for a larger target model. Performance is measured against the external baseline KVZip on LongBench, SCBench, and RULER across multiple model families and sizes, with reported speedups. No mathematical derivation chain, equations, or self-referential definitions appear that would reduce any claimed prediction or result to its own inputs by construction. The method relies on standard training and external empirical validation rather than fitted parameters renamed as predictions or load-bearing self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the named components; full paper would be required to audit training hyperparameters or architectural assumptions.

pith-pipeline@v0.9.0 · 5796 in / 1278 out tokens · 53438 ms · 2026-05-20T22:41:37.111036+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HybridAxialMapper disentangles temporal feature extraction from cross-head alignment together with Multi-Granularity Hybrid Loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 9 internal anchors

[1]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

The Llama 3 Herd of Models

URLhttps://arxiv.org/ abs/2407.21783. Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, and Xiangmin Xu. Ahakv: Adaptive holistic attention-driven kv cache eviction for efficient inference of large language models.arXiv preprint arXiv:2506.03762,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Fewer is more: Boosting math reasoning with reinforced context pruning

Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Fan Yang, and Mao Yang. Fewer is more: Boosting math reasoning with reinforced context pruning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13674–13695,

work page 2024
[7]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim

URLhttps://arxiv.org/abs/2601.07891. Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology,

work page arXiv
[8]

Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416, 2025

URLhttps://github. com/gkamradt/LLMTest_NeedleInAHaystack. GitHub repository. Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416,

work page arXiv
[9]

What layers when: Learning to skip compute in llms with residual gates.arXiv preprint arXiv:2510.13876,

Filipe Laitenberger, Dawid Kopiczko, Cees GM Snoek, and Yuki M Asano. What layers when: Learning to skip compute in llms with residual gates.arXiv preprint arXiv:2510.13876,

work page arXiv
[10]

A survey on large lan- guage model acceleration based on kv cache management

11 Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024a. Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H Abdi, Dongsheng Li, Jianfeng G...

work page arXiv
[11]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms.arXiv preprint arXiv:2512.00722,

Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, and Guohao Dai. Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms.arXiv preprint arXiv:2512.00722,

work page arXiv
[15]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2. 5-1m technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

1 GPU per method

Yi Zhao, Zuchao Li, and Hai Zhao. Iam: Efficient inference through attention mapping between different-scale llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19522–19533, 2025a. Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, and Xiaoming Fu. Smallkv: S...

work page arXiv

[1] [1]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

The Llama 3 Herd of Models

URLhttps://arxiv.org/ abs/2407.21783. Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, and Xiangmin Xu. Ahakv: Adaptive holistic attention-driven kv cache eviction for efficient inference of large language models.arXiv preprint arXiv:2506.03762,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Fewer is more: Boosting math reasoning with reinforced context pruning

Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Fan Yang, and Mao Yang. Fewer is more: Boosting math reasoning with reinforced context pruning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13674–13695,

work page 2024

[7] [7]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim

URLhttps://arxiv.org/abs/2601.07891. Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology,

work page arXiv

[8] [8]

Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416, 2025

URLhttps://github. com/gkamradt/LLMTest_NeedleInAHaystack. GitHub repository. Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction.arXiv preprint arXiv:2505.23416,

work page arXiv

[9] [9]

What layers when: Learning to skip compute in llms with residual gates.arXiv preprint arXiv:2510.13876,

Filipe Laitenberger, Dawid Kopiczko, Cees GM Snoek, and Yuki M Asano. What layers when: Learning to skip compute in llms with residual gates.arXiv preprint arXiv:2510.13876,

work page arXiv

[10] [10]

A survey on large lan- guage model acceleration based on kv cache management

11 Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024a. Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H Abdi, Dongsheng Li, Jianfeng G...

work page arXiv

[11] [11]

Large Language Models: A Survey

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey.arXiv preprint arXiv:2402.06196,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms.arXiv preprint arXiv:2512.00722,

Jiaming Xu, Jiayi Pan, Hanzhen Wang, Yongkang Zhou, Jiancai Ye, Yu Wang, and Guohao Dai. Specontext: Enabling efficient long-context reasoning with speculative context sparsity in llms.arXiv preprint arXiv:2512.00722,

work page arXiv

[15] [15]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2. 5-1m technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

1 GPU per method

Yi Zhao, Zuchao Li, and Hai Zhao. Iam: Efficient inference through attention mapping between different-scale llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19522–19533, 2025a. Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, and Xiaoming Fu. Smallkv: S...

work page arXiv