FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

Dongwon Jo; Jae-Joon Kim; Jiwon Song; Yulhwa Kim

arxiv: 2502.01068 · v7 · submitted 2025-02-03 · 💻 cs.LG · cs.CL


Pith reviewed 2026-05-23 03:56 UTC · model grok-4.3


keywords: KV cache compression · prefill acceleration · LLM inference · token selection · context reduction · decoding optimization


The pith

FastKV decouples prefill token reduction from KV cache compression via a selective propagation layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that token importance stabilizes in later LLM layers, so full computation can run early and only selected tokens need forwarding afterward. This TSP decision reduces prefill work without forcing the same reduction on the KV cache size. Separate control of the two rates then lets prefill and decoding each be optimized independently while accuracy stays at the level of decoding-only baselines.

Core claim

FastKV runs full-context computation until a chosen Token-Selective Propagation layer, forwards only the most informative tokens from that point, and then independently selects salient KV entries from the propagated tokens for caching, thereby separating the TSP rate that controls prefill cost from the KV retention rate that controls decoding cost.
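The two-stage selection in the core claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the plain-list score inputs, and the convention that both rates are fractions of the original context length are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def top_k_indices(scores: Sequence[float], candidates: Sequence[int], k: int) -> List[int]:
    """Return the k highest-scoring candidate positions, restored to sequence order."""
    k = max(0, min(k, len(candidates)))
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def decoupled_selection(
    tsp_scores: Sequence[float],
    kv_scores: Sequence[float],
    tsp_rate: float,
    kv_retention_rate: float,
) -> Dict[str, List[int]]:
    """Hypothetical helper: pick propagated tokens, then pick cached entries among them."""
    n = len(tsp_scores)
    # Stage 1: the TSP rate decides how many tokens later layers will process.
    propagated = top_k_indices(tsp_scores, range(n), round(n * tsp_rate))
    # Stage 2: the KV retention rate decides how many of those are cached.
    cached = top_k_indices(kv_scores, propagated, round(n * kv_retention_rate))
    return {"propagated": propagated, "cached": cached}


# Illustrative scores only; they are not taken from the paper.
tsp_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
kv_scores = [0.0, 0.2, 0.0, 0.9, 0.0, 0.1, 0.0, 0.8]
narrow = decoupled_selection(tsp_scores, kv_scores, tsp_rate=0.5, kv_retention_rate=0.25)
wide = decoupled_selection(tsp_scores, kv_scores, tsp_rate=0.75, kv_retention_rate=0.25)
print(narrow, wide)
```

In this toy run the propagated set grows when the TSP rate is raised, while the cached set keeps the same size, which is the separation the claim describes.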

What carries the argument

Token-Selective Propagation (TSP) layer, which selects and forwards only informative tokens after early full computation so that later layers and KV selection operate on a reduced set.

If this is right

Increasing the TSP drop rate speeds prefill without changing the KV cache size or decoding budget.
Lowering the KV retention rate speeds decoding without changing how much prefill computation is saved.
The two rates can be chosen independently to match different hardware memory and latency targets.
Accuracy stays comparable to methods that compress only the KV cache during decoding.
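A toy cost model makes the first two consequences concrete. It assumes every layer costs the same per token (attention is in fact superlinear in sequence length, so this is only a rough proxy); the function name and the 32-layer example are illustrative assumptions, not figures from the paper.

```python
def relative_prefill_cost(num_layers: int, tsp_layer: int, tsp_rate: float) -> float:
    """Fraction of full-prefill token-layer work, assuming equal cost per token per layer."""
    if not 0 <= tsp_layer <= num_layers:
        raise ValueError("tsp_layer must lie between 0 and num_layers")
    if not 0.0 < tsp_rate <= 1.0:
        raise ValueError("tsp_rate must be in (0, 1]")
    full_layers = tsp_layer                  # layers that see every token
    reduced_layers = num_layers - tsp_layer  # layers that see only propagated tokens
    return (full_layers + reduced_layers * tsp_rate) / num_layers


# Note that the KV retention rate is not an input: in this model it only
# sets the cache size used at decode time, not the prefill work.
cost = relative_prefill_cost(num_layers=32, tsp_layer=16, tsp_rate=0.25)
print(cost)
```

Under these assumptions, halving the model at the TSP layer and propagating a quarter of the tokens leaves 62.5% of the full-prefill token-layer work.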

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same separation could be tested on non-transformer architectures that also build growing caches.
Dynamic choice of TSP layer during a single inference run might further reduce average cost on mixed-length inputs.
If token stability holds across model scales, the TSP layer position may be predictable from depth alone.

Load-bearing premise

Token importance becomes stable after some layer so that dropping tokens there does not remove information required for accurate results in the remaining layers.

What would settle it

Run the same model and prompts with the TSP layer placed at different depths while holding KV retention fixed and measure whether accuracy remains flat or drops sharply when the layer is moved earlier.
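The experiment above amounts to a one-dimensional sweep. The harness below is a sketch that assumes the caller supplies an `evaluate(tsp_layer, tsp_rate, kv_retention_rate)` callable returning an accuracy score; that callable, the helper names, and the stand-in evaluator with made-up numbers are hypothetical and exist only to show the shape of the test.

```python
from typing import Callable, Dict, Iterable, Tuple


def sweep_tsp_layer(
    evaluate: Callable[[int, float, float], float],
    tsp_layers: Iterable[int],
    tsp_rate: float,
    kv_retention_rate: float,
) -> Dict[int, float]:
    """Score each TSP layer position while both rates stay fixed."""
    return {layer: evaluate(layer, tsp_rate, kv_retention_rate) for layer in tsp_layers}


def sharpest_drop(results: Dict[int, float]) -> Tuple[int, float]:
    """Return (earlier_layer, accuracy_lost) for the adjacent pair with the largest loss."""
    layers = sorted(results)
    if len(layers) < 2:
        raise ValueError("need at least two TSP layer positions to compare")
    worst_layer, worst_loss = layers[0], 0.0
    for earlier, later in zip(layers, layers[1:]):
        loss = results[later] - results[earlier]
        if loss > worst_loss:
            worst_layer, worst_loss = earlier, loss
    return worst_layer, worst_loss


def fake_evaluate(tsp_layer: int, tsp_rate: float, kv_retention_rate: float) -> float:
    """Placeholder with invented numbers; replace with a real benchmark run."""
    return 0.8 if tsp_layer >= 12 else 0.5


results = sweep_tsp_layer(fake_evaluate, [4, 8, 12, 16, 20], tsp_rate=0.2, kv_retention_rate=0.1)
layer, loss = sharpest_drop(results)
print(results, layer, loss)
```

A flat `results` curve would support the stabilization premise; a cliff at some depth would locate the earliest safe TSP position for that model and task.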

Figures

Figures reproduced from arXiv: 2502.01068 by Dongwon Jo, Jae-Joon Kim, Jiwon Song, Yulhwa Kim.

**Figure 2.** Illustration of the proposed FastKV scheme. The proposed FastKV introduces Token-Selective Propagation ...

**Figure 3.** Comparison of normalized L2 distances be...

**Figure 4.** End-to-end inference latency breakdown of LLaMA-3.1-8B-Instruct at varying input context lengths

**Figure 5.** (a) Effect of TSP rate on LongBench average ...

**Figure 6.** Comparison of processing flows between GemFilter and FastKV. GemFilter prunes the input prompt and ...

**Figure 7.** Visualization of attention computation. In GemFilter, discarded tokens are entirely excluded from attention, ...

**Figure 8.** Needle-in-a-Haystack results of LLaMA-3.1-8B-Instruct with 10% KV retention rate.

**Figure 9.** End-to-end inference latency breakdown of Ministral-8B-Instruct at varying input context lengths

Original abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82× in prefill and 2.87× in decoding compared to the full-context baseline, while matching the accuracy of the decoding-only baselines. Our code is available at https://github.com/dongwonjo/FastKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

FastKV decouples prefill pruning from KV selection with a TSP layer but the whole claim rests on token importance stabilizing after that layer.


The main new piece is the explicit split between a TSP layer that cuts tokens for the rest of prefill and a separate step that picks which KV entries to retain from the surviving tokens. Earlier compression work tied the two rates together because of layer-dependent token importance, so giving independent knobs on TSP rate and KV retention rate is a useful framing even if the underlying observation is not brand new. They report concrete speedups (up to 1.82× prefill, 2.87× decode) against full-context and decoding-only baselines while claiming accuracy parity, and the code is released, which makes the empirical claims checkable. That is the part worth taking seriously for anyone tuning long-context inference.

The soft spot is the stabilization assumption itself. The method needs a single intermediate layer where dropping the unselected tokens does not materially change what the deeper layers compute or what the KV selector later sees. The abstract presents this as given rather than showing layer-sensitivity tests, error introduced by early selection, or why the chosen layer generalizes across models and tasks. If importance keeps shifting, the accuracy match would not hold and the decoupling would not deliver what is claimed. The free parameters (TSP position, rates) also suggest post-hoc choices that need scrutiny in the experiments.

This is for engineers working on practical KV cache and prefill optimizations rather than core model research. A reader already implementing compression methods could borrow the control-knob idea, but the central result needs the full experimental details to stand up. I would send it for peer review so the layer choice and accuracy claims can be properly examined.

Referee Report

3 major / 2 minor

Summary. The paper claims that by performing full-context computation up to a Token-Selective Propagation (TSP) layer where token importance has stabilized, FastKV forwards only selected tokens to later layers while independently selecting salient KV entries for the cache; this decouples the TSP rate (prefill reduction) from the KV retention rate (decoding budget), yielding speedups of up to 1.82× prefill and 2.87× decoding versus the full-context baseline while matching accuracy of decoding-only baselines.

Significance. If the stabilization assumption holds across models and tasks, the decoupling enables independent tuning of prefill compute and KV memory that prior coupled compression methods could not achieve, potentially improving long-context inference efficiency. The open-sourced code at the provided GitHub link supports reproducibility.

major comments (3)

[§3] §3 (Method): The central decoupling claim rests on the assertion that token importance stabilizes after the TSP layer with negligible impact on later-layer accuracy, yet no layer-wise analysis, importance-score trajectories, or error bound is provided to justify the chosen TSP position or to quantify degradation when non-selected tokens are dropped.
[§4] §4 (Experiments): The accuracy-parity claim versus decoding-only baselines is reported without an ablation that varies TSP rate independently of KV retention rate while holding total compute fixed; without this, it is unclear whether the observed parity is due to the decoupling or to post-hoc selection of the TSP layer and rates.
[Table 2] Table 2 (or equivalent results table): Speedup numbers are given relative to the full-context baseline, but the corresponding accuracy numbers for the same TSP/KV configurations are not shown alongside the decoding-only baselines, making it impossible to verify that the independent control does not trade accuracy for the reported prefill gains.
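The layer-wise analysis requested in the first major comment could be approximated by checking how much each layer's top-k token set overlaps with the final layer's. The sketch below uses Jaccard overlap as a stand-in stability metric; it is not the measure used in the paper, and the function names and toy scores are assumptions for illustration.

```python
from typing import List, Sequence, Set


def top_k_set(scores: Sequence[float], k: int) -> Set[int]:
    """Positions of the k highest-scoring tokens in one layer."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def overlap_with_final_layer(scores_by_layer: Sequence[Sequence[float]], k: int) -> List[float]:
    """Per-layer Jaccard overlap between that layer's top-k set and the last layer's."""
    reference = top_k_set(scores_by_layer[-1], k)
    return [jaccard(top_k_set(layer_scores, k), reference) for layer_scores in scores_by_layer]


# Toy importance scores for four layers over four tokens (illustrative only).
scores_by_layer = [
    [0.9, 0.8, 0.1, 0.2],
    [0.9, 0.1, 0.8, 0.2],
    [0.1, 0.2, 0.9, 0.8],
    [0.1, 0.2, 0.8, 0.9],
]
overlaps = overlap_with_final_layer(scores_by_layer, k=2)
print(overlaps)
```

A curve that climbs and then plateaus near 1.0 would be evidence for stabilization, and the start of the plateau would be a candidate TSP position.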

minor comments (2)

[Abstract] Abstract and §1: The phrase 'matching the accuracy of the decoding-only baselines' should specify the exact baselines, datasets, and metrics (e.g., perplexity or downstream task accuracy) used for the comparison.
[§3.2] §3.2: The TSP layer is described as selecting 'the most informative tokens' without an explicit equation or pseudocode for the importance scoring function, which would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.


Referee: [§3] §3 (Method): The central decoupling claim rests on the assertion that token importance stabilizes after the TSP layer with negligible impact on later-layer accuracy, yet no layer-wise analysis, importance-score trajectories, or error bound is provided to justify the chosen TSP position or to quantify degradation when non-selected tokens are dropped.

Authors: We agree that explicit layer-wise analysis would better support the stabilization assumption. In the revision we will add plots of token importance scores across layers for representative models and tasks, along with an empirical quantification of accuracy degradation when tokens are dropped at different depths. This will justify the chosen TSP layer position without altering the core method.

revision: yes

Referee: [§4] §4 (Experiments): The accuracy-parity claim versus decoding-only baselines is reported without an ablation that varies TSP rate independently of KV retention rate while holding total compute fixed; without this, it is unclear whether the observed parity is due to the decoupling or to post-hoc selection of the TSP layer and rates.

Authors: The current experiments already vary TSP rate and KV retention rate independently and report accuracy parity with decoding-only baselines. To directly address the fixed-compute concern, we will add a new ablation table that sweeps the two rates while constraining total FLOPs to match the decoding-only baseline; this will be included in the revised Section 4.

revision: yes

Referee: [Table 2] Table 2 (or equivalent results table): Speedup numbers are given relative to the full-context baseline, but the corresponding accuracy numbers for the same TSP/KV configurations are not shown alongside the decoding-only baselines, making it impossible to verify that the independent control does not trade accuracy for the reported prefill gains.

Authors: We will expand Table 2 (or add a companion table) to report accuracy for every TSP/KV configuration next to the speedups, together with the corresponding decoding-only baseline accuracies, enabling direct side-by-side verification.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent validation


The paper introduces FastKV as an empirical framework that assumes token importance stabilization after a TSP layer and demonstrates speedups plus accuracy parity via direct comparisons to full-context and decoding-only baselines. No equations, parameter fits, or self-citations are presented that reduce the central claims to inputs by construction; the decoupling of TSP rate and KV retention rate is justified by experimental outcomes rather than definitional or fitted equivalence. The stabilization premise is treated as an enabling observation, not derived from prior results within the paper itself.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 1 invented entity

The framework rests on an empirical domain observation about token importance and introduces two tunable rates plus a new layer concept whose values are selected for performance.

free parameters (3)

TSP layer position
Layer chosen where token importance is assumed to stabilize; value not specified in abstract.
TSP rate
Fraction of tokens propagated after the TSP layer; independent control parameter.
KV retention rate
Fraction of salient KV entries kept in cache; set independently of TSP rate.
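The three free parameters can be grouped into one validated settings object. This is a hypothetical container written for illustration; the class name, field names, and example values are assumptions and do not mirror the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FastKVSettings:
    """Hypothetical bundle of the three tunable quantities listed above."""

    tsp_layer: int            # index of the layer after which tokens are dropped
    tsp_rate: float           # fraction of tokens propagated past the TSP layer
    kv_retention_rate: float  # fraction of KV entries kept in the cache

    def __post_init__(self) -> None:
        if self.tsp_layer < 0:
            raise ValueError("tsp_layer must be non-negative")
        for name in ("tsp_rate", "kv_retention_rate"):
            value = getattr(self, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")

    def check_depth(self, num_layers: int) -> None:
        """Reject a TSP layer index that does not exist in the target model."""
        if self.tsp_layer >= num_layers:
            raise ValueError(f"tsp_layer {self.tsp_layer} is outside a {num_layers}-layer model")


# Example values are placeholders, not settings reported by the authors.
settings = FastKVSettings(tsp_layer=15, tsp_rate=0.2, kv_retention_rate=0.1)
settings.check_depth(num_layers=32)
```

Keeping the two rates as separate fields, with no constraint linking them, reflects the independence the ledger describes.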

axioms (1)

domain assumption: Token importance stabilizes in later layers of transformer LLMs.
Invoked to justify full-context computation only up to the TSP layer.

invented entities (1)

Token-Selective Propagation (TSP) layer · no independent evidence
purpose: Designated layer that selects and forwards only the most informative tokens to later layers.
New construct introduced to enable the decoupling.

pith-pipeline@v0.9.0 · 5788 in / 1345 out tokens · 47899 ms · 2026-05-23T03:56:11.506963+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
cs.AI · 2026-05 · unverdicted · novelty 6.0

OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark s...

SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
cs.LG · 2026-04 · unverdicted · novelty 6.0

SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference
cs.CL · 2026-04 · unverdicted · novelty 6.0

StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
cs.CV · 2026-05 · conditional · novelty 5.0

MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.

[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr \'o n, and Sumit Sanghai. 2023. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Anthropic . 2024. The claude 3 model family: Opus, sonnet, haiku

work page 2024

[6] [6]

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems (NeurIPS), 37:100213--100240

work page 2024

[7] [7]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, and 1 others. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. https://doi.org/10.48550/arXiv.2406.02069 Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling . CoRR, abs/2406.02069

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02069 2024

[9] [9]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. International Conference on Learning Representations (ICLR)

work page 2023

[11] [11]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2024. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. Gptq: Accurate post-training quantization for generative pre-trained transformers. International Conference on Learning Representations (ICLR)

work page 2023

[14] [14]

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. 2024. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258

work page arXiv 2024

[15] [15]

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems (NeurIPS), 37:1270--1303

work page 2024

[16] [16]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. https://arxiv.org/abs/2407.02490 Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention . Preprint, arXiv:2407.02490

work page arXiv 2024

[18] [18]

Greg Kamradt. 2023. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack

[19]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611--626

[20]

Philippe Laban, Wojciech Kryscinski, Divyansh Agarwal, A. R. Fabbri, Caiming Xiong, Shafiq R. Joty, and Chien-Sheng Wu. 2023. Summedits: Measuring llm ability at factual reasoning through the lens of summarization. In Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:266164172

[21]

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, and Xun Zhou. 2025. Flexprefill: A context-aware sparse attention mechanism for efficient long-sequence inference. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=OfjIlbelrT

[22]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems (NeurIPS)

[23]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87--100

[24]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. International Conference on Machine Learning (ICML)

[25]

Mistral AI Team. 2025. Mistral-nemo. https://mistral.ai/news/ministraux

[26]

Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, and Jun Wang. 2025. Accelerating prefilling for long-context llms via sparse pattern sharing. arXiv preprint arXiv:2505.19578

[27]

Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. 2025. Longcodebench: Evaluating coding llms at 1m context windows. arXiv preprint arXiv:2505.07897

[28]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems (NeurIPS), 37:68658--68685

[29]

Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. 2024. Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction. arXiv preprint arXiv:2409.17422

[30]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453

[31]

Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. 2024a. PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3258--3270, Bangkok, Thailand. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.195

[32]

June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. 2024b. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096

[33]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and 1 others. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving. arXiv preprint arXiv:2501.01005

[34]

Zihao Yi, Jiarui Ouyang, Zhe Xu, Yuwen Liu, Tianhao Liao, Haohao Luo, and Ying Shen. 2024. A survey on recent advances in llm-based multi-turn dialogue systems. arXiv preprint arXiv:2402.18013

[35]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems (NeurIPS)

[36]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, and 1 others. 2024. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems (NeurIPS), 37:62557--62583
