KV Cache Offloading for Context-Intensive Tasks

Andrey Bocharnikov; Denis Kuznedelev; Ivan Ermakov; Vyacheslav Zhdanovskiy; Yegor Yershov

arXiv: 2604.08426 · v4 · pith:Z4MBO6YKnew · submitted 2026-04-09 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-19 16:37 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.CL

keywords: KV cache offloading, long-context LLMs, context-intensive tasks, inference optimization, Text2JSON benchmark, memory reduction, accuracy preservation

The pith

KV-cache offloading causes major accuracy losses on tasks that require pulling lots of details from long inputs, but a simpler alternative recovers performance across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests KV-cache offloading on context-intensive tasks where the model must extract and use large amounts of information from the prompt to solve the problem. It releases the Text2JSON benchmark for turning raw text into structured JSON and runs evaluations on Llama 3 and Qwen 3. Existing offloading approaches show clear accuracy drops, which the authors trace to low-rank projections of the keys and unreliable landmarks for deciding what to keep or move. They introduce a simpler offloading strategy that raises accuracy on these tasks and on other benchmarks for multiple LLM families. The results indicate that prior tests of long-context methods have overlooked the demands of information-heavy workloads.

Core claim

Standard KV-cache offloading produces large accuracy drops on context-intensive tasks because keys are projected to low rank and the landmarks used to manage the cache are unreliable. Evaluations on the new Text2JSON benchmark and similar tasks confirm the degradation for Llama 3 and Qwen 3. A simpler alternative strategy avoids these issues and delivers substantially higher accuracy across several LLM families and benchmarks.
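To make the low-rank concern concrete, here is a minimal sketch of how projecting keys onto a lower-dimensional subspace can erase the difference between two keys. It is a toy under stated assumptions, not the paper's or any library's implementation: the 4-dimensional keys, the query, and the fixed coordinate basis (used here in place of a learned or SVD-derived basis) are all hypothetical.

```python
# Toy illustration only: a fixed rank-2 projection stands in for whatever
# low-rank factorization a real offloading method would compute.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_low_rank(key, basis):
    """Project `key` onto the subspace spanned by the orthonormal rows of `basis`."""
    coeffs = [dot(key, b) for b in basis]
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(len(key))]

# Hypothetical keys that differ only in the last coordinate.
key_a = [1.0, 0.5, 0.0, 2.0]
key_b = [1.0, 0.5, 0.0, -2.0]

# Assumed rank-2 basis: keeps the first two coordinates, drops the rest.
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# Hypothetical query that relies on the dropped coordinate.
query = [0.2, 0.1, 0.0, 1.0]

full_scores = (dot(query, key_a), dot(query, key_b))
low_rank_scores = (
    dot(query, project_low_rank(key_a, basis)),
    dot(query, project_low_rank(key_b, basis)),
)
print("full-rank scores:", full_scores)
print("low-rank scores: ", low_rank_scores)
```

With full-rank keys the query clearly prefers `key_a`; after projection both keys receive the same score, so a lookup that depends on the dropped direction can no longer tell them apart.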

What carries the argument

The argument rests on two identified sources of accuracy loss, low-rank projection of keys and unreliable landmarks, both of which the simpler alternative strategy bypasses.
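The landmark issue can be sketched in a few lines. The example below assumes that a landmark is the element-wise mean of the keys in a chunk and that chunks are selected by scoring the query against landmarks. That is one common design and is used here only as an assumption; the helper names, chunk size, and toy keys are hypothetical.

```python
# Toy illustration only: mean-of-chunk landmarks and top-k chunk selection.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def chunk_landmarks(keys, chunk_size):
    """Return one landmark per chunk: the element-wise mean of its keys."""
    landmarks = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        dim = len(chunk[0])
        landmarks.append([sum(k[i] for k in chunk) / len(chunk) for i in range(dim)])
    return landmarks

def select_chunks(query, landmarks, budget):
    """Pick the `budget` chunk indices whose landmarks score highest."""
    ranked = sorted(range(len(landmarks)),
                    key=lambda i: dot(query, landmarks[i]), reverse=True)
    return sorted(ranked[:budget])

CHUNK_SIZE = 4
keys = [
    [4.0, 0.0], [-2.0, 0.0], [-2.0, 0.0], [-2.0, 0.0],  # chunk 0: one strong match, diluted
    [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0],     # chunk 1: uniformly mild matches
]
query = [1.0, 0.0]

landmarks = chunk_landmarks(keys, CHUNK_SIZE)
selected = select_chunks(query, landmarks, budget=1)
best_key_chunk = max(range(len(keys)), key=lambda i: dot(query, keys[i])) // CHUNK_SIZE

print("selected chunks:", selected)
print("chunk holding the best-matching key:", best_key_chunk)
```

In this toy case the single best-matching key sits in chunk 0, but averaging dilutes it and the selector fetches chunk 1 instead. Tasks that need many such individual keys would be hit repeatedly by this kind of miss.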

If this is right

Offloading remains viable for memory savings if the simpler strategy replaces current approaches.
Benchmarks must include context-intensive examples such as Text2JSON to give trustworthy results.
The simpler strategy works across multiple LLM families without added complexity.
Long-context compression methods need re-examination when the task requires extracting many facts from the input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications that summarize or query long documents may need custom offloading rules to avoid losing key details.
The released benchmark makes it straightforward for others to check whether new compression ideas hold up under information-heavy conditions.
Similar low-rank and landmark problems could appear in other KV management schemes if they are not tested on extraction-heavy tasks.

Load-bearing premise

The observed accuracy drops are driven mainly by low-rank key projections and unreliable landmarks rather than by other details of the offloading code or the choice of prompts and metrics.

What would settle it

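Run the simpler strategy and the original methods on Text2JSON while forcing full-rank keys or more stable landmark selection in the original methods. If the accuracy gap to the simpler strategy persists despite these fixes, the claimed causes would not hold; if the gap closes, the diagnosis is supported.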
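One way to organize such a check is a one-factor-at-a-time grid, where each run differs from a fixed baseline in exactly one setting. The sketch below only enumerates configurations; the factor names and values are hypothetical placeholders, not settings taken from the paper.

```python
# Sketch of a one-factor-at-a-time ablation grid (all names and values assumed).

def one_factor_ablations(baseline, alternatives):
    """Return configs that differ from `baseline` in exactly one factor."""
    configs = []
    for factor, values in alternatives.items():
        for value in values:
            if value == baseline[factor]:
                continue  # skip the baseline setting itself
            config = dict(baseline)
            config[factor] = value
            configs.append(config)
    return configs

# Hypothetical baseline: everything not under test stays fixed.
baseline = {
    "key_rank": 160,
    "landmark_bits": 16,
    "chunk_size": 8,
    "eviction_policy": "fixed",
    "prompt_template": "v1",
}

# Only the two suspected causes are varied.
alternatives = {
    "key_rank": [160, "full"],
    "landmark_bits": [16, 2, 1],
}

configs = one_factor_ablations(baseline, alternatives)
for config in configs:
    changed = [k for k in baseline if config[k] != baseline[k]]
    print(changed, config)
```

Because every configuration changes a single factor, any accuracy difference against the baseline run can be attributed to that factor rather than to eviction policy, chunk size, or prompt formatting.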

Figures

Figures reproduced from arXiv: 2604.08426 by Andrey Bocharnikov, Denis Kuznedelev, Ivan Ermakov, Vyacheslav Zhdanovskiy, Yegor Yershov.

**Figure 2.** Evaluation of ShadowKV (w/o SVD compression) offloading for Section 4.2 with varying …

**Figure 3.** Evaluation of ShadowKV offloading with different landmark precisions and chunk sizes

**Figure 4.** Evaluation of ShadowKV offloading with different landmark precisions and chunk sizes on …

**Figure 5.** Evaluation of ShadowKV offloading with different KV compression strategies for Qwen3-…

**Figure 6.** Evaluation of ShadowKV (w/o SVD compression) offloading for Section 4.2 with varying …

**Figure 7.** Comparison of 1.5-bit residual landmark quantization with 1-bit and 2-bit HIGGS. Results …

Original abstract

With the growing demand for long-context LLMs across a wide range of applications, the key-value (KV) cache has become a critical bottleneck for both latency and memory usage. Recently, KV-cache offloading has emerged as a promising approach to reduce memory footprint and inference latency while preserving accuracy. Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context. In this work, we study KV-cache offloading on context-intensive tasks: problems where the solution requires looking up a lot of information from the input prompt. We create and release the Text2JSON benchmark, a highly context-intensive task that requires extracting structured knowledge from raw text. We evaluate modern KV offloading on Text2JSON and other context-intensive tasks and find significant performance degradation on both Llama 3 and Qwen 3 models. Our analysis identifies two key reasons for poor accuracy: low-rank projection of keys and unreliable landmarks, and proposes a simpler alternative strategy that significantly improves accuracy across multiple LLM families and benchmarks. These findings highlight the need for a comprehensive and rigorous evaluation of long-context compression techniques.
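To illustrate why a text-to-JSON task is context-intensive, the sketch below scores a predicted JSON document field by field against a reference. The scoring rule (exact match on flattened leaf fields) and the example record are assumptions made for illustration; this page does not describe the benchmark's actual metric.

```python
import json

# Hypothetical scoring sketch: exact match on flattened leaf fields.

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {path: leaf_value} mapping."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        items[prefix] = obj
    return items

def field_accuracy(predicted_text, reference):
    """Fraction of reference leaf fields reproduced exactly; 0.0 on invalid JSON."""
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        return 0.0
    ref_fields = flatten(reference)
    pred_fields = flatten(predicted)
    if not ref_fields:
        return 1.0
    missing = object()  # sentinel that never equals a real value
    hits = sum(1 for path, value in ref_fields.items()
               if pred_fields.get(path, missing) == value)
    return hits / len(ref_fields)

# Made-up example record, not taken from the benchmark.
reference = {"name": "Ada", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 3.0}]}
predicted = '{"name": "Ada", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 4.0}]}'

print(field_accuracy(predicted, reference))
```

Every leaf field has to be recovered from the prompt, so a method that drops even a small share of the relevant keys loses points in proportion, unlike single-answer retrieval tasks where one lookup suffices.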

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags real accuracy drops from KV offloading on tasks that actually need the full context and offers a practical workaround, but the causal story on low-rank keys and landmarks stays correlational without tighter controls.

The key point is that common KV-cache offloading techniques lose a lot of accuracy once the task forces the model to retrieve and use many specific details from a long prompt. The authors built Text2JSON to test exactly that kind of extraction and saw clear degradation on both Llama 3 and Qwen 3 families, then showed a simpler strategy that recovers much of the performance across several benchmarks. That combination of a targeted benchmark plus concrete before-and-after numbers is the useful part here. It directly addresses a gap in how offloading methods have been tested so far, which mostly used lighter tasks. The work is empirical and straightforward about releasing the benchmark, which makes the reported gaps easier to check later.

The soft spot is the attribution step. The abstract links the drops to low-rank key projections and unreliable landmarks, yet it does not describe controlled ablations that keep eviction policy, quantization, and prompt details fixed while changing only those two elements. Without that isolation the diagnosis remains suggestive rather than definitive, and the proposed fix could be helping for other reasons. Minor issues like missing error bars or exact implementation specs are the usual things that would come up in review but do not sink the main observation.

This paper is aimed at people shipping long-context models for document work or multi-step reasoning where memory constraints matter. A practitioner or systems researcher would get immediate value from the benchmark and the accuracy gaps even if they end up tweaking the diagnosis themselves. It is worth sending to a serious referee because the empirical finding is grounded enough to spark useful follow-up work on evaluation standards for offloading.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Text2JSON benchmark for highly context-intensive tasks requiring structured extraction from raw text. It evaluates modern KV-cache offloading methods on Text2JSON and related tasks using Llama 3 and Qwen 3 models, reports significant accuracy degradation, attributes the drops to low-rank key projections and unreliable landmarks, and proposes a simpler alternative offloading strategy that yields accuracy gains across multiple LLM families and benchmarks.

Significance. If the empirical results and causal attributions hold, the work is significant for exposing limitations of existing KV offloading in information-heavy long-context scenarios and for releasing a new benchmark that stresses retrieval over generation. The proposed simpler strategy offers a practical, immediately usable improvement, and the emphasis on rigorous evaluation of compression techniques addresses a timely gap in long-context LLM research.

major comments (1)

[Analysis of causes] Around the identification of low-rank projections and landmarks: the claim that these two factors are the primary drivers of degradation on Text2JSON is not isolated from other offloading implementation choices. No controlled ablations are reported that vary only key-projection rank or landmark reliability while holding eviction policy, quantization, and prompt formatting fixed; the link therefore remains correlational.

minor comments (2)

[Experimental results] Results tables and figures lack error bars or multiple random seeds, making it difficult to assess the statistical reliability of the reported accuracy drops and gains.
[Evaluation setup] The exact implementation details of the baseline offloading systems (e.g., specific eviction heuristics and quantization schemes) should be stated more explicitly to allow reproduction of the observed degradations.
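The error-bar request in the first minor comment is cheap to satisfy once several seeds are run. The sketch below reports a mean and a standard error per method; the method names and per-seed accuracies are placeholders invented for illustration and are not results from the paper.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, standard error of the mean) over per-seed accuracies."""
    mean = statistics.mean(accuracies)
    if len(accuracies) < 2:
        return mean, float("nan")  # no spread estimate from a single seed
    sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, sem

# Placeholder numbers only; replace with measured per-seed accuracies.
runs = {
    "method_a": [0.62, 0.60, 0.64, 0.61],
    "method_b": [0.78, 0.80, 0.77, 0.79],
}

for name, accuracies in runs.items():
    mean, sem = summarize_runs(accuracies)
    print(f"{name}: {mean:.3f} +/- {sem:.3f} (n={len(accuracies)})")
```

Reporting the number of seeds alongside the interval lets a reader judge whether a reported drop or gain exceeds run-to-run variation.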

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on strengthening the causal analysis of accuracy degradation. We address the major comment below.

Referee: Analysis of causes (around the identification of low-rank projections and landmarks): the claim that these two factors are the primary drivers of degradation on Text2JSON is not isolated from other offloading implementation choices. No controlled ablations are reported that vary only key-projection rank or landmark reliability while holding eviction policy, quantization, and prompt formatting fixed; the link therefore remains correlational.

Authors: We acknowledge that the current analysis relies on comparative evaluations across offloading methods and supporting measurements of key projection ranks and landmark reliability rather than fully isolated controlled ablations. While these comparisons hold eviction policy, quantization, and prompt formatting consistent within each method family, we agree that additional experiments varying only the projection rank and landmark mechanism would provide stronger causal evidence. We will add such controlled ablations to the revised manuscript, including quantitative results on Text2JSON and related benchmarks.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

The paper is an empirical study that introduces the Text2JSON benchmark, evaluates existing KV-cache offloading methods on context-intensive tasks across Llama 3 and Qwen 3 models, identifies performance issues through observation, and proposes a simpler alternative strategy based on those results. No mathematical derivation chain, first-principles predictions, or fitted parameters are present that reduce to the paper's own inputs by construction. Claims rest on experimental measurements and analysis rather than self-definitional loops, self-citation load-bearing premises, or renamed known results. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the new benchmark definition, the choice of offloading implementations, and the attribution of accuracy loss to two specific mechanisms; no free parameters or invented entities are described in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression
cs.LG · 2026-05 · unverdicted · novelty 6.0

A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...


[1] [1]

R. Y . Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase, and Y . He. Deepspeed inference: Enabling efficient inference of trans- former models at unprecedented scale. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022

work page 2022

[2] S. Ananthanarayanan and A. Sengupta. Understanding the physics of key-value cache compression for LLMs through attention dynamics. arXiv preprint arXiv:2603.01426, 2026.

[3] S. Ananthanarayanan, A. Sengupta, and T. Chakraborty. Understanding the physics of key-value cache compression for llms through attention dynamics, 2026.

[4] S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024.

[5] Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.

[6] O. Bianchi, M. J. Koretsky, M. Willey, C. X. Alvarado, T. Nayak, A. Asija, N. Kuznetsov, M. A. Nalls, F. Faghri, and D. Khashabi. Lost in the haystack: Smaller needles are more difficult for llms to find. arXiv preprint arXiv:2505.18148, 2025.

[7] A. Chen, Z. Chen, M. Zhang, D. Yang, and H. Zhao. The pitfalls of KV cache compression. arXiv preprint arXiv:2510.00231, 2025.

[8] A. Chen, R. Geh, A. Grover, G. V. den Broeck, and D. Israel. The pitfalls of kv cache compression, 2025.

[9] R. Chen, Z. Wang, B. Cao, T. Wu, S. Zheng, X. Li, X. Wei, S. Yan, M. Li, and Y. Liang. Arkvale: Efficient generative llm inference with recallable key-value eviction. In Advances in Neural Information Processing Systems 37, 2024.

[10] Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. Routledge, and W. Y. Wang. Finqa: A dataset of numerical reasoning over financial data. Proceedings of EMNLP 2021, 2021.

[11] J. Dagdelen, A. Dunn, S. Lee, N. Walker, A. S. Rosen, G. Ceder, K. A. Persson, and A. Jain. Structured information extraction from scientific text with large language models. Nature Communications, 15(1):1418, 2024.

[12] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[13] V. Egiazarian, R. L. Castro, D. Kuznedelev, A. Panferov, E. Kurtic, S. Pandit, A. Marques, M. Kurtz, S. Ashkboos, T. Hoefler, and D. Alistarh. Bridging the gap between promise and performance for microscaling fp4 quantization, 2026.

[14] T. GLM, :, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, J. Sun, J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, X. Gu, X. Lv, X. L...

[15] A. Hengle, P. Bajpai, S. Dan, and T. Chakraborty. Multilingual needle in a haystack: Investigating long-context behavior of multilingual large language models. In L. Chiruzzo, A. Ritter, and L. Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technolo...

[16] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. In Advances in Neural Information Processing Systems 37, 2024.

[17] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. Ruler: What’s the real context size of your long-context language models? In Proceedings of the First Conference on Language Modeling (COLM), 2024.

[18] H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: re-rank with source coding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 861–864. IEEE, 2011.

[19] H. Jiang, Y. Li, C. Zhang, Q. Wu, X. Luo, S. Ahn, Z. Han, A. H. Abdi, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. In Advances in Neural Information Processing Systems 37, 2024.

[20] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[21] W. Lee, J. Lee, J. Seo, and J. Sim. Infinigen: Efficient generative inference of large language models with dynamic kv cache management. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024.

[22] D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. Gonzalez, I. Stoica, X. Ma, and H. Zhang. How long can context length of open-source LLMs truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.

[23] J. Li, N. Farahini, E. Iuliugin, M. Vesterlund, C. Häggström, G. Wang, S. Upasani, A. Sachdeva, R. Li, F. Fu, C. Wu, A. Siddiqua, J. Long, T. Zhao, M. Musaddiq, H. Zeffer, Y. Du, M. Wang, Q. Li, B. Li, U. Thakker, and R. Prabhakar. Snapstream: Efficient long sequence decoding on dataflow accelerators, 2025.

[24] J. Li, M. Wang, Z. Zheng, and M. Zhang. Loogle: Can long-context language models understand long contexts?, 2024.

[25] J. Li, Z. Wang, Y. Zhang, S. Liu, M. Liu, X. Li, J. Chen, Y. Shen, Z. Zhang, Y. Guo, X. Chen, M. Zhao, T. Chen, I. Stoica, H. Chen, L. Chen, et al. SnapStream: Efficient long sequence decoding on dataflow accelerators. arXiv preprint arXiv:2511.03092, 2025.

[26] M. Li, S. Zhang, T. Zhang, H. Duan, Y. Liu, and K. Chen. Needlebench: Evaluating LLM retrieval and reasoning across varying information densities. Transactions on Machine Learning Research, 2025.

[27] Y. Li, Y. Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen. Snapkv: Llm knows what you are looking for before generation. In Advances in Neural Information Processing Systems 37, 2024.

[28] C. Lin, J. Tang, S. Yang, H. Wang, T. Tang, B. Tian, I. Stoica, S. Han, and M. Gao. Twilight: Adaptive attention sparsity with hierarchical top-p pruning, 2025.

[29] H. Lin, ZhiqiBai, X. Zhang, S. Yang, J. Wang, Y. Xu, J. Liu, Y. Zhao, X. Li, Y. Xu, W. Su, and B. Zheng. Reconstructing KV caches with cross-layer fusion for enhanced transformers. In The Fourteenth International Conference on Learning Representations, 2026.

[30] D. Liu, M. Chen, B. Lu, H. Jiang, Z. Han, Q. Zhang, Q. Chen, C. Zhang, B. Ding, K. Zhang, et al. Retrievalattention: Accelerating long-context llm inference via vector retrieval. arXiv preprint arXiv:2409.10516, 2024.

[31] L. Liu, Z. Qu, Z. Chen, Y. Ding, and Y. Xie. Transformer acceleration with dynamic sparse attention. arXiv preprint arXiv:2110.11299, 2021.

[32] T. Liu, C. Xu, and J. McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023.

[33] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. In Proceedings of the 41st International Conference on Machine Learning, 2024.

[34] Q. Luo, Y. Ye, S. Liang, Z. Zhang, Y. Qin, Y. Lu, Y. Wu, X. Cong, Y. Lin, Y. Zhang, et al. Repoagent: An llm-powered open-source framework for repository-level code documentation generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 436–464, 2024.

[35] V. Malinovskii, A. Panferov, I. Ilin, H. Guo, P. Richtárik, and D. Alistarh. Pushing the limits of large language model quantization via the linearity theorem. arXiv preprint arXiv:2411.17525, 2024.

[36] P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Siu, and H. Wu. Fp8 formats for deep learning, 2022.

[37] NVIDIA. Nvidia, arm, and intel publish fp8 specification for standardization as an interchange format for ai. https://developer.nvidia.com/blog/nvidia-arm-and-intel-publish-fp8-specification-for-standardization-as-an-interchange-format-for-ai/, 2022.

[38] NVIDIA. Optimizing inference for long context and large batch sizes with nvfp4 kv cache. https://developer.nvidia.com/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/, 2025.

[39] NVIDIA. Quantization. https://nvidia.github.io/TensorRT-LLM/features/quantization.html, 2026. Accessed: 2026-04-08.

[40] NVIDIA Corporation. Speed up inference with sota quantization techniques in tensorrt-llm. https://nvidia.github.io/TensorRT-LLM/blogs/quantization-in-TRT-LLM.html, 2026. Describes post-training quantization (FP8, INT8, INT4), performance/accuracy trade-offs, and KV-cache quantization in TensorRT-LLM. Accessed: 2026-04-08.

[41] L. Pekelis, M. Feil, F. Moret, M. Huang, and T. Peng. Llama 3 gradient: A series of long context models. https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models, 2024. Gradient AI blog post.

[42] G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, and T. Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024.

[43] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.

[44] A. Qiao, Z. Yao, S. Rajbhandari, and Y. He. SwiftKV: Fast prefill-optimized inference with knowledge-preserving model transformation. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25734–25753, Suzhou, China, Nov. 2025. Association for ...

[45] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, D. Y. Fu, Z. Xie, B. Chen, C. Barrett, J. E. Gonzalez, P. Liang, C. Re, I. Stoica, and C. Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In Proceedings of the 40th International Conference on Machine Learning, 2023.

[46] M. Shi, T. Furon, and H. Jégou. A group testing framework for similarity search in high-dimensional spaces. In Proceedings of the 22nd ACM International Conference on Multimedia, MM ’14, pages 407–416, New York, NY, USA, 2014. Association for Computing Machinery.

[47] A. Shutova, V. Malinovskii, V. Egiazarian, D. Kuznedelev, D. Mazur, S. Nikita, I. Ermakov, and D. Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors, Proceedings of the 42nd International Conference on Mach...

[48] A. Shutova, V. Malinovskii, V. Egiazarian, D. Kuznedelev, D. Mazur, N. Surkov, I. Ermakov, and D. Alistarh. Cache me if you must: Adaptive key-value quantization for large language models. arXiv preprint arXiv:2501.19392, 2025.

[49] H. Sun, L.-W. Chang, W. Bao, S. Zheng, N. Zheng, X. Liu, H. Dong, Y. Chi, and B. Chen. Shadowkv: Kv cache in shadows for high-throughput long-context llm inference. In Proceedings of the 42nd International Conference on Machine Learning, 2025.

[50] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. Quest: Query-aware sparsity for efficient long-context llm inference. In Proceedings of the 41st International Conference on Machine Learning, 2024.

[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[52] vLLM Project. Quantized KV cache. https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/, 2026. Accessed: 2026-04-08.

[53] vLLM Team. Quantized kv cache. https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/, 2026. Accessed: 2026-04-08.

[54] M. Wang, L. Chen, F. Cheng, S. Liao, X. Zhang, B. Wu, H. Yu, N. Xu, L. Zhang, R. Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5627–5646, 2024.

[55] J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025.

[56] D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. CoRR, 2024.

[57] Y. Wu, H. Wu, and K. Tu. A systematic study of cross-layer kv sharing for efficient llm inference, 2025.

[58] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.

[59] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

[60] A. Yang, B. Yu, C. Li, D. Liu, F. Huang, H. Huang, J. Jiang, J. Tu, J. Zhang, J. Zhou, J. Lin, K. Dang, K. Yang, L. Yu, M. Li, M. Sun, Q. Zhu, R. Men, T. He, W. Xu, W. Yin, W. Yu, X. Qiu, X. Ren, X. Yang, Y. Li, Z. Xu, and Z. Zhang. Qwen2.5-1m technical report, 2025.

[61] Y. Yang, Z. Cao, Q. Chen, L. Qin, D. Yang, H. Zhao, and Z. Chen. Kvsharer: Efficient inference via layer-wise dissimilar kv cache sharing. arXiv preprint arXiv:2410.18517, 2024.

[62] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097. A...

[63] T. Yuan, X. Ning, D. Zhou, Z. Yang, S. Li, M. Zhuang, Z. Tan, Z. Yao, D. Lin, B. Li, G. Dai, S. Yan, and Y. Wang. LV-eval: A balanced long-context benchmark with 5 length levels up to 256k. In Second Conference on Language Modeling, 2025.

[64] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Re, C. Barrett, Z. Wang, and B. Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems 36, 2023.

[65] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[66] L. Zheng, L. Yin, Z. Xie, C. L. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng. Sglang: Efficient execution of structured language model programs. In Conference on Neural Information Processing Systems (NeurIPS), 2024.

A Preliminary Benchmark Exploration & Configurations

Before our primary investigati...