LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Bingsheng He; Han Chen; Mian Lu; Pingyi Luo; Yuqiang Chen; Zicong Jiang; Zining Zhang

arxiv: 2503.19950 · v1 · pith:KRPZRQSKnew · submitted 2025-03-25 · 💻 cs.LG · cs.AI· cs.CL

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Han Chen , Zicong Jiang , Zining Zhang , Bingsheng He , Pingyi Luo , Mian Lu , Yuqiang Chen This is my paper

Pith reviewed 2026-05-22 22:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords KV cachequantization2-bitLLM inferencememory optimizationlog distributionaccuracy preservationlarge language models

0 comments

The pith

LogQuant's log-based filtering enables 2-bit KV cache quantization with higher accuracy than prior token-importance methods across the full context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LogQuant as a 2-bit quantization method for the KV cache in LLM inference. It employs a log-based filtering mechanism to selectively compress the cache throughout the entire context instead of depending on assumptions about later tokens being more important or on attention pattern predictions. This leads to better performance at the same or lower memory usage. Sympathetic readers would care because the method reportedly increases throughput by 25 percent, allows 60 percent larger batch sizes, and boosts accuracy by 40 to 200 percent on difficult tasks like math and code completion without raising memory needs.

Core claim

By applying a log-based filtering mechanism, LogQuant selectively compresses the KV Cache across the entire context in 2 bits, achieving better performance with the same or even reduced memory footprint compared to existing methods that assume later tokens are more important or attempt to predict important tokens based on earlier attention patterns.

What carries the argument

The log-based filtering mechanism that selectively compresses KV cache values in a log-distributed manner for 2-bit quantization.

If this is right

Throughput increases by 25% without additional memory consumption.
Batch size can be boosted by 60% at the same memory footprint.
Accuracy on math and code completion tasks improves by 40% to 200% at the same compression ratio.
Integration with frameworks like the transformers library is straightforward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the log distribution proves effective, it could be tested on other compression ratios or model architectures beyond the reported benchmarks.
The approach might allow longer context windows in LLMs by reducing per-token memory more efficiently.
Combining LogQuant with dynamic context management techniques could further optimize inference in resource-constrained settings.

Load-bearing premise

The log-based filtering mechanism selectively compresses KV cache values across the full context without introducing its own accuracy losses or computational overheads.

What would settle it

Running the same math and code completion benchmarks with LogQuant at 2-bit compression and comparing accuracy directly against existing 2-bit methods without the log filtering to check if the reported 40-200% gains hold.

Figures

Figures reproduced from arXiv: 2503.19950 by Bingsheng He, Han Chen, Mian Lu, Pingyi Luo, Yuqiang Chen, Zicong Jiang, Zining Zhang.

**Figure 2.** Figure 2: The maximum attention score of each token position across four consecutive decod [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Attention distribution across different token positions, represented as boxplots based on [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The attention coverage without the first two sink tokens for different selection meth [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Eviction and Quantization Loss on Attention Distribution [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: LogQuant’s KV cache compression workflow. The number of reserved original-precision [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Accuracy(EM) with different compression ratio in GSM8K tasks for different models. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: memory usage and throughput comparison between 2bit LogQuant and 16bit baseline [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

read the original abstract

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques.LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. Implementation can be available in https://github.com/Concyclics/LogQuantKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LogQuant uses log filtering for 2-bit KV cache quantization across the full context and reports solid throughput and accuracy gains over recency or attention baselines.

read the letter

The main point is that this paper describes a log-distributed filtering method to quantize the entire KV cache to 2 bits. It avoids the bottlenecks of dropping tokens by recency or by predicting importance from early attention patterns, and the benchmarks show 25% throughput improvement plus 60% larger batches at the same memory use. Accuracy on math and code tasks rises 40-200% at equivalent compression ratios compared with the baselines it tests against. The implementation hooks into the transformers library and a GitHub repo is linked, which makes it easy to try out. The full manuscript backs the central claim with method details and results, and no internal contradictions appear in the quantization or filtering logic. The log distribution is a straightforward way to handle value ranges in KV entries, and the reported numbers line up with the described approach. Minor soft spots exist around the size of the accuracy jumps. Those gains are large, so extra checks on run-to-run variance and exact baseline configurations would help readers trust the upper end of the range. A bit more on the added compute cost of the filtering step itself would also be useful, even though the throughput figures suggest the overhead stays manageable. This work targets engineers and researchers who optimize LLM inference for long contexts or bigger batches. Anyone already looking at KV cache compression will find the direct comparisons and integration notes practical. The empirical grounding and distinct technique make it worth a serious referee. I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces LogQuant, a 2-bit quantization technique for KV cache in LLMs. It employs a log-based filtering mechanism to selectively compress KV values across the full context rather than relying on later-token importance assumptions or attention-pattern predictions. The approach is claimed to deliver 25% higher throughput, 60% larger batch sizes at fixed memory, and 40-200% accuracy gains on math and code-completion tasks relative to prior 2-bit methods, while integrating directly with the transformers library.

Significance. If the reported benchmarks are reproducible, the work offers a practical advance in memory-efficient LLM inference by providing a context-wide compression strategy that sidesteps common misprediction issues in token-selection baselines. The emphasis on end-to-end integration and concrete throughput/batch-size numbers strengthens its potential impact for deployment.

minor comments (2)

The abstract states quantitative gains (25% throughput, 40-200% accuracy) without referencing the corresponding experimental tables or figures; adding explicit cross-references would improve readability.
The final sentence of the abstract contains awkward phrasing ('Implementation can be available in https://github.com/Concyclics/LogQuantKV'); a direct statement that code will be released would be clearer.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the provided report, so there are no individual points requiring point-by-point rebuttal or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical 2-bit KV cache quantization method based on a log-distributed filtering approach, with performance claims resting on integration details and benchmark results rather than any derivation chain. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the provided text. The central claims reduce to the described algorithm and reported throughput/accuracy numbers, which are externally falsifiable and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.0 · 5732 in / 1319 out tokens · 31318 ms · 2026-05-22T22:04:24.485785+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the positions of the attention spikes follow a log distribution ... log-distributed token selection scheme ... base-2 logarithmic approach ... halving the density
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

LogQuant ... log-distributed high-attention pattern ... log2 sparsity selection

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 8 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical re- port: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Gqa: Training generalized multi-query transformer models from multi-head check- points

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head check- points. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pp. 4895–4901,

work page 2023
[3]

Hicham Badri and Appu Shaji

(Accessed on 09/26/2024). Hicham Badri and Appu Shaji. Half-quadratic quantization of large machine learning models, November

work page 2024
[4]

Qwen Technical Report

URL https://mobiusml.github.io/hqq_blog/. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang

(Accessed on 09/26/2024). Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643,

work page arXiv 2024
[7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Retrieval-Augmented Generation for Large Language Models: A Survey

URL https://github.com/huggingface/ optimum-quanto. Accessed: 2024-09-06. 12 Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm.arXiv preprint arXiv:2403.05527, 2024

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527,

work page arXiv
[10]

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Acti- vationaware weight quantization for llm compression and acceleration. arxiv. arXiv preprint arXiv:2306.00978,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Qserve: W4A8KV4 quantization and system co-design for efficient LLM serving.CoRR, abs/2405.04532, 2024

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532,

work page arXiv
[13]

Mini- cache: Kv cache compression in depth dimension for large language models

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture- of-experts language model. CoRR, 2024a. Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Mini- cache: Kv cache compression in depth dimension ...

work page arXiv
[14]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal

(Accessed on 09/26/2024). Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP,

work page 2024
[15]

Fast Transformer Decoding: One Write-Head is All You Need

OpenAI. Models - openai api. https://platform.openai.com/docs/models/ gpt-4-and-gpt-4-turbo , 2024a. (Accessed on 09/26/2024). OpenAI. Openai o1 hub — openai. https://openai.com/o1/, 2024b. (Accessed on 09/26/2024). Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Accessed: 2024-09-

URL https://huggingface.co/blog/kv-cache-quantization. Accessed: 2024-09-

work page 2024
[17]

Infllm: Unveiling the intrinsic capacity of llms for under- standing extremely long sequences with training-free memory

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for under- standing extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617,

work page arXiv
[18]

Llm inference unveiled: Survey and roofline model insights

Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363,

work page arXiv
[19]

Recent methods aim to compress the KV cache further while preserving accuracy

and Qserve (Lin et al., 2024), applied 4-bit quantization to the KV cache with minimal accuracy loss. Recent methods aim to compress the KV cache further while preserving accuracy. QAQ (Dong et al.,

work page 2024
[20]

KiVi (Liu et al., 2024c) introduces a 2-bit quantization by retaining a recent window of full-precision tokens, balancing memory efficiency and accuracy

improves accuracy by storing the quantization error of the KV cache as a sparse matrix with low-rank decomposition. KiVi (Liu et al., 2024c) introduces a 2-bit quantization by retaining a recent window of full-precision tokens, balancing memory efficiency and accuracy. A.3 TRAINING -REQUIRED APPROACHES An early memory-reducing attention design is Multi-Qu...

work page 2019

[1] [1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical re- port: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Gqa: Training generalized multi-query transformer models from multi-head check- points

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head check- points. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Pro- cessing, pp. 4895–4901,

work page 2023

[3] [3]

Hicham Badri and Appu Shaji

(Accessed on 09/26/2024). Hicham Badri and Appu Shaji. Half-quadratic quantization of large machine learning models, November

work page 2024

[4] [4]

Qwen Technical Report

URL https://mobiusml.github.io/hqq_blog/. Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang

(Accessed on 09/26/2024). Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643,

work page arXiv 2024

[7] [7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Retrieval-Augmented Generation for Large Language Models: A Survey

URL https://github.com/huggingface/ optimum-quanto. Accessed: 2024-09-06. 12 Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm.arXiv preprint arXiv:2403.05527, 2024

Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. Gear: An efficient kv cache compression recipefor near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527,

work page arXiv

[10] [10]

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Acti- vationaware weight quantization for llm compression and acceleration. arxiv. arXiv preprint arXiv:2306.00978,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Qserve: W4A8KV4 quantization and system co-design for efficient LLM serving.CoRR, abs/2405.04532, 2024

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532,

work page arXiv

[13] [13]

Mini- cache: Kv cache compression in depth dimension for large language models

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture- of-experts language model. CoRR, 2024a. Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Mini- cache: Kv cache compression in depth dimension ...

work page arXiv

[14] [14]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal

(Accessed on 09/26/2024). Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP,

work page 2024

[15] [15]

Fast Transformer Decoding: One Write-Head is All You Need

OpenAI. Models - openai api. https://platform.openai.com/docs/models/ gpt-4-and-gpt-4-turbo , 2024a. (Accessed on 09/26/2024). OpenAI. Openai o1 hub — openai. https://openai.com/o1/, 2024b. (Accessed on 09/26/2024). Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Accessed: 2024-09-

URL https://huggingface.co/blog/kv-cache-quantization. Accessed: 2024-09-

work page 2024

[17] [17]

Infllm: Unveiling the intrinsic capacity of llms for under- standing extremely long sequences with training-free memory

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for under- standing extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617,

work page arXiv

[18] [18]

Llm inference unveiled: Survey and roofline model insights

Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, et al. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363,

work page arXiv

[19] [19]

Recent methods aim to compress the KV cache further while preserving accuracy

and Qserve (Lin et al., 2024), applied 4-bit quantization to the KV cache with minimal accuracy loss. Recent methods aim to compress the KV cache further while preserving accuracy. QAQ (Dong et al.,

work page 2024

[20] [20]

KiVi (Liu et al., 2024c) introduces a 2-bit quantization by retaining a recent window of full-precision tokens, balancing memory efficiency and accuracy

improves accuracy by storing the quantization error of the KV cache as a sparse matrix with low-rank decomposition. KiVi (Liu et al., 2024c) introduces a 2-bit quantization by retaining a recent window of full-precision tokens, balancing memory efficiency and accuracy. A.3 TRAINING -REQUIRED APPROACHES An early memory-reducing attention design is Multi-Qu...

work page 2019