LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Pith reviewed 2026-05-22 22:04 UTC · model grok-4.3
The pith
LogQuant's log-based filtering enables 2-bit KV cache quantization with higher accuracy than prior token-importance methods across the full context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a log-based filtering mechanism, LogQuant selectively compresses the KV Cache across the entire context in 2 bits, achieving better performance with the same or even reduced memory footprint compared to existing methods that assume later tokens are more important or attempt to predict important tokens based on earlier attention patterns.
What carries the argument
The log-based filtering mechanism that selectively compresses KV cache values in a log-distributed manner for 2-bit quantization.
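As a rough illustration of what a log-distributed selection could look like, the sketch below picks token positions whose density halves with each step back from the newest token. It is not the paper's algorithm: the function name `log_spaced_positions` and the `window` parameter are assumptions made for this example, and the band layout is only one way to realize a base-2 "halving the density" pattern.

```python
def log_spaced_positions(seq_len: int, window: int) -> list[int]:
    """Pick positions with density 1, 1/2, 1/4, ... going back from the newest token.

    Hypothetical helper for illustration; not taken from the LogQuant code base.
    """
    kept = []
    stride = 1          # keep every `stride`-th token in the current band
    end = seq_len       # exclusive upper bound of the current band
    while end > 0:
        # Each band spans window * stride tokens and keeps at most `window` of them.
        start = max(0, end - window * stride)
        kept.extend(range(end - 1, start - 1, -stride))
        end = start
        stride *= 2     # halve the density for the next, older band
    return sorted(kept)


if __name__ == "__main__":
    print(log_spaced_positions(16, 4))
```

For a 16-token context with `window=4` this keeps positions `[3, 5, 7, 9, 11, 12, 13, 14, 15]`, so the number of selected positions grows roughly with the logarithm of the context length rather than linearly.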
If this is right
- Throughput increases by 25% without additional memory consumption.
- Batch size can be boosted by 60% at the same memory footprint.
- Accuracy on math and code completion tasks improves by 40% to 200% at the same compression ratio.
- Integration with frameworks like the transformers library is straightforward.
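The memory side of these claims can be sanity-checked with simple arithmetic. The sketch below estimates raw KV cache size from model dimensions and bit width. The model dimensions are hypothetical, and the estimate ignores quantization scales, zero-points, and any tokens kept at full precision, so it does not reproduce the paper's 25% and 60% figures, which depend on its own baseline and setup.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bits):
    """Raw KV cache size in bytes: one K and one V tensor per layer."""
    elements = 2 * batch * seq_len * layers * kv_heads * head_dim
    return elements * bits // 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(batch=1, seq_len=4096, layers=32, kv_heads=8, head_dim=128)
    fp16 = kv_cache_bytes(bits=16, **shape)
    int2 = kv_cache_bytes(bits=2, **shape)
    print(fp16 / 2**20, "MiB at 16 bits")   # 512.0 MiB
    print(int2 / 2**20, "MiB at 2 bits")    # 64.0 MiB
```

Because the footprint is linear in both batch size and bit width, any per-token saving translates directly into room for a larger batch at the same memory budget.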
Where Pith is reading between the lines
- If the log distribution proves effective, it could be tested on other compression ratios or model architectures beyond the reported benchmarks.
- The approach might allow longer context windows in LLMs by reducing per-token memory more efficiently.
- Combining LogQuant with dynamic context management techniques could further optimize inference in resource-constrained settings.
Load-bearing premise
The log-based filtering mechanism selectively compresses KV cache values across the full context without introducing its own accuracy losses or computational overheads.
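To make concrete where accuracy loss at 2 bits comes from, here is a minimal sketch of generic min-max quantization of one group of values to four levels. This is a textbook scheme used as a stand-in, not LogQuant's quantizer; the function names and the per-group layout are assumptions for this example.

```python
def quantize_2bit(values):
    """Asymmetric min-max quantization of one group to codes 0..3."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3      # three steps separate the four levels
    if scale == 0:
        scale = 1.0            # constant group: any positive scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_2bit(codes, scale, lo):
    """Map 2-bit codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    group = [-1.0, -0.2, 0.1, 2.0]
    codes, scale, lo = quantize_2bit(group)
    restored = dequantize_2bit(codes, scale, lo)
    print(codes, restored)
```

With only four levels, every value in a group can be off by up to half a step, which is why the choice of which tokens to quantize and which to keep at higher precision matters so much at this bit width.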
What would settle it
Running the same math and code completion benchmarks with LogQuant at 2-bit compression and comparing accuracy directly against existing 2-bit methods without the log filtering to check if the reported 40-200% gains hold.
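When running such a comparison, it helps to fix how the gain is computed. The sketch below assumes the reported 40% to 200% figures are relative improvements over a baseline score, which the abstract does not state explicitly; the accuracy values are made-up placeholders, not results from the paper.

```python
def relative_gain(baseline_acc, new_acc):
    """Relative improvement of new_acc over baseline_acc (0.4 means +40%)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    print(relative_gain(20.0, 28.0))   # a +40% relative gain
    print(relative_gain(10.0, 30.0))   # a +200% relative gain, i.e. 3x the baseline
```

Under this reading, a 200% gain means the new score is three times the baseline, so large relative gains can coexist with modest absolute scores when the baseline is low.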
Original abstract
We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. Implementation can be available in https://github.com/Concyclics/LogQuantKV.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LogQuant, a 2-bit quantization technique for KV cache in LLMs. It employs a log-based filtering mechanism to selectively compress KV values across the full context rather than relying on later-token importance assumptions or attention-pattern predictions. The approach is claimed to deliver 25% higher throughput, 60% larger batch sizes at fixed memory, and 40-200% accuracy gains on math and code-completion tasks relative to prior 2-bit methods, while integrating directly with the transformers library.
Significance. If the reported benchmarks are reproducible, the work offers a practical advance in memory-efficient LLM inference by providing a context-wide compression strategy that sidesteps common misprediction issues in token-selection baselines. The emphasis on end-to-end integration and concrete throughput/batch-size numbers strengthens its potential impact for deployment.
minor comments (2)
- The abstract states quantitative gains (25% throughput, 40-200% accuracy) without referencing the corresponding experimental tables or figures; adding explicit cross-references would improve readability.
- The final sentence of the abstract contains awkward phrasing ('Implementation can be available in https://github.com/Concyclics/LogQuantKV'); a direct statement that code will be released would be clearer.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the provided report, so there are no individual points requiring point-by-point rebuttal or manuscript changes at this stage.
Circularity Check
No significant circularity
full rationale
The manuscript describes an empirical 2-bit KV cache quantization method based on a log-distributed filtering approach, with performance claims resting on integration details and benchmark results rather than any derivation chain. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the provided text. The central claims reduce to the described algorithm and reported throughput/accuracy numbers, which are externally falsifiable and do not reduce to their own inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: the positions of the attention spikes follow a log distribution ... log-distributed token selection scheme ... base-2 logarithmic approach ... halving the density
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem costAlphaLog_high_calibrated_iff (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: LogQuant ... log-distributed high-attention pattern ... log2 sparsity selection
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.