
arxiv: 2503.19950 · v1 · pith:KRPZRQSK · submitted 2025-03-25 · cs.LG · cs.AI · cs.CL

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Pith reviewed 2026-05-22 22:04 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL
keywords KV cache, quantization, 2-bit, LLM inference, memory optimization, log distribution, accuracy preservation, large language models

The pith

LogQuant's log-based filtering enables 2-bit KV cache quantization with higher accuracy than prior token-importance methods across the full context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LogQuant as a 2-bit quantization method for the KV cache in LLM inference. It employs a log-based filtering mechanism to selectively compress the cache throughout the entire context instead of depending on assumptions about later tokens being more important or on attention pattern predictions. This leads to better performance at the same or lower memory usage. Sympathetic readers would care because the method reportedly increases throughput by 25 percent, allows 60 percent larger batch sizes, and boosts accuracy by 40 to 200 percent on difficult tasks like math and code completion without raising memory needs.

Core claim

By applying a log-based filtering mechanism, LogQuant selectively compresses the KV Cache across the entire context in 2 bits, achieving better performance with the same or even reduced memory footprint compared to existing methods that assume later tokens are more important or attempt to predict important tokens based on earlier attention patterns.
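One way to picture filtering "across the entire context" is a position rule whose density falls off logarithmically with distance from the newest token. The sketch below is an illustration only: the function name `log_spaced_positions`, the `window` parameter, and the stride-doubling rule are assumptions made here, not the selection rule from the paper, which this page does not spell out.

```python
def log_spaced_positions(seq_len: int, window: int = 2) -> list[int]:
    """Pick token positions to keep at original precision (hypothetical rule).

    Walking back from the newest token, keep `window` tokens at the current
    stride, then double the stride. Recent tokens are kept densely, older
    tokens sparsely, and the kept count grows roughly like log2(seq_len).
    """
    kept = []
    pos = seq_len - 1
    stride = 1
    while pos >= 0:
        for _ in range(window):
            if pos < 0:
                break
            kept.append(pos)
            pos -= stride
        stride *= 2
    return sorted(kept)


if __name__ == "__main__":
    print(log_spaced_positions(16))
    print(len(log_spaced_positions(1024)))
```

Under this assumed rule, every other position would be a candidate for 2-bit storage, so the full-precision budget stays small even for long contexts while still sampling old tokens.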

What carries the argument

The log-based filtering mechanism that selectively compresses KV cache values in a log-distributed manner for 2-bit quantization.
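For readers unfamiliar with what storing a value "in 2 bits" involves, here is a minimal, dependency-free sketch of generic asymmetric uniform 2-bit quantization. It is not the LogQuant kernel: the helper names `quantize_2bit`, `dequantize_2bit`, and `pack_codes` are hypothetical, and the per-group min/max scheme is a common baseline chosen for illustration.

```python
def quantize_2bit(values):
    """Map floats to codes 0..3 using the group's own min and max."""
    lo, hi = min(values), max(values)
    # Two bits give 4 levels, hence 3 intervals; guard against a flat group.
    scale = (hi - lo) / 3 or 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_2bit(codes, scale, lo):
    """Recover approximate floats from codes plus the stored scale and offset."""
    return [c * scale + lo for c in codes]


def pack_codes(codes):
    """Pack four 2-bit codes into each byte, zero-padding the last byte."""
    padded = codes + [0] * (-len(codes) % 4)
    return bytes(
        padded[i] | padded[i + 1] << 2 | padded[i + 2] << 4 | padded[i + 3] << 6
        for i in range(0, len(padded), 4)
    )


if __name__ == "__main__":
    group = [-1.0, -0.2, 0.1, 0.5, 2.0]
    codes, scale, lo = quantize_2bit(group)
    print(codes, dequantize_2bit(codes, scale, lo), pack_codes(codes))
```

The reconstruction error of each value is bounded by half the step size, which is why a single outlier in a group widens the step and degrades every other value in that group.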

If this is right

  • Throughput increases by 25% without additional memory consumption.
  • Batch size can be boosted by 60% at the same memory footprint.
  • Accuracy on math and code completion tasks improves by 40% to 200% at the same compression ratio.
  • Integration with frameworks like the transformers library is straightforward.
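The memory side of these claims can be sanity-checked with back-of-the-envelope arithmetic. The sketch below computes the KV cache size of one sequence at 16 bits and at 2 bits with per-group metadata. All model dimensions, the group size of 32, and the 16-bit scale and zero-point are hypothetical values chosen for illustration; none come from the paper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits,
                   group_size=None, meta_bits=16):
    """Bytes needed for the keys and values of one sequence.

    If `group_size` is given, each group of elements also stores a scale
    and a zero-point of `meta_bits` bits each (an assumed layout).
    """
    elements = 2 * tokens * layers * kv_heads * head_dim  # keys + values
    total_bits = elements * bits
    if group_size:
        total_bits += (elements // group_size) * 2 * meta_bits
    return total_bits // 8


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 4096 tokens.
    shape = dict(tokens=4096, layers=32, kv_heads=8, head_dim=128)
    fp16 = kv_cache_bytes(bits=16, **shape)
    q2 = kv_cache_bytes(bits=2, group_size=32, **shape)
    print(f"16-bit: {fp16 / 2**20:.0f} MiB, 2-bit: {q2 / 2**20:.0f} MiB, "
          f"ratio: {fp16 / q2:.2f}x")
```

Note that this cache-only ratio is a different quantity from the reported 60% batch-size gain, which is an end-to-end measurement that also includes model weights and any tokens kept at original precision.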

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the log distribution proves effective, it could be tested on other compression ratios or model architectures beyond the reported benchmarks.
  • The approach might allow longer context windows in LLMs by reducing per-token memory more efficiently.
  • Combining LogQuant with dynamic context management techniques could further optimize inference in resource-constrained settings.

Load-bearing premise

The log-based filtering mechanism selectively compresses KV cache values across the full context without introducing its own accuracy losses or computational overheads.
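This premise can be probed directly by measuring how much the attention distribution shifts when keys are quantized. The following is a minimal sketch under stated assumptions: it uses a generic per-vector 2-bit round trip (not the paper's scheme), a single query, and the L1 distance between softmax outputs as the error measure. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1 / math.sqrt(len(query))
    return softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])


def quantize_dequantize_2bit(vec):
    """Round-trip a vector through 4 uniform levels between its min and max."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0
    return [lo + step * min(3, max(0, round((v - lo) / step))) for v in vec]


def l1_attention_error(query, keys):
    """L1 distance between attention over exact keys and over 2-bit keys."""
    exact = attention_weights(query, keys)
    approx = attention_weights(query, [quantize_dequantize_2bit(k) for k in keys])
    return sum(abs(a - b) for a, b in zip(exact, approx))


if __name__ == "__main__":
    query = [0.3, -1.2, 0.8, 0.1]
    keys = [[0.9, -0.4, 0.2, 1.5], [-0.7, 0.6, 0.05, -1.1], [0.0, 1.0, 2.0, 3.0]]
    print(l1_attention_error(query, keys))
```

An error near 0 means quantization left the attention pattern intact; the maximum possible value is 2. Repeating the measurement with some positions left unquantized would show how much a given filtering rule recovers.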

What would settle it

Run the same math and code completion benchmarks with LogQuant at 2-bit compression, then compare accuracy directly against existing 2-bit methods that lack the log filtering, at the same compression ratio, to check whether the reported 40-200% gains hold.
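The bookkeeping for such a comparison is simple. The sketch below scores two sets of model outputs by exact match and reports the relative gain; it assumes the 40-200% figures are relative improvements (a 200-point absolute gain is impossible for an accuracy). The function names and the toy answers are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def relative_gain_percent(baseline_acc, candidate_acc):
    """Relative improvement of the candidate over the baseline, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100 * (candidate_acc - baseline_acc) / baseline_acc


if __name__ == "__main__":
    # Toy data standing in for benchmark answers; not results from the paper.
    references = ["18", "3", "70", "5"]
    baseline_outputs = ["18", "4", "7", "6"]
    candidate_outputs = ["18", "3", "70", "6"]
    base = exact_match_accuracy(baseline_outputs, references)
    cand = exact_match_accuracy(candidate_outputs, references)
    print(base, cand, relative_gain_percent(base, cand))
```

Because relative gains inflate when the baseline is near zero, reporting the absolute accuracies alongside the percentage makes the comparison easier to interpret.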

Figures

Figures reproduced from arXiv: 2503.19950 by Bingsheng He, Han Chen, Mian Lu, Pingyi Luo, Yuqiang Chen, Zicong Jiang, Zining Zhang.

Figure 1: The observed log-distribution pattern is evident not only in the magnitude of attention ...
Figure 2: The maximum attention score of each token position across four consecutive decod...
Figure 3: Attention distribution across different token positions, represented as boxplots based on ...
Figure 4: The attention coverage without the first two sink tokens for different selection meth...
Figure 5: Eviction and Quantization Loss on Attention Distribution
Figure 6: LogQuant's KV cache compression workflow. The number of reserved original-precision ...
Figure 7: Accuracy (EM) with different compression ratio in GSM8K tasks for different models.
Figure 8: Memory usage and throughput comparison between 2bit LogQuant and 16bit baseline ...

Original abstract

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. Implementation can be available in https://github.com/Concyclics/LogQuantKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces LogQuant, a 2-bit quantization technique for KV cache in LLMs. It employs a log-based filtering mechanism to selectively compress KV values across the full context rather than relying on later-token importance assumptions or attention-pattern predictions. The approach is claimed to deliver 25% higher throughput, 60% larger batch sizes at fixed memory, and 40-200% accuracy gains on math and code-completion tasks relative to prior 2-bit methods, while integrating directly with the transformers library.

Significance. If the reported benchmarks are reproducible, the work offers a practical advance in memory-efficient LLM inference by providing a context-wide compression strategy that sidesteps common misprediction issues in token-selection baselines. The emphasis on end-to-end integration and concrete throughput/batch-size numbers strengthens its potential impact for deployment.

minor comments (2)
  1. The abstract states quantitative gains (25% throughput, 40-200% accuracy) without referencing the corresponding experimental tables or figures; adding explicit cross-references would improve readability.
  2. The final sentence of the abstract contains awkward phrasing ('Implementation can be available in https://github.com/Concyclics/LogQuantKV'); a direct statement that code will be released would be clearer.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments appear in the provided report, so there are no individual points requiring point-by-point rebuttal or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical 2-bit KV cache quantization method based on a log-distributed filtering approach, with performance claims resting on integration details and benchmark results rather than any derivation chain. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the provided text. The central claims reduce to the described algorithm and reported throughput/accuracy numbers, which are externally falsifiable and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.0 · 5732 in / 1319 out tokens · 31318 ms · 2026-05-22T22:04:24.485785+00:00 · methodology


