PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Pith reviewed 2026-05-08 03:49 UTC · model grok-4.3
The pith
Multiple LLM agents can share one asymmetrically compressed KV cache, cutting KV cache memory by 97.7% in the reported 15-agent setup while keeping perplexity within 0.57% of baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PolyKV establishes that a shared, asymmetrically compressed KV cache pool can serve multiple independent inference agents by storing keys at int8 precision and values via TurboQuant 3-bit quantization, delivering a 97.7% memory reduction on Llama-3-8B with 15 agents at 4K context while limiting perplexity increase to 0.57% and preserving a mean BERTScore F1 of 0.928.
What carries the argument
The shared KV cache pool with asymmetric compression: int8 quantization on keys to preserve softmax stability, and TurboQuant (FWHT rotation followed by 3-bit Lloyd-Max quantization) on values, injected into multiple independent agent contexts.
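The value path described above (rotate, then quantize each coordinate to one of eight levels) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the per-vector RMS scaling, the sample-based Lloyd iteration, and the function names `fwht`, `lloyd_max_centroids`, `quantize_values`, and `dequantize_values` are all assumptions made for the example, and the input is synthetic Gaussian data rather than real value tensors.

```python
import numpy as np

def fwht(x):
    """Orthonormal Walsh-Hadamard transform over the last axis (power-of-two length)."""
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last axis must be a power of two"
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # Butterfly step: pair up adjacent blocks of width h.
        blocks = x.reshape(*lead, n // (2 * h), 2, h)
        a, b = blocks[..., 0, :], blocks[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)  # with this scaling the transform is its own inverse

def lloyd_max_centroids(levels=8, n_samples=200_000, iters=40, seed=0):
    """Estimate Lloyd-Max centroids for N(0,1) from samples (3 bits -> 8 levels)."""
    samples = np.random.default_rng(seed).standard_normal(n_samples)
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # nearest-centroid boundaries
        idx = np.searchsorted(edges, samples)
        centroids = np.array([samples[idx == j].mean() for j in range(levels)])
    return centroids

def quantize_values(v, centroids):
    """Rotate, rescale to roughly unit variance (assumed), and store 3-bit codes."""
    rot = fwht(v)
    scale = np.sqrt((rot ** 2).mean(axis=-1, keepdims=True)) + 1e-12
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(edges, rot / scale).astype(np.uint8)
    return codes, scale

def dequantize_values(codes, scale, centroids):
    """Look up centroids, undo the scaling, and rotate back."""
    return fwht(centroids[codes] * scale)

rng = np.random.default_rng(1)
values = rng.standard_normal((16, 128))        # synthetic stand-in for value vectors
centroids = lloyd_max_centroids()
codes, scale = quantize_values(values, centroids)
recon = dequantize_values(codes, scale, centroids)
rel_err = np.linalg.norm(recon - values) / np.linalg.norm(values)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The codes here are kept one per byte for readability; a real pool would pack them to three bits each, and how the paper stores per-vector scales is not stated in the abstract.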
If this is right
- The memory footprint of the shared pool remains constant regardless of how many agents read it.
- Perplexity degradation stays flat or improves slightly as context length grows from hundreds to thousands of tokens.
- The compression ratio of approximately 2.91x holds across the tested model scales and context sizes.
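The reported figures are mutually consistent under simple arithmetic, which the sketch below reproduces. It assumes an fp16 baseline (16 bits each for a key and a value element) and ignores any scale or metadata overhead; that reading is an inference from the numbers, not a derivation stated in the paper, and the function names are hypothetical.

```python
FP16_BITS = 16   # assumed baseline precision per element
KEY_BITS = 8     # int8 keys
VALUE_BITS = 3   # 3-bit values

def compression_ratio(key_bits=KEY_BITS, value_bits=VALUE_BITS, baseline_bits=FP16_BITS):
    """Bits for one baseline key/value pair divided by bits for one compressed pair."""
    return (2 * baseline_bits) / (key_bits + value_bits)

def kv_memory_gb(per_agent_gb, n_agents, shared, ratio=1.0):
    """KV memory when each agent holds a copy versus all agents reading one pool."""
    copies = 1 if shared else n_agents
    return per_agent_gb * copies / ratio

n_agents = 15
baseline_total_gb = 19.8                       # reported: 15 separate caches
per_agent_gb = baseline_total_gb / n_agents    # implied size of one uncompressed cache

ratio = compression_ratio()
shared_gb = kv_memory_gb(per_agent_gb, n_agents, shared=True, ratio=ratio)
reduction = 1 - shared_gb / baseline_total_gb

print(f"ratio={ratio:.2f}x shared={shared_gb:.2f} GB reduction={reduction:.1%}")
```

Under these assumptions most of the saving comes from sharing (a factor of 15) rather than from compression (a factor of roughly 2.91), and the shared figure does not change if `n_agents` grows.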
Where Pith is reading between the lines
- The design could support substantially larger numbers of concurrent agents on the same hardware without proportional memory growth.
- The asymmetric split suggests that preserving key precision is more important than value precision for maintaining attention quality.
- The shared-pool pattern may combine with existing KV eviction or paging methods to extend effective context length further.
Load-bearing premise
The assumption that int8 keys and 3-bit values retain enough information for the attention mechanism to produce stable generations across models, tasks, and context lengths.
What would settle it
Running the same 15-agent setup on additional models or on downstream tasks such as multi-turn dialogue or code completion and measuring whether the reported perplexity delta produces measurable drops in task accuracy.
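For readers checking what a perplexity delta measures, here is a small sketch of the computation from per-token negative log-likelihoods. The helper names and the NLL values are placeholders chosen for illustration, not data from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_delta_percent(baseline_nlls, compressed_nlls):
    """Relative perplexity change of the compressed run against the baseline, in percent."""
    base = perplexity(baseline_nlls)
    return 100.0 * (perplexity(compressed_nlls) - base) / base

# Placeholder NLLs in nats; a uniform shift of ln(1.0057) yields a +0.57% delta.
baseline = [2.0, 1.5, 2.5]
shift = math.log(1.0057)
compressed = [nll + shift for nll in baseline]

delta = ppl_delta_percent(baseline, compressed)
print(f"PPL delta: {delta:+.2f}%")
```

A +0.57% delta therefore corresponds to an average increase of about 0.0057 nats per token, which is why a small perplexity change can still leave task-level accuracy an open question.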
Original abstract
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. PolyKV enables multiple concurrent LLM inference agents to share a single asymmetrically compressed KV cache pool rather than allocating per-agent caches. Keys are quantized to int8 (q8_0) for softmax stability while values use TurboQuant (FWHT rotation + 3-bit Lloyd-Max quantization with N(0,1) centroids). The system injects the shared compressed cache into N independent HuggingFace DynamicCache contexts. Across SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct, context lengths 600-7194 tokens, and up to 15 agents, it reports a stable 2.91x compression ratio. On Llama-3-8B with 15 agents at 4K context, KV cache memory drops from 19.8 GB to 0.45 GB (97.7% reduction) with +0.57% perplexity degradation and mean BERTScore F1 of 0.928; the PPL delta does not worsen with agent count and can improve with longer coherent contexts.
Significance. If the reported memory-quality trade-off holds, PolyKV would enable substantially more efficient multi-agent LLM serving by eliminating redundant KV storage through sharing plus lossy asymmetric compression. The work supplies direct empirical measurements of memory usage, perplexity, and BERTScore across two model scales, multiple context lengths, and agent counts, which constitutes a concrete strength for an applied systems paper.
major comments (2)
- [Evaluation] Evaluation section: the central claim of 'maintaining' generation quality rests on +0.57% PPL degradation and 0.928 BERTScore, yet the manuscript supplies no error bars, no description of the exact datasets or tokenization used for perplexity, no statistical tests, and no baseline comparisons against other KV compression or sharing methods, leaving the magnitude and reliability of the quality numbers only moderately supported.
- [Evaluation] Evaluation section: no downstream task results (question answering accuracy, dialogue coherence, or long-context retrieval) are reported to test whether the 3-bit value quantization distorts the softmax(QK^T)V weighted sum in ways that PPL and BERTScore fail to detect, even though the paper notes that PPL delta remains stable across agent counts.
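The second comment concerns how value error passes through the attention output. As a rough sketch of the mechanism: softmax weights are non-negative and sum to one, so the output error is a weighted average of the per-token value errors and cannot exceed the worst of them in norm. The example uses synthetic Gaussian noise as a stand-in for quantization error (not TurboQuant output), and all names are hypothetical.

```python
import numpy as np

def softmax_weights(q, k):
    """Row-wise softmax of scaled dot-product scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 64))       # 4 query positions
k = rng.standard_normal((256, 64))     # 256 cached keys
v = rng.standard_normal((256, 64))     # 256 cached values
v_err = 0.2 * rng.standard_normal(v.shape)   # synthetic per-token value error

weights = softmax_weights(q, k)
out_err = weights @ (v + v_err) - weights @ v   # error in the attention output
out_err_norm = np.linalg.norm(out_err, axis=-1)
worst_token_err = np.linalg.norm(v_err, axis=-1).max()

print("max output error norm:", out_err_norm.max())
print("worst per-token value error norm:", worst_token_err)
```

This bound only covers a single attention layer with exact keys. It says nothing about error accumulating across layers or about task accuracy, which is the gap the referee points to.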
minor comments (1)
- The abstract lists context lengths as '600-7,194 tokens' but does not enumerate the precise lengths used in each experiment, which would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the evaluation section. We have revised the manuscript to improve clarity on our metrics and to explicitly discuss limitations. Point-by-point responses are provided below.
- Referee: [Evaluation] Evaluation section: the central claim of 'maintaining' generation quality rests on +0.57% PPL degradation and 0.928 BERTScore, yet the manuscript supplies no error bars, no description of the exact datasets or tokenization used for perplexity, no statistical tests, and no baseline comparisons against other KV compression or sharing methods, leaving the magnitude and reliability of the quality numbers only moderately supported.
Authors: We agree that additional statistical details and context would strengthen the presentation. In the revised manuscript we have added error bars computed over five independent runs with varied context sampling seeds, an explicit description of the perplexity dataset (a 500-sequence subset drawn from the validation split of The Pile, tokenized with the model's native tokenizer), and a statement that formal statistical significance tests were omitted because the observed deltas are small and directionally consistent across all configurations. We have also included a direct baseline comparison against the uncompressed per-agent KV cache (the natural alternative) and against a uniform 4-bit symmetric quantization scheme. These additions supply the requested context while preserving the original experimental outcomes.
revision: partial
- Referee: [Evaluation] Evaluation section: no downstream task results (question answering accuracy, dialogue coherence, or long-context retrieval) are reported to test whether the 3-bit value quantization distorts the softmax(QK^T)V weighted sum in ways that PPL and BERTScore fail to detect, even though the paper notes that PPL delta remains stable across agent counts.
Authors: We acknowledge that task-specific metrics could offer further reassurance about the effect of value quantization on the attention output. Our evaluation deliberately employs perplexity (a direct measure of next-token modeling fidelity) and BERTScore (a semantic similarity metric on generated text) because they are standard, inexpensive proxies for the quality impact of KV-cache compression. The reported stability of the PPL delta across agent counts, together with its improvement on longer coherent contexts, indicates that any distortion does not compound under sharing. We have expanded the evaluation discussion and added a dedicated limitations paragraph that flags the absence of QA, dialogue, or retrieval benchmarks as future work. New downstream experiments lie outside the primary scope of this systems paper, which centers on memory reduction through shared asymmetric compression.
revision: partial
- Results from additional downstream tasks (question answering accuracy, dialogue coherence, long-context retrieval) that were not part of the original experimental campaign.
Circularity Check
No circularity: purely empirical system with direct measurements
The paper presents an engineering system for shared KV cache compression (int8 keys + TurboQuant 3-bit values via FWHT + Lloyd-Max) and evaluates it through direct runtime measurements of memory usage, perplexity, and BERTScore across models, context lengths, and agent counts. No derivation chain, first-principles equations, or predictions exist that could reduce to fitted inputs or self-citations. All reported gains (e.g., 97.7% memory reduction with +0.57% PPL) are explicit experimental outcomes, not constructed by definition or prior self-work.
Axiom & Free-Parameter Ledger
free parameters (1)
- Lloyd-Max quantization centroids = N(0,1)
axioms (1)
- domain assumption: Fast Walsh-Hadamard Transform followed by Lloyd-Max quantization yields acceptable error for value vectors in attention.
Forward citations
Cited by 2 Pith papers
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.
- HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.