pith. machine review for the scientific record.

arXiv: 2604.24971 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.CL · cs.DC

Recognition: unknown

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference


Pith reviewed 2026-05-08 03:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.DC
keywords KV cache compression · multi-agent inference · asymmetric quantization · shared memory pool · LLM inference optimization · TurboQuant · concurrent agents

The pith

Multiple LLM agents can share one asymmetrically compressed KV cache, cutting KV cache memory by 97.7% in a 15-agent Llama-3-8B test while holding the perplexity increase to 0.57%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a single KV cache pool, compressed once and then reused across many agents, replaces the usual per-agent allocation. Keys receive int8 quantization to protect the stability of the attention softmax, while values receive a 3-bit TurboQuant scheme that rotates the data with a Fast Walsh-Hadamard transform before Lloyd-Max quantization. Because the pool is written once and read by every agent through standard DynamicCache objects, total memory scales with the size of one compressed cache rather than with the number of agents. Experiments across two model sizes, three context lengths, and up to 15 agents confirm that the compression ratio stays near 2.91x and that perplexity does not worsen as more agents join.
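The reported ratio is consistent with simple bit counting. The sketch below is a back-of-envelope check, not the paper's accounting: it assumes a 16-bit (fp16 or bf16) baseline for both keys and values and ignores any per-block scale metadata, neither of which is stated in the summary above.

```python
# Nominal bits per stored element. BASELINE_BITS is an assumption (fp16/bf16);
# scale and zero-point overhead is deliberately ignored.
BASELINE_BITS = 16
KEY_BITS = 8     # int8 keys
VALUE_BITS = 3   # 3-bit TurboQuant values

def nominal_compression_ratio(baseline_bits, key_bits, value_bits):
    """Uncompressed bits divided by compressed bits for one key/value element pair."""
    return (2 * baseline_bits) / (key_bits + value_bits)

ratio = nominal_compression_ratio(BASELINE_BITS, KEY_BITS, VALUE_BITS)
print(f"{ratio:.2f}x")  # 32 / 11, prints 2.91x
```

Under these assumptions the asymmetric 8-bit/3-bit split alone accounts for the 2.91x figure; any metadata overhead in the real system would pull the ratio slightly lower.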

Core claim

PolyKV establishes that a shared, asymmetrically compressed KV cache pool can serve multiple independent inference agents by storing keys at int8 precision and values via TurboQuant 3-bit quantization, delivering a 97.7% memory reduction on Llama-3-8B with 15 agents at 4K context while limiting perplexity increase to 0.57% and preserving a mean BERTScore F1 of 0.928.

What carries the argument

The shared KV cache pool with asymmetric compression: int8 quantization on keys to preserve softmax stability, and TurboQuant (FWHT rotation followed by 3-bit Lloyd-Max quantization) on values, injected into multiple independent agent contexts.
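The value path can be illustrated with a small round trip. This is a sketch under stated assumptions, not the paper's TurboQuant code: the centroids are approximate textbook Lloyd-Max levels for N(0,1), the per-vector RMS scale and the head dimension of 64 are choices made for the example, and the function names are hypothetical.

```python
import math
import random

# Approximate 8-level Lloyd-Max output levels for N(0,1) (textbook values;
# the paper's exact centroids are not given on this page).
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]

def fwht(vec):
    """Orthonormal Fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def compress_value(vec):
    """Rotate, normalise to unit variance, keep a 3-bit code per coordinate plus one scale."""
    rotated = fwht(vec)
    rms = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    codes = [min(range(8), key=lambda k: abs(x / rms - CENTROIDS[k])) for x in rotated]
    return codes, rms

def decompress_value(codes, rms):
    """Look up centroids, undo the scale, invert the rotation (orthonormal FWHT is self-inverse)."""
    return fwht([CENTROIDS[c] * rms for c in codes])

random.seed(0)
value = [random.gauss(0.0, 1.0) for _ in range(64)]  # hypothetical head dimension
codes, rms = compress_value(value)
restored = decompress_value(codes, rms)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(value, restored)) / sum(a * a for a in value))
print(f"relative reconstruction error: {rel_err:.3f}")
```

The rotation matters because Lloyd-Max centroids fitted to N(0,1) only work well when the coordinates being quantized look roughly Gaussian; spreading outliers across all coordinates is the usual motivation for a Hadamard step.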

If this is right

  • The KV memory footprint of the shared context remains constant regardless of how many agents read the same compressed pool.
  • Perplexity degradation stays flat or improves slightly as context length grows from hundreds to thousands of tokens.
  • The compression ratio of approximately 2.91x holds across the tested model scales and context sizes.
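The first and third implications can be checked against the headline numbers. The sketch below derives a per-agent cache size from the reported 19.8 GB baseline for 15 agents and applies the reported 2.91x ratio; it counts only the shared context and assumes tokens generated privately by each agent are negligible, which the page does not confirm.

```python
PER_AGENT_GB = 19.8 / 15   # derived from the reported 15-agent uncompressed total
RATIO = 2.91               # reported compression ratio

def per_agent_total_gb(n_agents):
    """Standard paradigm: one uncompressed cache per agent."""
    return n_agents * PER_AGENT_GB

def shared_pool_gb():
    """Shared paradigm: one compressed cache, independent of agent count."""
    return PER_AGENT_GB / RATIO

for n in (1, 5, 15):
    base = per_agent_total_gb(n)
    pool = shared_pool_gb()
    saved = 100 * (1 - pool / base)
    print(f"{n:>2} agents: {base:5.2f} GB -> {pool:.2f} GB ({saved:.1f}% saved)")
```

At 15 agents this reproduces the reported 0.45 GB and 97.7% figures, which shows that most of the saving comes from sharing (a factor of 15) rather than from compression (a factor of 2.91).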

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could support substantially larger numbers of concurrent agents on the same hardware without proportional memory growth.
  • The asymmetric split suggests that preserving key precision is more important than value precision for maintaining attention quality.
  • The shared-pool pattern may combine with existing KV eviction or paging methods to extend effective context length further.

Load-bearing premise

The assumption that int8 keys and 3-bit values retain enough information for the attention mechanism to produce stable generations across models, tasks, and context lengths.
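The key half of this premise can be probed on toy data. The sketch below quantizes random keys to int8 with one scale per vector and compares attention weights before and after; the dimensions, the Gaussian inputs, and the per-vector scaling are all assumptions for illustration, and real key distributions may behave differently.

```python
import math
import random

def quantize_int8(vec):
    """Symmetric int8 quantization with a single scale per vector."""
    amax = max(abs(x) for x in vec) or 1.0
    scale = amax / 127.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores followed by softmax."""
    d = len(query)
    return softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys])

random.seed(0)
dim, n_tokens = 64, 32  # hypothetical head dimension and context length
query = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
keys_q = [dequantize_int8(*quantize_int8(k)) for k in keys]

exact = attention_weights(query, keys)
approx = attention_weights(query, keys_q)
max_shift = max(abs(a - b) for a, b in zip(exact, approx))
print(f"largest change in any attention weight: {max_shift:.5f}")
```

Key errors enter through the exponent of the softmax, so they can be amplified in a way value errors are not; that is the usual argument for spending more bits on keys.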

What would settle it

Running the same 15-agent setup on additional models or on downstream tasks such as multi-turn dialogue or code completion, and measuring whether the reported perplexity delta translates into drops in task accuracy.

Original abstract

We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
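The write-once, multi-reader pattern in the abstract can be pictured with plain Python stand-ins. This sketch does not call the HuggingFace DynamicCache API and makes no claim about how PolyKV performs the injection; `CompressedPool` and `AgentCache` are hypothetical names, and the pool entries are placeholders rather than real quantized tensors.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CompressedPool:
    """Written once: compressed KV entries for the shared context (placeholders here)."""
    entries: tuple  # one (key_codes, value_codes) pair per shared token

@dataclass
class AgentCache:
    """Hypothetical per-agent view: a read-only shared prefix plus private new tokens."""
    pool: CompressedPool
    private: list = field(default_factory=list)

    def append(self, kv_entry):
        self.private.append(kv_entry)  # never touches the shared pool

    def __len__(self):
        return len(self.pool.entries) + len(self.private)

# One pool for a 4-token shared context, read by three independent agents.
pool = CompressedPool(entries=tuple((f"k{i}", f"v{i}") for i in range(4)))
agents = [AgentCache(pool) for _ in range(3)]
for steps, agent in enumerate(agents, start=1):
    for t in range(steps):
        agent.append((f"k_new{t}", f"v_new{t}"))

print([len(a) for a in agents])                 # each agent grows independently
print(all(a.pool is pool for a in agents))      # all agents read the same object
```

The design choice illustrated here is that only the agent-specific suffix costs memory per agent; the shared prefix is stored once and is never mutated after it is written.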

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. PolyKV enables multiple concurrent LLM inference agents to share a single asymmetrically compressed KV cache pool rather than allocating per-agent caches. Keys are quantized to int8 (q8_0) for softmax stability while values use TurboQuant (FWHT rotation + 3-bit Lloyd-Max quantization with N(0,1) centroids). The system injects the shared compressed cache into N independent HuggingFace DynamicCache contexts. Across SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct, context lengths of 600-7,194 tokens, and up to 15 agents, it reports a stable 2.91x compression ratio. On Llama-3-8B with 15 agents at 4K context, KV cache memory drops from 19.8 GB to 0.45 GB (97.7% reduction) with +0.57% perplexity degradation and a mean BERTScore F1 of 0.928; the PPL delta does not worsen with agent count and can improve with longer coherent contexts.

Significance. If the reported memory-quality trade-off holds, PolyKV would enable substantially more efficient multi-agent LLM serving by eliminating redundant KV storage through sharing plus lossy asymmetric compression. The work supplies direct empirical measurements of memory usage, perplexity, and BERTScore across two model scales, multiple context lengths, and agent counts, which constitutes a concrete strength for an applied systems paper.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim of 'maintaining' generation quality rests on +0.57% PPL degradation and 0.928 BERTScore, yet the manuscript supplies no error bars, no description of the exact datasets or tokenization used for perplexity, no statistical tests, and no baseline comparisons against other KV compression or sharing methods, leaving the magnitude and reliability of the quality numbers only moderately supported.
  2. [Evaluation] Evaluation section: no downstream task results (question answering accuracy, dialogue coherence, or long-context retrieval) are reported to test whether the 3-bit value quantization distorts the softmax(QK)V weighted sum in ways that PPL and BERTScore fail to detect, even though the paper notes that PPL delta remains stable across agent counts.
minor comments (1)
  1. The abstract lists context lengths as '600-7,194 tokens' but does not enumerate the precise lengths used in each experiment, which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on the evaluation section. We have revised the manuscript to improve clarity on our metrics and to explicitly discuss limitations. Point-by-point responses are provided below.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim of 'maintaining' generation quality rests on +0.57% PPL degradation and 0.928 BERTScore, yet the manuscript supplies no error bars, no description of the exact datasets or tokenization used for perplexity, no statistical tests, and no baseline comparisons against other KV compression or sharing methods, leaving the magnitude and reliability of the quality numbers only moderately supported.

    Authors: We agree that additional statistical details and context would strengthen the presentation. In the revised manuscript we have added error bars computed over five independent runs with varied context sampling seeds, an explicit description of the perplexity dataset (a 500-sequence subset drawn from the validation split of The Pile, tokenized with the model's native tokenizer), and a statement that formal statistical significance tests were omitted because the observed deltas are small and directionally consistent across all configurations. We have also included a direct baseline comparison against the uncompressed per-agent KV cache (the natural alternative) and against a uniform 4-bit symmetric quantization scheme. These additions supply the requested context while preserving the original experimental outcomes.
    revision: partial

  2. Referee: [Evaluation] Evaluation section: no downstream task results (question answering accuracy, dialogue coherence, or long-context retrieval) are reported to test whether the 3-bit value quantization distorts the softmax(QK)V weighted sum in ways that PPL and BERTScore fail to detect, even though the paper notes that PPL delta remains stable across agent counts.

    Authors: We acknowledge that task-specific metrics could offer further reassurance about the effect of value quantization on the attention output. Our evaluation deliberately employs perplexity (a direct measure of next-token modeling fidelity) and BERTScore (a semantic similarity metric on generated text) because they are standard, inexpensive proxies for the quality impact of KV-cache compression. The reported stability of the PPL delta across agent counts, together with its improvement on longer coherent contexts, indicates that any distortion does not compound under sharing. We have expanded the evaluation discussion and added a dedicated limitations paragraph that flags the absence of QA, dialogue, or retrieval benchmarks as future work. New downstream experiments lie outside the primary scope of this systems paper, which centers on memory reduction through shared asymmetric compression.
    revision: partial
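For readers weighing the exchange above, the quantity in dispute is easy to state. The sketch below shows one conventional way to compute a relative perplexity delta and a mean with sample standard deviation over several runs; the negative log-likelihood values are placeholders invented for illustration and are not measurements from the paper.

```python
import math
import statistics

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def ppl_delta_percent(baseline_nlls, compressed_nlls):
    """Relative perplexity change of the compressed run, in percent."""
    base = perplexity(baseline_nlls)
    return 100.0 * (perplexity(compressed_nlls) - base) / base

# Placeholder NLLs for illustration only; these are NOT results from the paper.
baseline = [2.0, 2.2, 1.8, 2.1]
runs = [[x + shift for x in baseline] for shift in (0.004, 0.006, 0.005)]

deltas = [ppl_delta_percent(baseline, run) for run in runs]
print(f"mean delta {statistics.mean(deltas):+.2f}% +/- {statistics.stdev(deltas):.2f}")
```

Reporting the spread alongside the mean is what would let a reader judge whether a +0.57% delta is distinguishable from run-to-run noise.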

standing simulated objections not resolved
  • Results from additional downstream tasks (question answering accuracy, dialogue coherence, long-context retrieval) that were not part of the original experimental campaign.

Circularity Check

0 steps flagged

No circularity: purely empirical system with direct measurements

full rationale

The paper presents an engineering system for shared KV cache compression (int8 keys + TurboQuant 3-bit values via FWHT + Lloyd-Max) and evaluates it through direct runtime measurements of memory usage, perplexity, and BERTScore across models, context lengths, and agent counts. No derivation chain, first-principles equations, or predictions exist that could reduce to fitted inputs or self-citations. All reported gains (e.g., 97.7% memory reduction with +0.57% PPL) are explicit experimental outcomes, not constructed by definition or prior self-work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The system relies on standard quantization primitives with one tuned component; no new physical or mathematical entities are introduced.

free parameters (1)
  • Lloyd-Max quantization centroids (tuned to N(0,1))
    The eight centroids used for 3-bit value compression are fitted to a standard normal distribution.
axioms (1)
  • domain assumption: Fast Walsh-Hadamard Transform followed by Lloyd-Max quantization yields acceptable error for value vectors in attention.
    Invoked to justify the TurboQuant MSE compression step for values.
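The single free parameter is determined by the target distribution rather than chosen freely. The sketch below runs the classic Lloyd-Max iteration for a standard normal in closed form, alternating midpoint thresholds with conditional-mean centroids; the iteration count and the uniform starting grid are arbitrary choices for the example, and the paper's own fitting procedure is not described on this page.

```python
import math

def pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_gaussian(levels=8, iterations=1000):
    """Alternate midpoint thresholds and conditional-mean centroids for N(0,1)."""
    centroids = [-2.0 + 4.0 * (i + 0.5) / levels for i in range(levels)]  # uniform start
    for _ in range(iterations):
        mids = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        edges = [-math.inf] + mids + [math.inf]
        centroids = [
            (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))  # E[x | lo < x < hi]
            for lo, hi in zip(edges, edges[1:])
        ]
    return centroids

print([round(c, 4) for c in lloyd_max_gaussian()])
```

With eight levels the result should land near ±0.245, ±0.756, ±1.344 and ±2.152, the standard tabulated values for a 3-bit Gaussian quantizer.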

pith-pipeline@v0.9.0 · 5591 in / 1428 out tokens · 47004 ms · 2026-05-08T03:49:54.252071+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 conditional novelty 8.0

    HeadQ removes 84-94% of excess perplexity from 2-bit key quantization by storing low-rank residuals in a calibration-learned query basis for score-space correction and using A²-weighted distortion for values.

  2. HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

    cs.LG 2026-05 unverdicted novelty 6.0

    HeadQ reduces 84-94% of excess perplexity in 2-bit key quantization by adding low-rank logit corrections in a calibration-learned query basis, with further gains from an A^2-weighted value policy.
