pith. machine review for the scientific record.

arxiv: 2605.03884 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.MA


QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs


Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords quantized KV cache · multi-agent LLMs · on-device inference · cache handoff · mixed-precision quantization · time-to-first-token · edge AI systems

The pith

QKVShare enables quantized KV-cache handoff between on-device LLM agents by using token-level mixed-precision allocation and a CacheCard format, reducing time-to-first-token versus full re-prefill while keeping accuracy competitive under repeated handoffs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM systems on edge devices must share latent context when one agent hands off to another, but current options force either expensive full re-prefill or costly full-precision KV cache transfer. QKVShare introduces a framework that applies token-level mixed-precision quantization to build a compact self-contained CacheCard and then injects it directly into the receiving agent's cache. Experiments on 150 GSM8K problems with an 8B instruct model show the approach lowers TTFT at every tested context length and maintains competitive quality even after repeated handoffs, with adaptive quantization pulling ahead of uniform quantization in deeper-hop higher-budget cases. If the method scales, multi-agent workflows on phones and similar devices could exchange state more fluidly without repeated full recomputation.

Core claim

QKVShare combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path to enable quantized KV-cache handoff. On 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher-budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms versus 150.2 ms at nominal 1K context to 397.1 ms versus 1029.7 ms at nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the QKVShare latency path.
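Taken at face value, the quoted TTFT figures translate into the following speedups (plain arithmetic on the numbers above, not additional measurements):

```python
# TTFT (ms) reported in the paper: (QKVShare handoff, full re-prefill).
ttft_ms = {"1K": (130.7, 150.2), "8K": (397.1, 1029.7)}

for ctx, (qkvshare, reprefill) in ttft_ms.items():
    speedup = reprefill / qkvshare
    saved = reprefill - qkvshare
    print(f"{ctx} context: {speedup:.2f}x faster, {saved:.1f} ms saved per handoff")
```

This works out to roughly 1.15x at 1K context and 2.59x at 8K: the gap widens with context length, consistent with re-prefill cost growing with the number of tokens recomputed.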

What carries the argument

QKVShare framework using token-level mixed-precision allocation to produce a CacheCard for handoff followed by direct injection into the receiving agent's KV cache
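The paper does not publish the allocator's scoring or bit-assignment rules (the referee flags this below), so the following is only a minimal sketch of the general shape such a pipeline could take: rank tokens by an importance score, give higher-ranked tokens more bits, quantize each token's KV rows symmetrically, and pack everything needed for dequantization into a self-contained dict standing in for a CacheCard. The function names, the thirds-based bit split, and the importance heuristic are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def build_cachecard(keys, values, importance, bit_options=(2, 4, 8)):
    """Sketch: per-token bit allocation by importance, then symmetric quantization.

    keys/values: (num_tokens, head_dim) float arrays for one layer/head.
    importance: per-token scores (higher = more precision). Heuristic only.
    """
    num_tokens = keys.shape[0]
    # Rank tokens by importance; top third gets 8 bits, middle 4, bottom 2.
    order = np.argsort(-importance)
    bits = np.empty(num_tokens, dtype=int)
    for group, b in zip(np.array_split(order, 3), sorted(bit_options, reverse=True)):
        bits[group] = b

    card = {"bits": bits, "k": [], "v": [], "k_scale": [], "v_scale": []}
    for t in range(num_tokens):
        qmax = 2 ** (bits[t] - 1) - 1
        for name, row in (("k", keys[t]), ("v", values[t])):
            amax = np.abs(row).max()
            scale = amax / qmax if amax > 0 else 1.0
            q = np.clip(np.round(row / scale), -qmax - 1, qmax).astype(np.int8)
            card[name].append(q)
            card[name + "_scale"].append(scale)
    return card  # self-contained: quantized rows + scales + bit widths

def inject_cachecard(card):
    """Dequantize back to float rows, as the receiving agent's cache would see them."""
    keys = np.stack([q.astype(np.float64) * s for q, s in zip(card["k"], card["k_scale"])])
    values = np.stack([q.astype(np.float64) * s for q, s in zip(card["v"], card["v_scale"])])
    return keys, values
```

In a real system the dequantized rows would be written into the receiving model's per-layer KV cache (e.g. a HuggingFace cache object) rather than returned as arrays; that injection path is exactly the part the paper names HuggingFace-compatible.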

Load-bearing premise

Token-level mixed-precision allocation combined with CacheCard injection preserves enough model quality across repeated handoffs, even though the paper offers no extensive controller ablations and no direct runtime comparisons against other handoff strategies to support this.

What would settle it

A controlled test that runs ten successive handoffs on the same 150 GSM8K problems and records final accuracy more than five points below the no-handoff baseline would falsify the quality-preservation claim.
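As a toy illustration of why the multi-hop behavior needs measuring rather than assuming: under one simple scheme (symmetric per-row round-to-nearest quantization with the scale rebuilt from the current values — an assumption, not necessarily the authors' scheme), the quantize-dequantize pass is idempotent, so error appears on the first handoff and then stays flat rather than compounding.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """One symmetric, per-row, round-to-nearest quantize/dequantize pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / qmax
    return np.round(x / scale) * scale

def handoff_errors(cache, hops=10, bits=4):
    """Max absolute error vs. the original cache after each successive handoff."""
    state, errors = cache, []
    for _ in range(hops):
        state = quantize_dequantize(state, bits)
        errors.append(float(np.abs(state - cache).max()))
    return errors
```

With this fixed-grid scheme the error after hop 10 equals the error after hop 1, because re-quantizing already-quantized values maps them back onto the same grid. An adaptive controller that re-allocates bits per hop need not share this property, which is precisely what the proposed ten-hop accuracy test would probe.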

Figures

Figures reproduced from arXiv: 2605.03884 by Pratik Honavar, Tejpratap GVSL.

Figure 2. Representative E1 accuracy curves for FP16 sharing, uniform quantization, and topology
Original abstract

Multi-agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re-prefill or full-precision KV transfer. We study QKVShare, a framework for quantized KV-cache handoff between agents that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the current QKVShare latency path. These results position quantized KV handoff as a promising on-device systems direction while also highlighting the need for stronger controller ablations and apples-to-apples runtime comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes QKVShare, a framework for efficient quantized KV-cache handoff between agents in multi-agent on-device LLM systems. It integrates token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible injection path. On 150 GSM8K problems with Llama-3.1-8B-Instruct, the work reports that adaptive quantization remains competitive under repeated handoffs (with clearest gains vs. uniform quantization in deeper-hop, higher-budget regimes) while reducing time-to-first-token (TTFT) relative to full re-prefill across contexts (e.g., 130.7 ms vs. 150.2 ms at 1K; 397.1 ms vs. 1029.7 ms at 8K), with post-injection generation dominating latency.

Significance. If the quality-preservation claim holds after adding the missing metrics, QKVShare could meaningfully advance practical multi-agent LLM deployments on edge devices by enabling low-latency context sharing without full re-prefill or high-precision KV transfer. The concrete TTFT comparisons and narrower empirical framing are positive; the work also correctly flags the need for stronger controller ablations.

major comments (2)
  1. [Evaluation results] Evaluation on 150 GSM8K problems: the central claim that 'adaptive quantization remains competitive under repeated handoff' is not supported by per-condition accuracy numbers, degradation deltas, or variance across hop counts and budgets. Only a qualitative statement is given, leaving the quality-preservation half of the contribution unquantified and load-bearing for the overall result.
  2. [Methods / adaptive allocation] Controller and allocation description: no details are provided on how the token-level adaptive bit allocation is decided, and the manuscript itself notes the absence of extensive controller ablations or apples-to-apples runtime comparisons against other handoff strategies. This prevents isolation of the mixed-precision allocator's contribution from other factors (model, task, injection path).
minor comments (3)
  1. [Abstract] Abstract: 're prefill' should be hyphenated as 're-prefill' for consistency with the rest of the text.
  2. [Latency evaluation] The latency results lack error bars, number of runs, or statistical tests, which would strengthen the reported TTFT reductions.
  3. [Results discussion] Post-hoc emphasis on deeper-hop/higher-budget settings for the clearest gains should be accompanied by a pre-specified analysis plan or full cross-condition table to avoid selection concerns.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of QKVShare to advance practical multi-agent LLM deployments on edge devices. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses
  1. Referee: [Evaluation results] Evaluation on 150 GSM8K problems: the central claim that 'adaptive quantization remains competitive under repeated handoff' is not supported by per-condition accuracy numbers, degradation deltas, or variance across hop counts and budgets. Only a qualitative statement is given, leaving the quality-preservation half of the contribution unquantified and load-bearing for the overall result.

    Authors: We agree that the quality-preservation claim requires stronger quantitative backing beyond the qualitative statement in the current draft. In the revised manuscript we will add a dedicated results table presenting per-condition accuracy for adaptive versus uniform quantization across hop counts (1-5) and budget regimes, including degradation deltas relative to the full-precision baseline and standard deviations over the 150 GSM8K problems. These data are available from our existing experiments and will directly quantify the clearest gains in deeper-hop, higher-budget settings. revision: yes

  2. Referee: [Methods / adaptive allocation] Controller and allocation description: no details are provided on how the token-level adaptive bit allocation is decided, and the manuscript itself notes the absence of extensive controller ablations or apples-to-apples runtime comparisons against other handoff strategies. This prevents isolation of the mixed-precision allocator's contribution from other factors (model, task, injection path).

    Authors: We acknowledge that the manuscript provides only a high-level outline of the token-level adaptive bit allocation. We will expand the methods section (specifically the description of the allocation logic) to include the precise scoring function, importance metric, and bit-assignment rules used for each token. As the manuscript already flags the lack of extensive controller ablations, we will add a new subsection with preliminary runtime comparisons against uniform quantization and re-prefill baselines while noting that comprehensive apples-to-apples evaluations against alternative handoff strategies are left for future work to fully isolate the allocator's contribution. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical framework with direct measurements

full rationale

The manuscript introduces QKVShare as an engineering framework combining token-level mixed-precision KV allocation, CacheCard representation, and HuggingFace cache injection. All reported outcomes are direct experimental measurements: TTFT latency reductions versus full re-prefill (e.g., 130.7 ms vs 150.2 ms at 1K context) and qualitative competitiveness statements versus uniform quantization on 150 GSM8K problems. No equations, first-principles derivations, or predictions are claimed; the text contains no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations. The derivation chain is therefore self-contained as implementation plus benchmarking, with no reduction of results to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level CacheCard construct are detailed. Quantization bit choices and injection mechanics are likely implementation parameters but not enumerated.

invented entities (1)
  • CacheCard · no independent evidence
    purpose: self-contained representation for quantized KV-cache handoff
    Introduced as the core packaging format enabling injection without full re-prefill.

pith-pipeline@v0.9.0 · 5526 in / 1320 out tokens · 71031 ms · 2026-05-07T16:15:24.365602+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

    Ye, H., Gao, Z., Ma, M., et al. “KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems.” NeurIPS 2025. arXiv:2510.12872

  2. [2]

    Cache-to-Cache: Direct Semantic Communication Between Large Language Models

    Fu, T., Min, Z., Zhang, H., et al. “Cache-to-Cache: Direct Semantic Communication Between Large Language Models.” ICLR 2026. arXiv:2510.03215

  3. [3]

    LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

    Jeon, H., et al. “LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents.” arXiv:2602.01053, February 2026

  4. [4]

    DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

    Liu, Y., et al. “DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving.” NSDI

  5. [5]

    Latent Collaboration in Multi-Agent Systems (LatentMAS)

    “Latent Collaboration in Multi-Agent Systems (LatentMAS).” arXiv:2511.20639, November 2025

  6. [6]

    Latent Briefing: Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction

    Ramp Labs. “Latent Briefing: Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction.” 2026

  7. [7]

    Towards a Collaborative Memory for Agentic Workflow: Segment-Level KV Cache Sharing

    “Towards a Collaborative Memory for Agentic Workflow: Segment-Level KV Cache Sharing.” OpenReview, 2025

  8. [8]

    Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    “Don’t Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs.” arXiv:2604.04722, April 2026

  9. [9]

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Hooper, C., Kim, S., et al. “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.” NeurIPS 2024

  10. [10]

    ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

    He, Y., et al. “ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.” NeurIPS 2024

  11. [11]

    Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

    Shutova, A., et al. “Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models.” ICML 2025

  12. [12]

    Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

    “Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization.” ISCA 2025

  13. [13]

    MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts

    Tao, W., et al. “MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts.” ACL 2025

  14. [14]

    Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

    Shkolnikov, Y. P. “Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices.” arXiv:2603.04428, February 2026

  15. [15]

    MobiLoRA: Accelerating LoRA-Based LLM Inference on Mobile Devices via Context-Aware KV Cache Optimization

    Li, S., et al. “MobiLoRA: Accelerating LoRA-Based LLM Inference on Mobile Devices via Context-Aware KV Cache Optimization.” ACL 2025

  16. [16]

    Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

    Li, H., et al. “Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live.” arXiv:2511.02230

  17. [17]

    Model Context Protocol (MCP)

    Anthropic. “Model Context Protocol (MCP).” 2024

  18. [18]

    Agent-to-Agent (A2A) Protocol

    Google. “Agent-to-Agent (A2A) Protocol.” 2025