pith. machine review for the scientific record.

arxiv: 2605.03884 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.MA


QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs


Pith reviewed 2026-05-07 16:15 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords quantized KV cache · multi-agent LLMs · on-device inference · cache handoff · mixed-precision quantization · time-to-first-token · edge AI systems

The pith

QKVShare enables quantized KV-cache handoff between on-device LLM agents by using token-level mixed-precision allocation and a CacheCard format, reducing time-to-first-token versus full re-prefill while keeping accuracy competitive under repeated handoffs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM systems on edge devices must share latent context when one agent hands off to another, but current options force either expensive full re-prefill or costly full-precision KV cache transfer. QKVShare introduces a framework that applies token-level mixed-precision quantization to build a compact self-contained CacheCard and then injects it directly into the receiving agent's cache. Experiments on 150 GSM8K problems with an 8B instruct model show the approach lowers TTFT at every tested context length and maintains competitive quality even after repeated handoffs, with adaptive quantization pulling ahead of uniform quantization in deeper-hop higher-budget cases. If the method scales, multi-agent workflows on phones and similar devices could exchange state more fluidly without repeated full recomputation.

Core claim

QKVShare combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path to enable quantized KV-cache handoff. On 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher-budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re-prefill at every tested context, from 130.7 ms versus 150.2 ms at nominal 1K context to 397.1 ms versus 1029.7 ms at nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the QKVShare latency path.
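Taken at face value, the quoted TTFT figures translate into the following speedups (plain arithmetic on the numbers above, not additional measurements):

```python
# TTFT (ms) reported in the paper: (QKVShare handoff, full re-prefill).
ttft_ms = {"1K": (130.7, 150.2), "8K": (397.1, 1029.7)}

for ctx, (qkvshare, reprefill) in ttft_ms.items():
    speedup = reprefill / qkvshare
    saved = reprefill - qkvshare
    print(f"{ctx} context: {speedup:.2f}x faster, {saved:.1f} ms saved per handoff")
```

This works out to roughly 1.15x at 1K context and 2.59x at 8K: the gap widens with context length, consistent with re-prefill cost growing with the number of tokens recomputed.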

What carries the argument

QKVShare framework using token-level mixed-precision allocation to produce a CacheCard for handoff followed by direct injection into the receiving agent's KV cache
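The paper does not publish the allocator's scoring or bit-assignment rules (the referee flags this below), so the following is only a minimal sketch of the general shape such a pipeline could take: rank tokens by an importance score, give higher-ranked tokens more bits, quantize each token's KV rows symmetrically, and pack everything needed for dequantization into a self-contained dict standing in for a CacheCard. The function names, the thirds-based bit split, and the importance heuristic are all illustrative assumptions, not the authors' method.

```python
import numpy as np

def build_cachecard(keys, values, importance, bit_options=(2, 4, 8)):
    """Sketch: per-token bit allocation by importance, then symmetric quantization.

    keys/values: (num_tokens, head_dim) float arrays for one layer/head.
    importance: per-token scores (higher = more precision). Heuristic only.
    """
    num_tokens = keys.shape[0]
    # Rank tokens by importance; top third gets 8 bits, middle 4, bottom 2.
    order = np.argsort(-importance)
    bits = np.empty(num_tokens, dtype=int)
    for group, b in zip(np.array_split(order, 3), sorted(bit_options, reverse=True)):
        bits[group] = b

    card = {"bits": bits, "k": [], "v": [], "k_scale": [], "v_scale": []}
    for t in range(num_tokens):
        qmax = 2 ** (bits[t] - 1) - 1
        for name, row in (("k", keys[t]), ("v", values[t])):
            amax = np.abs(row).max()
            scale = amax / qmax if amax > 0 else 1.0
            q = np.clip(np.round(row / scale), -qmax - 1, qmax).astype(np.int8)
            card[name].append(q)
            card[name + "_scale"].append(scale)
    return card  # self-contained: quantized rows + scales + bit widths

def inject_cachecard(card):
    """Dequantize back to float rows, as the receiving agent's cache would see them."""
    keys = np.stack([q.astype(np.float64) * s for q, s in zip(card["k"], card["k_scale"])])
    values = np.stack([q.astype(np.float64) * s for q, s in zip(card["v"], card["v_scale"])])
    return keys, values
```

In a real system the dequantized rows would be written into the receiving model's per-layer KV cache (e.g. a HuggingFace cache object) rather than returned as arrays; that injection path is exactly the part the paper names HuggingFace-compatible.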

Load-bearing premise

Token-level mixed-precision allocation combined with CacheCard injection preserves enough model quality across repeated handoffs, even though the paper offers no extensive controller ablations and no direct runtime comparisons against other handoff strategies to support this.

What would settle it

A controlled test that runs ten successive handoffs on the same 150 GSM8K problems and records final accuracy more than five points below the no-handoff baseline would falsify the quality-preservation claim.
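As a toy illustration of why the multi-hop behavior needs measuring rather than assuming: under one simple scheme (symmetric per-row round-to-nearest quantization with the scale rebuilt from the current values — an assumption, not necessarily the authors' scheme), the quantize-dequantize pass is idempotent, so error appears on the first handoff and then stays flat rather than compounding.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """One symmetric, per-row, round-to-nearest quantize/dequantize pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / qmax
    return np.round(x / scale) * scale

def handoff_errors(cache, hops=10, bits=4):
    """Max absolute error vs. the original cache after each successive handoff."""
    state, errors = cache, []
    for _ in range(hops):
        state = quantize_dequantize(state, bits)
        errors.append(float(np.abs(state - cache).max()))
    return errors
```

With this fixed-grid scheme the error after hop 10 equals the error after hop 1, because re-quantizing already-quantized values maps them back onto the same grid. An adaptive controller that re-allocates bits per hop need not share this property, which is precisely what the proposed ten-hop accuracy test would probe.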

Figures

Figures reproduced from arXiv: 2605.03884 by Pratik Honavar, Tejpratap GVSL.

Figure 2. Representative E1 accuracy curves for FP16 sharing, uniform quantization, and topology
Original abstract

Multi-agent LLM systems on edge devices need to hand off latent context efficiently, but the practical choices today are expensive re-prefill or full-precision KV transfer. We study QKVShare, a framework for quantized KV-cache handoff between agents that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. Our current results support a narrower but clearer story than the original draft: on 150 GSM8K problems with Llama-3.1-8B-Instruct, adaptive quantization remains competitive under repeated handoff and shows its clearest gains against uniform quantization in deeper-hop, higher budget settings; for handoff latency, the QKVShare path reduces TTFT relative to full re prefill at every tested context, from 130.7 ms vs. 150.2 ms at nominal 1K context to 397.1 ms vs. 1029.7 ms at nominal 8K context. Stage timing shows that post-injection generation, not card creation, dominates the current QKVShare latency path. These results position quantized KV handoff as a promising on-device systems direction while also highlighting the need for stronger controller ablations and apples-to-apples runtime comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes QKVShare, a framework for efficient quantized KV-cache handoff between agents in multi-agent on-device LLM systems. It integrates token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible injection path. On 150 GSM8K problems with Llama-3.1-8B-Instruct, the work reports that adaptive quantization remains competitive under repeated handoffs (with clearest gains vs. uniform quantization in deeper-hop, higher-budget regimes) while reducing time-to-first-token (TTFT) relative to full re-prefill across contexts (e.g., 130.7 ms vs. 150.2 ms at 1K; 397.1 ms vs. 1029.7 ms at 8K), with post-injection generation dominating latency.

Significance. If the quality-preservation claim holds after adding the missing metrics, QKVShare could meaningfully advance practical multi-agent LLM deployments on edge devices by enabling low-latency context sharing without full re-prefill or high-precision KV transfer. The concrete TTFT comparisons and narrower empirical framing are positive; the work also correctly flags the need for stronger controller ablations.

major comments (2)
  1. [Evaluation results] Evaluation on 150 GSM8K problems: the central claim that 'adaptive quantization remains competitive under repeated handoff' is not supported by per-condition accuracy numbers, degradation deltas, or variance across hop counts and budgets. Only a qualitative statement is given, leaving the quality-preservation half of the contribution unquantified and load-bearing for the overall result.
  2. [Methods / adaptive allocation] Controller and allocation description: no details are provided on how the token-level adaptive bit allocation is decided, and the manuscript itself notes the absence of extensive controller ablations or apples-to-apples runtime comparisons against other handoff strategies. This prevents isolation of the mixed-precision allocator's contribution from other factors (model, task, injection path).
minor comments (3)
  1. [Abstract] Abstract: 're prefill' should be hyphenated as 're-prefill' for consistency with the rest of the text.
  2. [Latency evaluation] The latency results lack error bars, number of runs, or statistical tests, which would strengthen the reported TTFT reductions.
  3. [Results discussion] Post-hoc emphasis on deeper-hop/higher-budget settings for the clearest gains should be accompanied by a pre-specified analysis plan or full cross-condition table to avoid selection concerns.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of QKVShare to advance practical multi-agent LLM deployments on edge devices. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.

read point-by-point responses
  1. Referee: [Evaluation results] Evaluation on 150 GSM8K problems: the central claim that 'adaptive quantization remains competitive under repeated handoff' is not supported by per-condition accuracy numbers, degradation deltas, or variance across hop counts and budgets. Only a qualitative statement is given, leaving the quality-preservation half of the contribution unquantified and load-bearing for the overall result.

    Authors: We agree that the quality-preservation claim requires stronger quantitative backing beyond the qualitative statement in the current draft. In the revised manuscript we will add a dedicated results table presenting per-condition accuracy for adaptive versus uniform quantization across hop counts (1-5) and budget regimes, including degradation deltas relative to the full-precision baseline and standard deviations over the 150 GSM8K problems. These data are available from our existing experiments and will directly quantify the clearest gains in deeper-hop, higher-budget settings. revision: yes

  2. Referee: [Methods / adaptive allocation] Controller and allocation description: no details are provided on how the token-level adaptive bit allocation is decided, and the manuscript itself notes the absence of extensive controller ablations or apples-to-apples runtime comparisons against other handoff strategies. This prevents isolation of the mixed-precision allocator's contribution from other factors (model, task, injection path).

    Authors: We acknowledge that the manuscript provides only a high-level outline of the token-level adaptive bit allocation. We will expand the methods section (specifically the description of the allocation logic) to include the precise scoring function, importance metric, and bit-assignment rules used for each token. As the manuscript already flags the lack of extensive controller ablations, we will add a new subsection with preliminary runtime comparisons against uniform quantization and re-prefill baselines while noting that comprehensive apples-to-apples evaluations against alternative handoff strategies are left for future work to fully isolate the allocator's contribution. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical framework with direct measurements

full rationale

The manuscript introduces QKVShare as an engineering framework combining token-level mixed-precision KV allocation, CacheCard representation, and HuggingFace cache injection. All reported outcomes are direct experimental measurements: TTFT latency reductions versus full re-prefill (e.g., 130.7 ms vs 150.2 ms at 1K context) and qualitative competitiveness statements versus uniform quantization on 150 GSM8K problems. No equations, first-principles derivations, or predictions are claimed; the text contains no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations. The derivation chain is therefore self-contained as implementation plus benchmarking, with no reduction of results to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level CacheCard construct are detailed. Quantization bit choices and injection mechanics are likely implementation parameters but not enumerated.

invented entities (1)
  • CacheCard · no independent evidence
    purpose: self-contained representation for quantized KV-cache handoff
    Introduced as the core packaging format enabling injection without full re-prefill.

pith-pipeline@v0.9.0 · 5526 in / 1320 out tokens · 71031 ms · 2026-05-07T16:15:24.365602+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

    Ye, H., Gao, Z., Ma, M., et al. “KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems.” NeurIPS 2025. arXiv:2510.12872

  2. [2]

    Cache-to-Cache: Direct Semantic Communication Between Large Language Models

    Fu, T., Min, Z., Zhang, H., et al. “Cache-to-Cache: Direct Semantic Communication Between Large Language Models.” ICLR 2026. arXiv:2510.03215

  3. [3]

    LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

    Jeon, H., et al. “LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents.” arXiv:2602.01053, February 2026

  4. [4]

    DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

    Liu, Y., et al. “DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving.” NSDI

  5. [5]

    Latent Collaboration in Multi-Agent Systems (LatentMAS)

    “Latent Collaboration in Multi-Agent Systems (LatentMAS).” arXiv:2511.20639, November 2025

  6. [6]

    Latent Briefing: Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction

    Ramp Labs. “Latent Briefing: Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction.” 2026

  7. [7]

    Towards a Collaborative Memory for Agentic Workflow: Segment-Level KV Cache Sharing

    “Towards a Collaborative Memory for Agentic Workflow: Segment-Level KV Cache Sharing.” OpenReview, 2025

  8. [8]

    Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    “Don’t Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs.” arXiv:2604.04722, April 2026

  9. [9]

    KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

    Hooper, C., Kim, S., et al. “KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.” NeurIPS 2024

  10. [10]

    ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

    He, Y., et al. “ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.” NeurIPS 2024

  11. [11]

    Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

    Shutova, A., et al. “Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models.” ICML 2025

  12. [12]

    Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

    “Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization.” ISCA 2025

  13. [13]

    MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts

    Tao, W., et al. “MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts.” ACL 2025

  14. [14]

    Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

    Shkolnikov, Y. P. “Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices.” arXiv:2603.04428, February 2026

  15. [15]

    MobiLoRA: Accelerating LoRA-Based LLM Inference on Mobile Devices via Context-Aware KV Cache Optimization

    Li, S., et al. “MobiLoRA: Accelerating LoRA-Based LLM Inference on Mobile Devices via Context-Aware KV Cache Optimization.” ACL 2025

  16. [16]

    Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

    Li, H., et al. “Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live.” arXiv:2511.02230

  17. [17]

    Model Context Protocol (MCP)

    Anthropic. “Model Context Protocol (MCP).” 2024

  18. [18]

    Agent-to-Agent (A2A) Protocol

    Google. “Agent-to-Agent (A2A) Protocol.” 2025