LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Hyeongju Ha; Hyesung Jeon; Jae-Joon Kim

arxiv: 2602.01053 · v2 · pith:IY7VYG57new · submitted 2026-02-01 · 💻 cs.LG

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Hyesung Jeon , Hyeongju Ha , Jae-Joon Kim This is my paper

classification 💻 cs.LG

keywords cacheagentssharingmulti-loracomponentlragentacrossadapter

0 comments

read the original abstract

Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, cache differences across agents are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents. It decomposes the cache into two components, a shared base component derived from pretrained weights and an adapter-dependent component derived from LoRA weights. LRAgent reduces memory overhead by sharing the base component across agents and storing the adapter component in its inherent low-rank form. It also reduces computational overhead by sharing the low-rank cache, enabled by a shared-A multi-LoRA architecture. This avoids redundant computations for contexts that have already been processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs
cs.AI 2026-05 unverdicted novelty 6.0

QKVShare enables efficient quantized KV-cache handoff for on-device multi-agent LLMs, cutting TTFT versus re-prefill across tested contexts while adaptive quantization stays competitive with uniform baselines on GSM8K.
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
cs.DC 2026-04 unverdicted novelty 6.0

ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
cs.DC 2026-04 unverdicted novelty 6.0

TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...