pith. sign in

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it
abstract

Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that rely on external function calls. This workload creates severe performance challenges for the KV Cache: spatial contention leads to the eviction of critical agents' caches and temporal underutilization leaves the cache of agents stalled on long-running function calls idling in GPU memory. We present TokenCake, a KV-Cache-centric serving framework that bridges this gap by co-optimizing scheduling and memory management through an agent-aware design. TokenCake's Temporal Scheduler employs an event-driven, opportunistic policy to proactively offload idle KV Caches during function calls and uses predictive uploading to hide data transfer latency. TokenCake's Spatial Scheduler uses dynamic memory partitioning, guided by a hybrid priority metric combining graph structure and runtime state, to reserve GPU memory for critical-path agents. Our evaluation on representative multi-agent benchmarks shows that TokenCake reduces end-to-end latency by over 47.06% and improves effective GPU memory utilization by up to 16.9% compared to vLLM.

citation-role summary

background 3

citation-polarity summary

fields

cs.DC 3 cs.AI 2

years

2026 5

verdicts

UNVERDICTED 5

roles

background 3

polarities

background 3

clear filters

representative citing papers

A Policy-Driven Runtime Layer for Agentic LLM Serving

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency.

citing papers explorer

Showing 5 of 5 citing papers after filters.

  • A Policy-Driven Runtime Layer for Agentic LLM Serving cs.AI · 2026-05-26 · unverdicted · none · ref 2 · internal anchor

    Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.

  • Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling cs.AI · 2026-04-19 · unverdicted · none · ref 5 · internal anchor

    Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

  • Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines cs.DC · 2026-04-16 · unverdicted · none · ref 3 · internal anchor

    Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency.

  • ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache cs.DC · 2026-04-07 · unverdicted · none · ref 6 · internal anchor

    ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

  • TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing cs.DC · 2026-04-03 · unverdicted · none · ref 2 · internal anchor

    TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.