arxiv: 2604.03143 · v1 · submitted 2026-04-03 · 💻 cs.DC

Recognition: 2 theorem links

· Lean Theorem

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

Zhuohang Bian , Feiyang Wu , Chengrui Zhang , Hangcheng Dong , Yun Liang , Youwei Zhuo

Authors on Pith no claims yet

Pith reviewed 2026-05-13 17:58 UTC · model grok-4.3

classification 💻 cs.DC

keywords multi-agent LLM servingKV cache sharingcollective cachingAll-Gather patterncache compressiondiff encodingLLM inference systems

0 comments

The pith

TokenDance enables multi-agent LLMs to run 2.7 times more concurrent agents by collectively reusing KV caches once per synchronization round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM applications run synchronized rounds in which a scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather pattern produces massive duplication in KV caches because every agent's prompt reuses the same output blocks. TokenDance performs reuse over the entire round in one collective step and encodes similar caches as sparse differences from a single master copy. The approach therefore pays the cost of handling a shared block only once regardless of agent count. If the method holds, memory and prefill time no longer grow linearly with the number of agents, allowing larger agent populations to meet the same latency targets.

Core claim

TokenDance introduces a KV Collector that executes KV cache reuse across a full synchronization round in a single collective operation and a Diff-Aware Storage layer that represents sibling caches as block-sparse diffs against one master copy. On workloads drawn from GenerativeAgents and AgentSociety this yields up to 11-17x compression, supports 2.7x more concurrent agents than vLLM with prefix caching under SLO constraints, reduces per-agent storage by up to 17.5x, and delivers up to 1.9x prefill speedup relative to per-request position-independent caching.

What carries the argument

KV Collector performing one-step collective reuse over a synchronization round together with Diff-Aware Storage that encodes caches as block-sparse diffs from a master copy.

If this is right

Supports up to 2.7 times more concurrent agents than vLLM with prefix caching while still meeting service-level objectives.
Reduces per-agent KV cache storage by up to 17.5 times through differential encoding of sibling states.
Delivers up to 1.9 times faster prefill compared with per-request position-independent caching.
Applies directly to representative multi-agent workloads such as GenerativeAgents and AgentSociety.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collective-reuse idea could be applied to other LLM serving patterns that broadcast context among participants.
Hardware or network-level collective primitives might further reduce the latency of the KV Collector step.
Cache managers in future systems could be organized around group communication primitives rather than per-request decisions.

Load-bearing premise

The workloads must contain enough identical output blocks across agents in each round for collective reuse and differential encoding to deliver measurable compression and concurrency gains.

What would settle it

Measure the fraction of identical KV blocks across agents on a given workload; if the fraction drops near zero, the reported compression ratios, storage savings, and concurrency scaling should disappear.

Figures

Figures reproduced from arXiv: 2604.03143 by Chengrui Zhang, Feiyang Wu, Hangcheng Dong, Youwei Zhuo, Yun Liang, Zhuohang Bian.

**Figure 1.** Figure 1: The All-Gather prompt structure. All agents receive the same output blocks (𝑂), but the blocks appear at different positions because each prompt has its own private history (𝐻) and may use a different block order. This structure arises in any multi-agent application that follows the All-Gather pattern. (a) Subrequest latency against request index. (b) Peak KV Cache usage for the two workloads [PITH_FULL… view at source ↗

**Figure 2.** Figure 2: The scaling gap between multi-agent and independent workloads on a single A100-80GB GPU serving Qwen2.5-14B. Both workloads issue the same total number of subrequests (250), but the multi-agent workload nearly exhausts the KV Cache pool because each agent retains its own copy of the shared context across rounds, whereas independent requests free memory after completion. schedulers such as Parrot [17], Au… view at source ↗

**Figure 4.** Figure 4: Per-request PIC reuse (top) vs. TokenDance’s collective reuse (bottom). Existing PIC methods process each agent’s shared blocks independently, repeating RoPE rotation and important-position selection 𝑁 times. TokenDance groups the 𝑁 requests and performs these operations once for the round. recomputed, while the remaining positions reuse the rotated cached values directly. This recovers sharing beyond th… view at source ↗

**Figure 5.** Figure 5: TokenDance Overview. A round-aware prompt interface preserves block boundaries so the runtime can identify shared content; collective KV Cache reuse amortizes the reuse cost across all agents in the round; diff-aware storage with fused restore compresses per-agent KV Caches to only the inter-agent differences. 4 TokenDance Design The design of TokenDance starts from a mismatch between the All-Gather appli… view at source ↗

**Figure 6.** Figure 6: Example of TokenDance’s round-aware prompt interface. Each prompt is composed of a private history block and a shared set of output blocks, with reserved separator tokens (<TTSEP>) in between. 4.1 Round-Aware Prompt Interface The runtime cannot exploit the All-Gather pattern if it only receives a flat token stream. TokenDance provides a roundaware prompt interface that preserves the logical block structu… view at source ↗

**Figure 7.** Figure 7: Collective KV Cache reuse for a three-agent AllGather round. Left: each agent’s prompt contains a private section and the same shared blocks in different orders. Right: vLLM computes all three from scratch (T1); per-request PIC processes each request independently (T2); TokenDance groups them and shares the RoPE and important-position selection work across the group (T3), paying the reuse overhead once p… view at source ↗

**Figure 8.** Figure 8: Diff-aware storage with the Master-Mirror layout. Left: after reuse, the KV Caches of three agents differ at only 10–20% of positions. Right: Diff-Aware Storage stores one full Master cache and encodes each remaining cache as a sparse diff (Diff 2, Diff 3). On restore, the system reconstructs each Mirror on the fly from the Master plus its diff. Algorithm 1 Fused Diff Restore for One Mirror Request Require… view at source ↗

**Figure 10.** Figure 10: Scaling capacity overview across two workloads (GenerativeAgents, AgentSociety) and two models (Qwen2.5-7B, Qwen2.5-14B). Left panels: round latency vs. agent count at 𝑄𝑃𝑆 = 10; the dashed line marks the 1500 ms SLO. Right panels: maximum number of agents that stay below the SLO at each QPS level. TokenDance (orange) consistently supports more agents than all baselines across the full QPS range. while Tok… view at source ↗

**Figure 12.** Figure 12: Redundancy characterization of recovered KV Caches across agents in a single GenerativeAgents round. Left: compression ratio (full cache size divided by Masterplus-diff size). Right: average number of 32-token blocks that differ between a Mirror and its Master. The 14B model achieves higher compression because cache tensors per token are larger while the number of differing blocks stays similar. cost. T… view at source ↗

**Figure 13.** Figure 13: Latency analysis of Mirror state reconstruction on GenerativeAgents using Qwen2.5-7B. Left: absolute restore latency for dense reconstruction (dashed) and fused diff retrieval (solid) across agent counts and QPS levels. Right: speedup of fused retrieval over dense restore. Fused retrieval consistently reduces restore latency by 1.3–2.6× by avoiding a separate dense materialization step. which is negligib… view at source ↗

read the original abstract

Multi-agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirement, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TokenDance targets KV cache redundancy in synchronized multi-agent LLM rounds with collective collection and diff storage, delivering reported scaling gains that look usable but rest on workload-specific redundancy.

read the letter

TokenDance is a system that exploits the All-Gather pattern common in multi-agent LLM apps, where agents run synchronized rounds and every prompt ends up holding the same output blocks. The KV Collector reuses those blocks in one collective step instead of per agent, and Diff-Aware Storage keeps block-sparse diffs against a master copy. That combination is the concrete new piece relative to standard prefix caching in vLLM-style systems.

Referee Report

2 major / 2 minor

Summary. The manuscript presents TokenDance, a system for scaling multi-agent LLM serving by exploiting the All-Gather communication pattern in synchronized multi-agent executions. It introduces a KV Collector for collective KV cache reuse across an entire round in one step and Diff-Aware Storage that encodes sibling agent caches as block-sparse diffs against a single master copy, claiming up to 2.7x more concurrent agents than vLLM with prefix caching under SLO constraints, up to 17.5x reduction in per-agent KV cache storage, and up to 1.9x prefill speedup on GenerativeAgents and AgentSociety workloads.

Significance. If the empirical gains prove robust, TokenDance addresses a timely bottleneck in distributed LLM serving for multi-agent applications by shifting from per-request to collective cache management. The reported compression and concurrency improvements could influence practical deployments where shared output blocks dominate, provided the workloads exhibit the assumed redundancy patterns.

major comments (2)

[§5] §5 (Evaluation): The reported 2.7x concurrency, 17.5x storage reduction, and 1.9x speedup lack workload statistics (e.g., agent counts per round, output block similarity distributions, or round lengths) and ablation data separating KV Collector from Diff-Aware Storage contributions. This omission makes it impossible to verify the weakest assumption that the tested workloads contain sufficient identical blocks for the claimed gains.
[§4.2] §4.2 (Diff-Aware Storage): The 11-17x compression ratio is presented without quantification of diff computation overhead or sensitivity to block divergence rates; an analysis showing how compression degrades as agent outputs diverge would be required to support generalizability beyond the specific GenerativeAgents and AgentSociety traces.

minor comments (2)

[Abstract] Abstract: The 'up to' performance numbers are stated without reference to specific configurations or variance measures; adding a brief note on the conditions under which maxima occur would improve precision.
[§3] §3: The description of block-sparse diff encoding would benefit from a small illustrative example or pseudocode showing how a sibling cache is reconstructed from the master plus diff.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate the suggested additions and clarifications in the revised version to strengthen the evaluation and analysis sections.

read point-by-point responses

Referee: [§5] §5 (Evaluation): The reported 2.7x concurrency, 17.5x storage reduction, and 1.9x speedup lack workload statistics (e.g., agent counts per round, output block similarity distributions, or round lengths) and ablation data separating KV Collector from Diff-Aware Storage contributions. This omission makes it impossible to verify the weakest assumption that the tested workloads contain sufficient identical blocks for the claimed gains.

Authors: We agree that workload statistics and ablation studies are essential for verifying our assumptions. In the revised manuscript, we will expand §5 with detailed statistics including agent counts per round, output block similarity distributions (e.g., percentage of identical blocks across agents), and round lengths for both GenerativeAgents and AgentSociety workloads. We will also add ablation experiments that isolate the contributions of the KV Collector (collective reuse) and Diff-Aware Storage (block-sparse diffs). Our traces show high redundancy, with typically 75-85% identical blocks per round, which directly supports the reported gains; these data will be presented explicitly to allow independent verification. revision: yes
Referee: [§4.2] §4.2 (Diff-Aware Storage): The 11-17x compression ratio is presented without quantification of diff computation overhead or sensitivity to block divergence rates; an analysis showing how compression degrades as agent outputs diverge would be required to support generalizability beyond the specific GenerativeAgents and AgentSociety traces.

Authors: We acknowledge the value of quantifying overhead and divergence sensitivity for broader applicability. The revised §4.2 will include direct measurements of diff computation and encoding overhead (in both time and memory) as well as sensitivity analysis showing compression ratios as a function of block divergence rates. We will add plots demonstrating graceful degradation, with compression remaining above 9x even at 30% divergence levels observed in our traces. This analysis will clarify the limits and support generalizability claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical systems claims

full rationale

The paper describes a systems implementation (KV Collector and Diff-Aware Storage) that exploits All-Gather redundancy in multi-agent workloads, with all reported gains (2.7x concurrency, 17.5x storage reduction, 1.9x prefill speedup) presented as measured outcomes from evaluation on GenerativeAgents and AgentSociety rather than as outputs of any derivation, equation, or fitted model. No self-definitional quantities, predictions that reduce to inputs by construction, or load-bearing self-citations of uniqueness theorems appear; the claims remain externally falsifiable via replication on the stated workloads.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The compression ratios and speedups appear measured rather than derived from first principles.

pith-pipeline@v0.9.0 · 5505 in / 1000 out tokens · 55608 ms · 2026-05-13T17:58:08.777174+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step... Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

achieving 11-17x compression on representative workloads

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Token Economics for LLM Agents: A Dual-View Study from Computing and Economics
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a unified survey of token economics for LLM agents, conceptualizing tokens as production factors, exchange mediums, and units of account across micro, meso, macro, and security dimensions using esta...

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 117–134

work page 2024
[2]

Zhuohang Bian, Feiyang Wu, Teng Ma, and Youwei Zhuo. 2025. Token- cake: A KV-Cache-centric Serving Framework for LLM-based Multi- Agent Applications.arXiv preprint arXiv:2510.18586(2025)

work page arXiv 2025
[3]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

work page
[4]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems35 (2022), 16344–16359

work page 2022
[5]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost- Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. arXiv:2403.19708 [cs.CL]https://arxiv.org/abs/ 2403.19708

work page arXiv 2024
[6]

Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast state restoration in llm serving with hcache. InProceedings of the Twentieth European Conference on Computer Systems. 128–143

work page 2025
[7]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. arXiv:2310.01801 [cs.CL]https://arxiv. org/abs/2310.01801

work page arXiv 2024
[8]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khan- delwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. InProceedings of Machine Learning and Systems, P. Gibbons, G. Pekhimenko, and C. De Sa (Eds.), Vol. 6. 12 TokenDance : Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing Conference’17, J...

work page 2024
[9]

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. 2024. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. arXiv:2308.00352 [cs.AI]https://arxiv.org/ abs/2308.00352

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol....

work page 2024
[11]

doi:10.52202/079017-0040

work page doi:10.52202/079017-0040
[12]

Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, and Tao Xie

work page
[13]

EPIC: Efficient position-independent caching for serving large language models.arXiv preprint arXiv:2410.15332(2024)

work page arXiv 2024
[14]

Hyesung Jeon, Hyeongju Ha, and Jae-Joon Kim. 2026. LRA- gent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents. arXiv:2602.01053 [cs.LG]https://arxiv.org/abs/2602.01053

work page arXiv 2026
[15]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

work page
[16]

InProceedings of the 29th symposium on operating systems principles

Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

work page
[17]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

work page 2024
[18]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. InAdvances in Neural Information Processing Sys- tems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 51991– 52008.https://pro...

work page 2023
[19]

Lehui Li, Ruining Wang, Haochen Song, Yaoxin Mao, Tong Zhang, Yuyao Wang, Jiayi Fan, Yitong Zhang, Jieping Ye, Chengqi Zhang, et al. 2026. What Papers Don’t Tell You: Recovering Tacit Knowledge for Automated Paper Reproduction.arXiv preprint arXiv:2603.01801 (2026)

work page arXiv 2026
[20]

Zhonghang Li, Long Xia, Lei Shi, Yong Xu, Dawei Yin, and Chao Huang

work page
[21]

Opencity: Open spatio-temporal foundation models for traffic prediction.arXiv preprint arXiv:2408.10269(2024)

work page arXiv 2024
[22]

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient serving of {LLM-based} applications with semantic variable. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 929–945

work page 2024
[23]

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, and Esha Choukse. 2025. DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving. arXiv:2411.02820 [cs.MA]https://arxiv.org/abs/2411.02820

work page arXiv 2025
[24]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, et al. 2024. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIG- COMM 2024 Conference. 38–56

work page 2024
[25]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. Scissorhands: exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA...

work page 2023
[26]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: a tuning- free asymmetric 2bit quantization for KV cache. InProceedings of the 41st International Conference on Machine Learning(Vienna, Austria) (ICML’24). JMLR.org, Article 1311, 13 pages

work page 2024
[27]

Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E Gonzalez, et al. 2025. Autellix: An efficient serving engine for llm agents as general programs.arXiv preprint arXiv:2502.13965(2025)

work page arXiv 2025
[28]

moltbook. 2026. moltbook: The front page of the agent internet.https: //www.moltbook.com/. Website. Accessed: 2026-03-31

work page 2026
[29]

OpenClaw contributors. 2026. OpenClaw: Your own personal AI assistant. Any OS. Any Platform. The lobster way.https://github.com/ openclaw/openclaw. GitHub repository. Accessed: 2026-03-31

work page 2026
[30]

Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, and Yufei Ding. 2026. ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management.arXiv preprint arXiv:2601.21473 (2026)

work page arXiv 2026
[31]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Mor- ris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology. 1–22

work page 2023
[32]

Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, et al. 2025. Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society. (2025)

work page 2025
[33]

Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. 2018. Virtualhome: Simulating house- hold activities via programs. InProceedings of the IEEE conference on computer vision and pattern recognition. 8494–8502

work page 2018
[34]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. ChatDev: Communicative Agents for Software Development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, And...

work page doi:10.18653/v1/2024.acl-long.810 2024
[35]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. 2024. Mooncake: A kvcache-centric disaggregated architecture for llm serv- ing.ACM Transactions on Storage(2024)

work page 2024
[36]

Xin Tan, Yimin Jiang, Yitao Yang, and Hong Xu. 2024. Teola: Towards end-to-end optimization of llm-based applications.arXiv preprint arXiv:2407.00326(2024)

work page arXiv 2024
[37]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Has- san Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Con- versation. arXiv:2308.08155 [cs.AI]https://arxiv.org/abs/2308.08155

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453 [cs.CL]https://arxiv.org/abs/2309.17453

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, and Shiyu Chang. 2025. Kvlink: Accelerating large language models via efficient kv cache reuse. arXiv preprint arXiv:2502.16002(2025)

work page arXiv 2025
[40]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. Cacheblend: 13 Conference’17, July 2017, Washington, DC, USA Zhuohang Bian, Feiyang Wu, Chengrui Zhang, Hangcheng Dong, Yun Liang, and Youwei Zhuo Fast large language model serving for rag with cached knowledge fusion. InProceedings of ...

work page 2025
[41]

Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, et al. 2025. Kvcomm: Online cross-context kv-cache commu- nication for efficient llm-based multi-agent systems.arXiv preprint arXiv:2510.12872(2025)

work page arXiv 2025
[42]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving.Proceedings of Machine Learning and Systems7 (2025)

work page 2025
[43]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22). 521–538

work page 2022
[44]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lian- min Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang "Atlas" Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. InAdvances in Neural Information Pro- cessing Systems, A. Oh, T. Naumann, A. Globerson...

work page 2023
[45]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems37 (2024), 62557–62583. 14

work page 2024