CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Chunming Hu; Dinghao Xue; Mingming Zhang; Renyu Yang; Tianyu Wo; Yuchen Teng; Zhuoren Ye

arxiv: 2606.24506 · v1 · pith:WSGHWPLFnew · submitted 2026-06-23 · 💻 cs.DC · cs.AI· cs.LG· cs.PF

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Zhuoren Ye , Tianyu Wo , Dinghao Xue , Mingming Zhang , Yuchen Teng , Chunming Hu , Renyu Yang This is my paper

Pith reviewed 2026-06-25 22:50 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LGcs.PF

keywords LLM servingMoEKV cacheGPU memorydisaggregationcold modelsmulti-LLM

0 comments

The pith

Disaggregating FFN weights and KV-cache into separate GPU pools lets CrossPool serve many cold MoE models with up to 10.4× lower P99 time-between-tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cold MoE models, which receive sparse requests, waste GPU memory when each reserves worst-case KV-cache capacity. A shared KV-cache pool can meet aggregate demand, but monolithic memory pools cause weights to compete with KV-cache and limit attention efficiency under low concurrency. CrossPool solves this by creating a weights pool for consolidated FFN weights across models and a separate KV-cache pool that keeps attention local. It adds a planner-virtualizer, layer-wise pipeline scheduler to hide transfers, and persistent kernels to cut control overhead. This setup supports bursty long-context traffic more efficiently than prior kvcache-based systems.

Core claim

CrossPool is a serving engine that disaggregates FFN weights into a consolidation pool and KV-cache into a dynamic pool for cold MoE models. It uses a KV-cache planner and virtualizer, a layer-wise pipeline scheduler, and persistent kernels with control lowering to achieve high GPU memory utilization and reduce P99 TBT by up to 10.4 times versus state-of-the-art systems.

What carries the argument

The separation of FFN weights and KV-cache into two distinct GPU memory pools, with attention computation localized to the KV-cache pool.

If this is right

Multiple cold models can share KV-cache capacity without per-model worst-case allocation.
Bursty long-context requests can be handled without latency violations from memory contention.
Hidden-state transfers between pools are hidden by the pipeline scheduler.
CPU-GPU control overhead is reduced through persistent kernels.
Overall, P99 TBT improves by up to 10.4× over monolithic approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The disaggregation technique could apply to dense models if similar memory competition occurs.
Cloud providers might adopt this to increase tenant density in LLM serving clusters.
Future hardware with faster inter-pool transfers could amplify the benefits.

Load-bearing premise

That peak KV-cache demands across cold models do not coincide, so aggregate provisioning suffices without missing per-request latency targets.

What would settle it

A workload trace where several cold MoE models simultaneously hit their maximum KV-cache usage, causing the shared pool to exceed capacity and increase P99 TBT beyond baseline levels.

Figures

Figures reproduced from arXiv: 2606.24506 by Chunming Hu, Dinghao Xue, Mingming Zhang, Renyu Yang, Tianyu Wo, Yuchen Teng, Zhuoren Ye.

**Figure 1.** Figure 1: Cold-model underutilization and accumulated KVcache usage under low RPS. Subfigure (a) summarizes data from OpenRouter [32], and (b) stacks active KV-cache bytes for four 7B models at 0.2 RPS over one hour. • KV-cache planner and virtualizer. CrossPool plans the shared KV-cache pool budget and parallelism offline, then exposes the pool through virtualized paging [28]. • Layer-wise pipeline scheduler. Cros… view at source ↗

**Figure 2.** Figure 2: KV-cache availability when serving a single request on 4 GPUs. Comparison of monolithic and disaggregated memory pools for weights and KV-cache. n_heads values of MHA, GQA and MQA are 4, 2 and 1, respectively. the pool to a high percentile of aggregate demand instead of the worst-case load for each model. 2.2 Mismatch between Algorithms and Systems Recent LLMs use diverse attention algorithms (e.g., GQA … view at source ↗

**Figure 3.** Figure 3: CrossPool system architecture every pool crossing, for every layer, and for every generated token. Even though each transfer is much smaller than moving KV-cache tensors, the repeated transfers accumulate into non-negligible communication overhead. C3: Increased graph capture complexity under mixed scheduling. Modern serving engines rely on CUDA graph capture to reduce launch overhead, but disaggregated e… view at source ↗

**Figure 4.** Figure 4: Layer-wise pipeline scheduler. It interleaves attention and FFN layers of two batches, allowing attention and FFN to be executed simultaneously on different batches from different models. Early exit is supported when one batch finishes all its layers. CUDA VMM APIs [28] to reserve a virtual KV address range for each model and map physical KV pages on demand. Attention operators see a normal paged KV-cach… view at source ↗

**Figure 5.** Figure 5: Persistent kernels for efficient graph execution. across the pool boundary. CrossPool then uses persistent kernels and control lowering to keep frequent scheduling and communication control on GPUs, reducing host intervention and CPU-GPU control transitions. Fig. 5b illustrates the design. CrossPool captures supported attention and FFN subgraphs during warmup and passes their graph handles to GPU-residen… view at source ↗

**Figure 6.** Figure 6: Maximum aggregate request rate estimated from sampled LongAlign context-length bins within each model’s nominal context window; vertical drops mark per-system capacity limits. Prompts are truncated to maximum context length of the two models [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Decode-side TBT on ShareGPT traces from 0.2 to 1.0 RPS per model. 5.2 Context-length Scalability [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

read the original abstract

Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to $10.4\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CrossPool disaggregates weights from KV-cache for cold MoE serving and reports a large latency win, but the non-overlapping peak assumption is the part that still needs evidence.

read the letter

The paper's main move is to split static FFN weights into one GPU pool and dynamic KV-cache into another for serving many cold MoE models at once. They add a planner/virtualizer, layer-wise pipelining to hide transfers, and persistent kernels to cut control overhead. The result is a shared KV pool that only needs to cover aggregate active demand rather than per-model worst case, with a claimed 10.4x drop in P99 time-between-tokens versus prior kvcache systems.

The motivation holds up: cold models with sparse traffic do waste memory when each one reserves its own KV headroom, and separating the pools lets the transient part be shared. The scheduling pieces are straightforward engineering to make the split workable without blowing up latency. If the measurements are clean and the code is available, this is the sort of targeted systems tweak that could matter for operators running lots of low-concurrency MoE instances.

The soft spot is exactly the one the stress-test flags. The gains rest on cold models rarely hitting peak KV demand together; if bursty long-context traffic causes overlap, the shared pool becomes the bottleneck and the extra machinery cannot create capacity that isn't there. The abstract does not show ablations on concurrent spikes or workload traces, so it is hard to tell whether the 10.4x number is robust or tied to particular request patterns. Prior KV-sharing work is cited, so the contribution is the specific combination rather than the idea itself.

This is for people who run multi-model LLM serving stacks. A systems reader who cares about memory efficiency under sparse MoE traffic would get concrete architecture details worth looking at. It has enough of a system and a measurable claim to go to peer review, though the referees will want to see the evaluation details and any tests against overlapping load.

Referee Report

2 major / 1 minor

Summary. The paper presents CrossPool, a serving system for multiple cold MoE LLMs that disaggregates FFN weights into a consolidated weights pool and KV-cache into a dynamic shared pool, augmented by a KV-cache planner/virtualizer, layer-wise pipeline scheduler to hide transfers, and persistent kernels with control lowering. It claims this enables efficient support for bursty long-context requests and delivers up to 10.4× reduction in P99 time-between-tokens versus the state-of-the-art kvcache-based multi-LLM serving system.

Significance. If the empirical gains prove robust, the disaggregation approach could meaningfully improve GPU memory utilization in multi-tenant LLM serving for sparsely accessed models, particularly under variable long-context workloads. The work provides concrete system mechanisms (planner, virtualizer, scheduler, kernels) that address a practical tension between static weights and dynamic KV-cache.

major comments (2)

[Abstract] Abstract (motivation paragraph): the 10.4× P99 TBT claim and the ability to 'provision only aggregate active demand' rest on the premise that cold models rarely reach peak KV-cache demand simultaneously. No section, figure, or table in the manuscript provides a direct measurement or stress-test of peak-overlap frequency under the evaluated bursty long-context workloads; without this, the shared-pool benefit cannot be distinguished from workload selection effects.
[Evaluation] The layer-wise pipeline scheduler and persistent kernels are presented as compensating for hidden-state transfers and control overhead, yet the manuscript does not quantify the residual latency when the KV-cache pool itself becomes contended (i.e., when aggregate demand exceeds provisioned capacity). A load-bearing experiment comparing contended vs. non-contended regimes is required to substantiate that the scheduler can still meet per-request latency targets.

minor comments (1)

Notation for the two pools (weights pool vs. KV-cache pool) and the virtualizer interface should be defined once with consistent symbols rather than repeated descriptive phrases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The major comments point to gaps in our evaluation that we will address through additional analysis and experiments in the revised manuscript. We respond to each comment below.

read point-by-point responses

Referee: [Abstract] Abstract (motivation paragraph): the 10.4× P99 TBT claim and the ability to 'provision only aggregate active demand' rest on the premise that cold models rarely reach peak KV-cache demand simultaneously. No section, figure, or table in the manuscript provides a direct measurement or stress-test of peak-overlap frequency under the evaluated bursty long-context workloads; without this, the shared-pool benefit cannot be distinguished from workload selection effects.

Authors: We agree that a direct measurement of peak KV-cache demand overlap would strengthen the motivation for the shared pool. Our evaluated workloads are constructed from real-world traces of bursty long-context requests across multiple cold MoE models, where models are accessed sparsely. However, to directly address this, we will include in the revision a new analysis (e.g., a figure showing cumulative distribution of simultaneous peak demands across models) based on the workload traces used in our experiments. This will quantify the overlap frequency and support the claim that aggregate provisioning suffices. revision: yes
Referee: [Evaluation] The layer-wise pipeline scheduler and persistent kernels are presented as compensating for hidden-state transfers and control overhead, yet the manuscript does not quantify the residual latency when the KV-cache pool itself becomes contended (i.e., when aggregate demand exceeds provisioned capacity). A load-bearing experiment comparing contended vs. non-contended regimes is required to substantiate that the scheduler can still meet per-request latency targets.

Authors: The referee correctly identifies that our evaluation focuses on non-contended scenarios where the shared pool provisions aggregate demand. To substantiate the scheduler's effectiveness under contention, we will add experiments in the revised version that deliberately over-subscribe the KV-cache pool (e.g., by increasing request rates until aggregate demand exceeds capacity) and measure the resulting P99 TBT and whether latency targets are maintained. This will demonstrate the limits and robustness of the pipeline scheduler and persistent kernels. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical system evaluation with no derivations or self-referential predictions

full rationale

The paper describes a serving engine (CrossPool) with memory disaggregation, a planner/virtualizer, scheduler, and kernels, then reports empirical latency gains (e.g., 10.4× P99 TBT) against a baseline. No equations, fitted parameters, or predictions appear; the central claim is an observed outcome of the implementation under stated traffic assumptions. No self-citations are invoked as load-bearing uniqueness theorems. The design is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the domain assumption that peak KV demands across cold models are sufficiently uncorrelated.

axioms (1)

domain assumption Cold models rarely reach peak KV-cache demand simultaneously
Stated in the motivation section of the abstract as the premise enabling shared-pool savings.

pith-pipeline@v0.9.1-grok · 5820 in / 1215 out tokens · 19224 ms · 2026-06-25T22:50:39.828203+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, ...

work page doi:10.18653/v1/2023.emnlp- 2023
[2]

Alibaba Cloud. 2026. Alibaba Cloud Model Studio.https://modelstudio. alibabacloud.com. Accessed: 2026-05-08

2026
[3]

anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered. Accessed: 2026-05-08

2023
[4]

Anthropic. 2026. Claude Code.https://claude.com/product/claude- code. Accessed: 2026-05-08

2026
[5]

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. InEMNLP (Findings) (Findings of ACL). Association for Computational Linguistics, 1376–1395

2024
[6]

ByteDance. 2026. Volcano Engine.https://www.volcengine.com. Ac- cessed: 2026-05-08

2026
[7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang
[8]

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.CoRRabs/2510.09665 (2025)

arXiv 2025
[9]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Tianle Li, et al . 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chat- GPT Quality.https://lmsys.org/blog/2023-03-30-vicuna/. Accessed: 2026-05-08

2023
[10]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wen- feng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specializa- tion in Mixture-of-Experts Language Models. InACL (1). Association for Computational Lingui...

2024
[11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
[12]

InNeurIPS

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InNeurIPS
[13]

DeepInfra. 2026. DeepInfra.https://deepinfra.com. Accessed: 2026- 05-08

2026
[14]

DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.CoRRabs/2405.04434 (2024). arXiv:2405.04434 doi:10.48550/ARXIV.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
[15]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRR abs/2412.19437 (2024)

Pith/arXiv arXiv 2024
[16]

DeepSeek-AI. 2026. DeepSeek.https://chat.deepseek.com. Accessed: 2026-05-08

2026
[17]

DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million- Token Context Intelligence.https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical Report

2026
[18]

Hui Dong and Marvin K Nakayama. 2018. A tutorial on quantile estimation via Monte Carlo. InInternational Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer, 3– 30

2018
[19]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net, 11905–11917

2024
[20]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.J. Mach. Learn. Res.23 (2022), 120:1–120:39.https: //jmlr.org/papers/v23/21-0998.html 8 Conference’17, July 2017, Washington, DC, USA

2022
[21]

Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu
[22]

InUSENIX ATC

Weaver: Efficient Multi-LLM Serving with Attention Offloading. InUSENIX ATC. USENIX Association, 587–595
[23]

Google. 2026. Gemini.https://gemini.google.com. Accessed: 2026-05- 08

2026
[24]

StepFun Inc. 2025. Step-3 is Large yet Affordable: Model-system Co- design for Cost-effective Decoding.CoRRabs/2507.19427 (2025)

arXiv 2025
[25]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024
[26]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica
[27]

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InSOSP. ACM, 611–626
[28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InOSDI. USENIX Association, 155–172

2024
[29]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In9th International Confer- ence on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.https://o...

2021
[30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InSIGCOMM. ACM, 38–56

2024
[31]

NVIDIA. 2025. NVSHMEM: GPU Programming Interface for Scalable Communication.https://docs.nvidia.com/nvshmem. Accessed: 2026- 05-08

2025
[32]

NVIDIA. 2026. CUDA Virtual Memory Management (VMM).https:// docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html. Ac- cessed: 2026-05-08

2026
[33]

OpenAI. 2026. ChatGPT.https://chatgpt.com. Accessed: 2026-05-08

2026
[34]

OpenAI. 2026. Codex.https://openai.com/codex. Accessed: 2026-05-08

2026
[35]

OpenClaw Contributors. 2026. OpenClaw.https://openclaw.ai. Ac- cessed: 2026-05-08

2026
[36]

OpenRouter. 2026. OpenRouter.https://openrouter.ai. Accessed: 2026-05-08

2026
[37]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. InFAST. USENIX Association, 155–170

2025
[38]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.CoRRabs/1911.02150 (2019)

Pith/arXiv arXiv 2019
[39]

Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025)

Pith/arXiv arXiv 2025
[40]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. In SOSP. ACM, 1030–1045

2025
[41]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InMLSys. OpenReview.net/mlsys.org

2025
[42]

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, and Ying Sheng. 2026. Chimera: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning. In USENIX OSDI. USENIX Association

2026
[43]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InNeurIPS

2024
[44]

Zhipu AI and Tsinghua University. 2024. LongAlign-10k.https:// huggingface.co/datasets/zai-org/LongAlign-10k

2024
[45]

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism. ...

2025

[1] [1]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, ...

work page doi:10.18653/v1/2023.emnlp- 2023

[2] [2]

Alibaba Cloud. 2026. Alibaba Cloud Model Studio.https://modelstudio. alibabacloud.com. Accessed: 2026-05-08

2026

[3] [3]

anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered. Accessed: 2026-05-08

2023

[4] [4]

Anthropic. 2026. Claude Code.https://claude.com/product/claude- code. Accessed: 2026-05-08

2026

[5] [5]

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. InEMNLP (Findings) (Findings of ACL). Association for Computational Linguistics, 1376–1395

2024

[6] [6]

ByteDance. 2026. Volcano Engine.https://www.volcengine.com. Ac- cessed: 2026-05-08

2026

[7] [7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang

[8] [8]

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.CoRRabs/2510.09665 (2025)

arXiv 2025

[9] [9]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Tianle Li, et al . 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chat- GPT Quality.https://lmsys.org/blog/2023-03-30-vicuna/. Accessed: 2026-05-08

2023

[10] [10]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wen- feng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specializa- tion in Mixture-of-Experts Language Models. InACL (1). Association for Computational Lingui...

2024

[11] [11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

[12] [12]

InNeurIPS

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InNeurIPS

[13] [13]

DeepInfra. 2026. DeepInfra.https://deepinfra.com. Accessed: 2026- 05-08

2026

[14] [14]

DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.CoRRabs/2405.04434 (2024). arXiv:2405.04434 doi:10.48550/ARXIV.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024

[15] [15]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRR abs/2412.19437 (2024)

Pith/arXiv arXiv 2024

[16] [16]

DeepSeek-AI. 2026. DeepSeek.https://chat.deepseek.com. Accessed: 2026-05-08

2026

[17] [17]

DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million- Token Context Intelligence.https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical Report

2026

[18] [18]

Hui Dong and Marvin K Nakayama. 2018. A tutorial on quantile estimation via Monte Carlo. InInternational Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer, 3– 30

2018

[19] [19]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net, 11905–11917

2024

[20] [20]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.J. Mach. Learn. Res.23 (2022), 120:1–120:39.https: //jmlr.org/papers/v23/21-0998.html 8 Conference’17, July 2017, Washington, DC, USA

2022

[21] [21]

Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu

[22] [22]

InUSENIX ATC

Weaver: Efficient Multi-LLM Serving with Attention Offloading. InUSENIX ATC. USENIX Association, 587–595

[23] [23]

Google. 2026. Gemini.https://gemini.google.com. Accessed: 2026-05- 08

2026

[24] [24]

StepFun Inc. 2025. Step-3 is Large yet Affordable: Model-system Co- design for Cost-effective Decoding.CoRRabs/2507.19427 (2025)

arXiv 2025

[25] [25]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024

[26] [26]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

[27] [27]

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InSOSP. ACM, 611–626

[28] [28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InOSDI. USENIX Association, 155–172

2024

[29] [29]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In9th International Confer- ence on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.https://o...

2021

[30] [30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InSIGCOMM. ACM, 38–56

2024

[31] [31]

NVIDIA. 2025. NVSHMEM: GPU Programming Interface for Scalable Communication.https://docs.nvidia.com/nvshmem. Accessed: 2026- 05-08

2025

[32] [32]

NVIDIA. 2026. CUDA Virtual Memory Management (VMM).https:// docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html. Ac- cessed: 2026-05-08

2026

[33] [33]

OpenAI. 2026. ChatGPT.https://chatgpt.com. Accessed: 2026-05-08

2026

[34] [34]

OpenAI. 2026. Codex.https://openai.com/codex. Accessed: 2026-05-08

2026

[35] [35]

OpenClaw Contributors. 2026. OpenClaw.https://openclaw.ai. Ac- cessed: 2026-05-08

2026

[36] [36]

OpenRouter. 2026. OpenRouter.https://openrouter.ai. Accessed: 2026-05-08

2026

[37] [37]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. InFAST. USENIX Association, 155–170

2025

[38] [38]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.CoRRabs/1911.02150 (2019)

Pith/arXiv arXiv 2019

[39] [39]

Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025)

Pith/arXiv arXiv 2025

[40] [40]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. In SOSP. ACM, 1030–1045

2025

[41] [41]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InMLSys. OpenReview.net/mlsys.org

2025

[42] [42]

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, and Ying Sheng. 2026. Chimera: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning. In USENIX OSDI. USENIX Association

2026

[43] [43]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InNeurIPS

2024

[44] [44]

Zhipu AI and Tsinghua University. 2024. LongAlign-10k.https:// huggingface.co/datasets/zai-org/LongAlign-10k

2024

[45] [45]

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism. ...

2025