CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Chunming Hu; Dinghao Xue; Mingming Zhang; Renyu Yang; Tianyu Wo; Yuchen Teng; Zhuoren Ye

arxiv: 2606.24506 · v2 · pith:WSGHWPLFnew · submitted 2026-06-23 · 💻 cs.DC · cs.AI· cs.LG· cs.PF

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Zhuoren Ye , Tianyu Wo , Dinghao Xue , Mingming Zhang , Yuchen Teng , Chunming Hu , Renyu Yang This is my paper

Pith reviewed 2026-06-29 05:27 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LGcs.PF

keywords Multi-LLM servingCold MoE modelsKV-cache disaggregationWeight poolingGPU memory managementPipeline schedulerLong-context inferenceTail latency

0 comments

The pith

Disaggregating weights and KV-cache into separate GPU pools cuts P99 TBT up to 10.4x for cold MoE models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cold MoE models receive sparse requests and remain cold, so reserving worst-case KV-cache per model wastes memory when weights and KV-cache share one pool. CrossPool splits them into a weights pool that consolidates FFN weights across models and a KV-cache pool that allocates based on aggregate active demand. A planner, virtualizer, layer-wise pipeline scheduler, and persistent kernels hide transfers and lower control overhead. This design improves memory utilization and long-context support under low-concurrency traffic. The system outperforms prior KV-cached multi-LLM serving by reducing tail time-between-tokens up to 10.4 times.

Core claim

CrossPool separates FFN weights into a consolidated pool and KV-cache into a dynamic pool. The weights pool holds stable model parameters across cold MoE instances while the KV-cache pool provisions only the current aggregate demand. A KV-cache planner and virtualizer manage allocation, a layer-wise pipeline scheduler overlaps hidden-state movement, and persistent kernels with control lowering cut CPU-GPU overhead, allowing attention to stay local to the KV pool.

What carries the argument

The two-pool disaggregation of weights and KV-cache, managed by a KV-cache planner, virtualizer, and layer-wise pipeline scheduler that hides transfers.

If this is right

KV-cache capacity can be shared across models using total active demand instead of per-model peaks.
Attention computation stays local to the KV-cache pool, exposing more replicated capacity even at low concurrency.
Long-context requests become feasible without pre-reserving worst-case memory per model.
GPU memory utilization rises for the bursty, sparse request patterns typical of cold models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation could help dense models when KV demand fluctuates across instances.
Persistent kernels and lowered control overhead may reduce latency in other inference serving stacks.
Measuring exact transfer volume versus memory savings at different concurrency levels would reveal the operating range where the design wins.

Load-bearing premise

The cost of moving hidden states between the two pools plus the scheduler overhead stays small enough not to erase the gains from higher memory utilization under low-concurrency loads.

What would settle it

Run a low-concurrency workload with many cold MoE models and measure whether P99 TBT and throughput improve over a monolithic baseline; if they do not, the disaggregation benefit is offset by transfer costs.

Figures

Figures reproduced from arXiv: 2606.24506 by Chunming Hu, Dinghao Xue, Mingming Zhang, Renyu Yang, Tianyu Wo, Yuchen Teng, Zhuoren Ye.

**Figure 1.** Figure 1: Cold-model underutilization and accumulated KVcache usage under low RPS. Subfigure (a) summarizes data from OpenRouter [32], and (b) stacks active KV-cache bytes for four 7B models at 0.2 RPS over one hour. • KV-cache planner and virtualizer. CrossPool plans the shared KV-cache pool budget and parallelism offline, then exposes the pool through virtualized paging [28]. • Layer-wise pipeline scheduler. Cros… view at source ↗

**Figure 2.** Figure 2: KV-cache availability when serving a single request on 4 GPUs. Comparison of monolithic and disaggregated memory pools for weights and KV-cache. n_heads values of MHA, GQA and MQA are 4, 2 and 1, respectively. the pool to a high percentile of aggregate demand instead of the worst-case load for each model. 2.2 Mismatch between Algorithms and Systems Recent LLMs use diverse attention algorithms (e.g., GQA … view at source ↗

**Figure 3.** Figure 3: CrossPool system architecture every pool crossing, for every layer, and for every generated token. Even though each transfer is much smaller than moving KV-cache tensors, the repeated transfers accumulate into non-negligible communication overhead. C3: Increased graph capture complexity under mixed scheduling. Modern serving engines rely on CUDA graph capture to reduce launch overhead, but disaggregated e… view at source ↗

**Figure 4.** Figure 4: Layer-wise pipeline scheduler. It interleaves attention and FFN layers of two batches, allowing attention and FFN to be executed simultaneously on different batches from different models. Early exit is supported when one batch finishes all its layers. CUDA VMM APIs [28] to reserve a virtual KV address range for each model and map physical KV pages on demand. Attention operators see a normal paged KV-cach… view at source ↗

**Figure 5.** Figure 5: Persistent kernels for efficient graph execution. across the pool boundary. CrossPool then uses persistent kernels and control lowering to keep frequent scheduling and communication control on GPUs, reducing host intervention and CPU-GPU control transitions. Fig. 5b illustrates the design. CrossPool captures supported attention and FFN subgraphs during warmup and passes their graph handles to GPU-residen… view at source ↗

**Figure 6.** Figure 6: Maximum aggregate request rate estimated from sampled LongAlign context-length bins within each model’s nominal context window; vertical drops mark per-system capacity limits. Prompts are truncated to maximum context length of the two models [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Decode-side TBT on ShareGPT traces from 0.2 to 1.0 RPS per model. 5.2 Context-length Scalability [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

read the original abstract

Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to 10.4x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CrossPool's weight/KV disaggregation plus layer-wise scheduler is a concrete new serving design for cold MoE traffic, but the 10.4x claim rests on unshown measurements of transfer overhead under sparse load.

read the letter

The paper's core move is to split FFN weights into one GPU pool and KV-cache into another, then add a planner, virtualizer, and layer-wise pipeline scheduler to move hidden states without killing latency. This directly targets the memory waste that happens when many sparse MoE models sit cold and rarely hit peak KV demand at once. The approach is new in how it combines disaggregation with persistent kernels and control lowering for this exact workload.

It improves on monolithic pooling by letting the KV pool provision only aggregate active demand instead of per-model worst-case capacity. The layer-wise scheduler is the part meant to hide cross-pool movement, which matters because cold traffic gives little concurrency to overlap with.

The main weakness is that the abstract states the 10.4x P99 TBT win but supplies no workload traces, hardware specs, or breakdown of transfer time versus compute time. The stress-test point lands: under the low-concurrency regime the paper itself calls out, there may not be enough parallel requests or layers to mask even modest NVLink transfers. If the full paper shows per-layer transfer costs staying well below attention/FFN time with actual numbers, the claim holds; right now that evidence is missing.

This is for systems builders who run fleets of MoE models with bursty, mostly-idle traffic. It is worth a serious referee because the architecture is distinct from prior KV-cache work and the problem is practical, even though the current write-up leaves the central performance assumption unverified.

Referee Report

1 major / 1 minor

Summary. The paper presents CrossPool, a multi-LLM serving engine for cold MoE models that disaggregates FFN weights into a consolidated weights pool and KV-cache into a separate dynamic pool. It introduces a KV-cache planner/virtualizer, a layer-wise pipeline scheduler to overlap hidden-state transfers, and persistent kernels with control lowering. The central empirical claim is that this design outperforms the state-of-the-art KV-cache-based multi-LLM serving system, reducing P99 TBT by up to 10.4x while better supporting bursty long-context requests under low-concurrency traffic.

Significance. If the performance results hold under representative workloads, the disaggregation approach could meaningfully improve GPU memory utilization for sparse MoE deployments where monolithic pooling wastes capacity on worst-case KV reservations. The engineering focus on hiding cross-pool transfers via scheduling is a practical contribution to systems for cold models.

major comments (1)

[Evaluation] Evaluation section (and associated figures/tables reporting the 10.4x P99 TBT result): the central claim that disaggregation delivers net gains rests on the layer-wise pipeline scheduler successfully masking hidden-state transfer latency. The manuscript should include an explicit ablation or per-layer timing breakdown (e.g., transfer time vs. FFN/attention compute time) under the low-concurrency, sparse-request workloads that define the motivating case; without this, it is not possible to confirm that the reported speedup is not offset by cross-pool movement costs.

minor comments (1)

[Abstract] Abstract: experimental setup details (workload traces, hardware configuration, baseline implementation, and overhead measurements) are absent, making it difficult for readers to assess the 10.4x claim without immediately turning to the full evaluation section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation. We address the concern about substantiating the layer-wise pipeline scheduler's effectiveness below.

read point-by-point responses

Referee: [Evaluation] Evaluation section (and associated figures/tables reporting the 10.4x P99 TBT result): the central claim that disaggregation delivers net gains rests on the layer-wise pipeline scheduler successfully masking hidden-state transfer latency. The manuscript should include an explicit ablation or per-layer timing breakdown (e.g., transfer time vs. FFN/attention compute time) under the low-concurrency, sparse-request workloads that define the motivating case; without this, it is not possible to confirm that the reported speedup is not offset by cross-pool movement costs.

Authors: We agree that an explicit ablation study and per-layer timing breakdown would provide stronger evidence that the reported speedups are not offset by transfer costs. In the revised manuscript we will add a dedicated subsection (and associated figure) presenting per-layer measurements of hidden-state transfer time versus FFN/attention compute time, plus an ablation comparing the full system against a variant without the layer-wise scheduler, all evaluated under the low-concurrency, bursty long-context workloads that motivate the work. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical systems paper with no self-referential derivations

full rationale

The paper describes a serving architecture (disaggregated weights/KV pools, KV planner, layer-wise pipeline scheduler, persistent kernels) and supports its performance claims (up to 10.4x P99 TBT reduction) via experimental comparisons against baselines. No equations, fitted parameters, or derivations appear that reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on measured overheads and utilization gains under cold MoE workloads, which are externally falsifiable via replication on the described hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems paper; no mathematical free parameters, axioms, or invented physical entities are introduced. The design rests on standard GPU memory management assumptions and the empirical observation that cold-model KV peaks do not coincide.

pith-pipeline@v0.9.1-grok · 5820 in / 1172 out tokens · 20574 ms · 2026-06-29T05:27:35.747557+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, ...

work page doi:10.18653/v1/2023.emnlp- 2023
[2]

Alibaba Cloud. 2026. Alibaba Cloud Model Studio.https://modelstudio. alibabacloud.com. Accessed: 2026-05-08

2026
[3]

anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered. Accessed: 2026-05-08

2023
[4]

Anthropic. 2026. Claude Code.https://claude.com/product/claude- code. Accessed: 2026-05-08

2026
[5]

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. InEMNLP (Findings) (Findings of ACL). Association for Computational Linguistics, 1376–1395

2024
[6]

ByteDance. 2026. Volcano Engine.https://www.volcengine.com. Ac- cessed: 2026-05-08

2026
[7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang
[8]

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.CoRRabs/2510.09665 (2025)

work page arXiv 2025
[9]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Tianle Li, et al . 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chat- GPT Quality.https://lmsys.org/blog/2023-03-30-vicuna/. Accessed: 2026-05-08

2023
[10]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wen- feng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specializa- tion in Mixture-of-Experts Language Models. InACL (1). Association for Computational Lingui...

2024
[11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
[12]

InNeurIPS

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InNeurIPS
[13]

DeepInfra. 2026. DeepInfra.https://deepinfra.com. Accessed: 2026- 05-08

2026
[14]

DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.CoRRabs/2405.04434 (2024). arXiv:2405.04434 doi:10.48550/ARXIV.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
[15]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRR abs/2412.19437 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

DeepSeek-AI. 2026. DeepSeek.https://chat.deepseek.com. Accessed: 2026-05-08

2026
[17]

DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million- Token Context Intelligence.https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical Report

2026
[18]

Hui Dong and Marvin K Nakayama. 2018. A tutorial on quantile estimation via Monte Carlo. InInternational Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer, 3– 30

2018
[19]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net, 11905–11917

2024
[20]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.J. Mach. Learn. Res.23 (2022), 120:1–120:39.https: //jmlr.org/papers/v23/21-0998.html 8 Conference’17, July 2017, Washington, DC, USA

2022
[21]

Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu
[22]

InUSENIX ATC

Weaver: Efficient Multi-LLM Serving with Attention Offloading. InUSENIX ATC. USENIX Association, 587–595
[23]

Google. 2026. Gemini.https://gemini.google.com. Accessed: 2026-05- 08

2026
[24]

StepFun Inc. 2025. Step-3 is Large yet Affordable: Model-system Co- design for Cost-effective Decoding.CoRRabs/2507.19427 (2025)

work page arXiv 2025
[25]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024
[26]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica
[27]

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InSOSP. ACM, 611–626
[28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InOSDI. USENIX Association, 155–172

2024
[29]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In9th International Confer- ence on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.https://o...

2021
[30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InSIGCOMM. ACM, 38–56

2024
[31]

NVIDIA. 2025. NVSHMEM: GPU Programming Interface for Scalable Communication.https://docs.nvidia.com/nvshmem. Accessed: 2026- 05-08

2025
[32]

NVIDIA. 2026. CUDA Virtual Memory Management (VMM).https:// docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html. Ac- cessed: 2026-05-08

2026
[33]

OpenAI. 2026. ChatGPT.https://chatgpt.com. Accessed: 2026-05-08

2026
[34]

OpenAI. 2026. Codex.https://openai.com/codex. Accessed: 2026-05-08

2026
[35]

OpenClaw Contributors. 2026. OpenClaw.https://openclaw.ai. Ac- cessed: 2026-05-08

2026
[36]

OpenRouter. 2026. OpenRouter.https://openrouter.ai. Accessed: 2026-05-08

2026
[37]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. InFAST. USENIX Association, 155–170

2025
[38]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.CoRRabs/1911.02150 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[39]

Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. In SOSP. ACM, 1030–1045

2025
[41]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InMLSys. OpenReview.net/mlsys.org

2025
[42]

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, and Ying Sheng. 2026. Chimera: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning. In USENIX OSDI. USENIX Association

2026
[43]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InNeurIPS

2024
[44]

Zhipu AI and Tsinghua University. 2024. LongAlign-10k.https:// huggingface.co/datasets/zai-org/LongAlign-10k

2024
[45]

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism. ...

2025

[1] [1]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, ...

work page doi:10.18653/v1/2023.emnlp- 2023

[2] [2]

Alibaba Cloud. 2026. Alibaba Cloud Model Studio.https://modelstudio. alibabacloud.com. Accessed: 2026-05-08

2026

[3] [3]

anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered. Accessed: 2026-05-08

2023

[4] [4]

Anthropic. 2026. Claude Code.https://claude.com/product/claude- code. Accessed: 2026-05-08

2026

[5] [5]

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. InEMNLP (Findings) (Findings of ACL). Association for Computational Linguistics, 1376–1395

2024

[6] [6]

ByteDance. 2026. Volcano Engine.https://www.volcengine.com. Ac- cessed: 2026-05-08

2026

[7] [7]

Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang

[8] [8]

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.CoRRabs/2510.09665 (2025)

work page arXiv 2025

[9] [9]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Tianle Li, et al . 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chat- GPT Quality.https://lmsys.org/blog/2023-03-30-vicuna/. Accessed: 2026-05-08

2023

[10] [10]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wen- feng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specializa- tion in Mixture-of-Experts Language Models. InACL (1). Association for Computational Lingui...

2024

[11] [11]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

[12] [12]

InNeurIPS

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InNeurIPS

[13] [13]

DeepInfra. 2026. DeepInfra.https://deepinfra.com. Accessed: 2026- 05-08

2026

[14] [14]

DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.CoRRabs/2405.04434 (2024). arXiv:2405.04434 doi:10.48550/ARXIV.2405.04434

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024

[15] [15]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRR abs/2412.19437 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

DeepSeek-AI. 2026. DeepSeek.https://chat.deepseek.com. Accessed: 2026-05-08

2026

[17] [17]

DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million- Token Context Intelligence.https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical Report

2026

[18] [18]

Hui Dong and Marvin K Nakayama. 2018. A tutorial on quantile estimation via Monte Carlo. InInternational Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer, 3– 30

2018

[19] [19]

Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net, 11905–11917

2024

[20] [20]

William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.J. Mach. Learn. Res.23 (2022), 120:1–120:39.https: //jmlr.org/papers/v23/21-0998.html 8 Conference’17, July 2017, Washington, DC, USA

2022

[21] [21]

Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu

[22] [22]

InUSENIX ATC

Weaver: Efficient Multi-LLM Serving with Attention Offloading. InUSENIX ATC. USENIX Association, 587–595

[23] [23]

Google. 2026. Gemini.https://gemini.google.com. Accessed: 2026-05- 08

2026

[24] [24]

StepFun Inc. 2025. Step-3 is Large yet Affordable: Model-system Co- design for Cost-effective Decoding.CoRRabs/2507.19427 (2025)

work page arXiv 2025

[25] [25]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024

[26] [26]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

[27] [27]

Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InSOSP. ACM, 611–626

[28] [28]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InOSDI. USENIX Association, 155–172

2024

[29] [29]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In9th International Confer- ence on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.https://o...

2021

[30] [30]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InSIGCOMM. ACM, 38–56

2024

[31] [31]

NVIDIA. 2025. NVSHMEM: GPU Programming Interface for Scalable Communication.https://docs.nvidia.com/nvshmem. Accessed: 2026- 05-08

2025

[32] [32]

NVIDIA. 2026. CUDA Virtual Memory Management (VMM).https:// docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html. Ac- cessed: 2026-05-08

2026

[33] [33]

OpenAI. 2026. ChatGPT.https://chatgpt.com. Accessed: 2026-05-08

2026

[34] [34]

OpenAI. 2026. Codex.https://openai.com/codex. Accessed: 2026-05-08

2026

[35] [35]

OpenClaw Contributors. 2026. OpenClaw.https://openclaw.ai. Ac- cessed: 2026-05-08

2026

[36] [36]

OpenRouter. 2026. OpenRouter.https://openrouter.ai. Accessed: 2026-05-08

2026

[37] [37]

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. InFAST. USENIX Association, 155–170

2025

[38] [38]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.CoRRabs/1911.02150 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019

[39] [39]

Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. In SOSP. ACM, 1030–1045

2025

[41] [41]

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InMLSys. OpenReview.net/mlsys.org

2025

[42] [42]

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, and Ying Sheng. 2026. Chimera: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning. In USENIX OSDI. USENIX Association

2026

[43] [43]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InNeurIPS

2024

[44] [44]

Zhipu AI and Tsinghua University. 2024. LongAlign-10k.https:// huggingface.co/datasets/zai-org/LongAlign-10k

2024

[45] [45]

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism. ...

2025