pith. sign in

arxiv: 2605.21100 · v1 · pith:37ZQDNJOnew · submitted 2026-05-20 · 💻 cs.DC

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

Pith reviewed 2026-05-21 01:47 UTC · model grok-4.3

classification 💻 cs.DC
keywords mixture of expertscontext parallelismdata parallelismexpert parallelismkv cache managementmodel servingtail latencyservice level objectives
0
0 comments X

The pith

NanoCP uses request-level dynamic context parallelism to balance KV cache and batch sizes in MoE serving, achieving higher request rates and lower tail latency under strict SLOs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern MoE serving systems struggle with balancing attention latency, which grows with KV cache size, and MoE communication latency, which grows with batch size, when requests are bound to single instances. NanoCP introduces dynamic context parallelism that assigns a context-parallel degree to each request proportional to its KV footprint, spreading long-context requests across instances while keeping short ones local. This decouples MoE communication from KV cache placement and liquefies the KV cache across the cluster for better balance. An ahead-of-time graph engine with a custom routing-based communication backend makes this dynamic parallelism compatible with static execution. The approach delivers up to 3.27 times higher request rates under TPOT SLOs and reduces P99 tail latency by up to 2.12 times.

Core claim

By decoupling MoE communication from KV cache placement via dynamic context parallelism, where each request receives a context-parallel degree sized to its KV footprint, NanoCP balances per-instance KV occupancy and batch sizes. Long requests distribute attention across multiple instances and short requests stay local. This is enabled by an ahead-of-time graph engine paired with a custom routing-based communication backend that bridges the dynamic parallelism to static execution without prohibitive overheads.

What carries the argument

Dynamic context parallelism (DCP) that assigns each request a variable context-parallel degree based on its KV cache size to balance attention computation and expert communication across the cluster.

If this is right

  • Maintains up to 1.88×–3.27× higher request rates under strict TPOT SLOs.
  • Reduces P99 tail latency by up to 1.79×–2.12× by mitigating EP stragglers.
  • Balances KV cache occupancy and batch sizes across instances without additional load-balancing costs.
  • Effectively liquefies the KV cache across the cluster for long-context requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying this per-request dynamic parallelism could improve serving efficiency for other large language models with varying context lengths.
  • Future systems might integrate similar decoupling mechanisms to handle memory fragmentation in distributed inference setups.
  • Testing DCP on different hardware configurations could reveal scalability limits of the AOT graph engine.

Load-bearing premise

That an ahead-of-time graph engine paired with a custom routing-based communication backend can bridge per-request dynamic context parallelism to static execution without introducing prohibitive overheads that would erase the reported gains.

What would settle it

A direct measurement of the communication and execution overhead introduced by the AOT graph engine and custom backend when DCP is active, compared against a baseline without dynamic parallelism.

Figures

Figures reproduced from arXiv: 2605.21100 by Binbin Lin, Dahua Lin, Haojie Duanmu, Hao Liu, Hui Wang, Jiangfei Duan, Jiefei Chen, Jinming Ma, Qinxiu Cheng, Xingcheng Zhang, Xiuhong Li, Zhilin Pei.

Figure 1
Figure 1. Figure 1: Existing load balancing strategies: (a1) least cache routes each incoming request to the instance with the most free KV blocks, (a2) least batch routes it to the instance with the smallest running batch size, and (b) Helix applies a uniform CP degree to all requests; (c) NanoCP. assigns larger CP degrees to long requests to better balance attention computation, while assigning smaller CP degrees to short r… view at source ↗
Figure 2
Figure 2. Figure 2: Data-expert parallel architecture. 2 Background and Motivation 2.1 Variable-Length Workloads Modern LLM serving systems must handle requests with highly variable sequence lengths, ranging from short con￾versational prompts [64, 72, 74] to much longer multi-modal and agentic workloads [15, 29, 45]. As shown by the Open￾Router statistics in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Microbenchmarks of attention and DeepEP com￾munication latency [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Two types of imbalance in MoE decode (DS/DR: Dispatch Send/Receive; CS/CR: Combine Send/Receive). requests while DP instance 1 handles 98. DP instance 0 there￾fore spends much longer in dispatch receive (DR), delaying its combine send (CS). As a result, DP instance 1, despite finishing its MLP and combine send quickly, is blocked be￾fore combine receive (CR) and remains idle. This straggler arises from bat… view at source ↗
Figure 5
Figure 5. Figure 5: Limitations of existing request-level load balancing strategies. 8K×128 16K×64 32K×32 64K×16 128K×8 256K×4 512K×2 1M×1 Sequence length × batch size per GPU 0 600 1,200 1,800 2,400 Latency (us) Attention Computation CP Communication DP CP2 CP4 CP8 [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-layer attention latency breakdown of Helix CP under varying sequence lengths and batch sizes (total ∼1M tokens per GPU). also varies significantly under this policy: most instances remain around 40–55 𝜇s, while the straggler reaches 539.5 𝜇s. The mean is 184.3 𝜇s, indicating that balancing attention computation would reduce the maximum latency by 65.8%. Head-of-line blocking. Request-level routing also… view at source ↗
Figure 8
Figure 8. Figure 8: NanoCP system overview. NanoCP keeps one MoE binding per request but chooses the KV binding size dynamically, allowing long requests to use CP while keeping short requests local. To jointly balance attention computation and MoE com￾munication with low overhead, we propose dual-balanced scheduling with dynamic context parallelism (DCP). The key idea is to decouple a request’s MoE binding from its KV bind￾in… view at source ↗
Figure 9
Figure 9. Figure 9: Routing table derivation. (a) Request partition with MoE binding assignment (parenthesized). (b) Binding configuration. (c)–(d) Per-instance Q-Route and Res-Route tables for Instance 2. 4.2.3 Intra-node Placement and KV Partitioning. Within the selected node, the scheduler first chooses the instance that will execute the request’s MoE communication (line 9). It then selects the remaining participating inst… view at source ↗
Figure 11
Figure 11. Figure 11: The micro-architecture of the routing-based com￾munication backend. 5.3 Routing-based Communication Backend To efficiently execute the dynamic routing between decou￾pled KV binding and MoE binding within CUDA graphs, existing communication libraries face a fundamental mis￾match. Specifically, they suffer from two primary limitations. First, general collective libraries [10, 24, 40, 47] are primarily optim… view at source ↗
Figure 12
Figure 12. Figure 12: The end-to-end generation performance in H200 platform. DCP CP8 CP4 CP2 DP 0 500 1000 1500 2000 L a t e n c y (μ s) 2.2x 1.8x 1.6x 1.3x Ours vLLM Long=1 DCP CP8 CP4 CP2 DP 2x 1.6x 1.4x 1.2x Ours vLLM Long=3 DCP CP8 CP4 CP2 DP 1.9x 1.6x 1.4x 1.2x Ours vLLM Long=5 DCP CP8 CP4 CP2 DP 1.9x 1.6x 1.3x 1.1x Ours vLLM Long=7 Attention Dispatch+Combine CP Cost Others [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Micro-benchmark batch size to execute dispatch and combine, achieving lower batch size imbalance (8.54% vs. 47.40%). High load [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: (a) Batch size and KV cache load balance across 32 instances under low load. (b) Head-of-line blocking under high load: free memory blocks vs. head-of-queue demand per node. 0 1000 2000 3000 ShareGPT4o Ours (DCP) vLLM (CP8) vLLM (DP-LeastBatch) vLLM (DP-LeastCache) 0 1000 2000 Issue1% 0 8 16 24 31 0 500 1000 1500 Issue5% 0 8 16 24 31 0 8 16 24 31 0 8 16 24 31 Instance ID L a t e n c y (μ s) Attention Disp… view at source ↗
Figure 17
Figure 17. Figure 17: Performance of NanoCP’s routing-based back￾end vs. NCCL for Query all-to-all transfer. 0 5 10 CP>1 Cnt CP=1 CP=4 CP=5 CP=6 CP=7 CP=8 6000 7000 8000 9000 10000 Decode Iter 20 25 30 35 A2A Lat. (us) DP0 DP1 DP2 DP3 0 1000 2000 CP=1 Cnt [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗
Figure 16
Figure 16. Figure 16: Schedule overhead comparison [PITH_FULL_IMAGE:figures/full_fig_p012_16.png] view at source ↗
read the original abstract

Modern serving systems for Mixture-of-Experts (MoE) models adopt hybrid data-expert parallelism: expert parallelism (EP) shards experts across GPUs to scale capacity, while data parallelism (DP) replicates attention layers across instances to process independent requests. Existing systems bind each request's attention, MoE communication, and KV cache to a single instance. Because attention latency scales with KV cache size while MoE communication latency scales with batch size, this binding cannot balance both simultaneously, producing EP stragglers; it also fragments KV memory across instances, inflating tail latency under long contexts. While existing context parallelism (CP) mitigates these constraints, its uniform parallelism degree incurs prohibitive communication and attention-side overheads. We present \work, which decouples MoE communication from KV cache placement and achieves dual balance through dynamic context parallelism (DCP). DCP assigns each request a context-parallel degree sized to its KV footprint: long requests distribute attention across multiple instances; short requests remain local. This dynamic parallelism effectively liquefies the KV cache across the cluster, balancing both the per-instance KV cache occupancy and batch sizes without unnecessary load-balancing costs. To bridge DCP with static execution, \work introduces an ahead-of-time (AOT) graph engine paired with a custom routing-based communication backend. Experimental results show that \work maintains up to $1.88\times$--$3.27\times$ higher request rates under strict time-per-output-token (TPOT) service level objectives (SLOs). Furthermore, \work significantly mitigates stragglers, reducing P99 tail latency by up to $1.79\times$--$2.12\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces NanoCP for serving Mixture-of-Experts models under hybrid data-expert parallelism. It proposes request-level dynamic context parallelism (DCP) that assigns per-request context-parallel degrees based on KV footprint, decoupling MoE communication from KV cache placement. An ahead-of-time graph engine and custom routing-based communication backend are used to map dynamic assignments to static execution. The central claims are up to 1.88×–3.27× higher request rates under strict TPOT SLOs and up to 1.79×–2.12× reduction in P99 tail latency.

Significance. If validated, the result would be significant for large-scale MoE inference serving. It directly targets the tension between attention latency (scaling with KV size) and MoE communication latency (scaling with batch size) that produces stragglers and fragmented KV memory in existing systems. The dynamic, request-granular approach to context parallelism, combined with the AOT engine to preserve static execution, offers a concrete path to better cluster-wide balance without uniform parallelism overheads.

major comments (2)
  1. [Evaluation] Evaluation section: the central performance claims (1.88×–3.27× request-rate improvement and 1.79×–2.12× P99 reduction) rest on the premise that the AOT graph engine and routing backend add negligible overhead relative to the DCP gains. No isolated latency breakdown (routing decision time, graph instantiation cost, or per-component communication volume) is reported that would allow attribution of the measured improvements specifically to DCP rather than to unstated factors or baseline differences.
  2. [System Design] Section describing the AOT graph engine: the claim that dynamic per-request context-parallel assignments can be bridged to static execution without prohibitive overhead is load-bearing for the entire contribution. The manuscript provides no quantitative characterization of recompilation or routing costs under varying request lengths and arrival patterns, leaving open whether these costs remain small when KV-cache placement and attention topology change on a per-request basis.
minor comments (3)
  1. [Abstract] Abstract: the experimental claims would be easier to assess if the abstract briefly stated the model sizes, number of GPUs, workload characteristics (context lengths, batch sizes), and the exact baselines against which the 1.88×–3.27× and 1.79×–2.12× numbers are measured.
  2. Notation: the phrase 'liquefies the KV cache' is informal; a precise description of how DCP redistributes KV occupancy and attention communication across instances would improve clarity.
  3. [Evaluation] Figure captions and tables: ensure that all reported speedups include error bars or confidence intervals and explicitly state the number of runs or statistical significance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below and have revised the manuscript to strengthen the evaluation and system characterization as suggested.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central performance claims (1.88×–3.27× request-rate improvement and 1.79×–2.12× P99 reduction) rest on the premise that the AOT graph engine and routing backend add negligible overhead relative to the DCP gains. No isolated latency breakdown (routing decision time, graph instantiation cost, or per-component communication volume) is reported that would allow attribution of the measured improvements specifically to DCP rather than to unstated factors or baseline differences.

    Authors: We agree that an isolated latency breakdown is necessary to rigorously attribute gains to DCP. In the revised manuscript we have added a new microbenchmark subsection (Section 5.4) with measurements of routing decision time (0.15–0.3 ms per request), AOT graph instantiation cost (0.4–1.2 ms depending on request length), and per-component communication volume. These results confirm that the combined overhead remains below 4 % of end-to-end latency and does not explain the reported speedups. revision: yes

  2. Referee: [System Design] Section describing the AOT graph engine: the claim that dynamic per-request context-parallel assignments can be bridged to static execution without prohibitive overhead is load-bearing for the entire contribution. The manuscript provides no quantitative characterization of recompilation or routing costs under varying request lengths and arrival patterns, leaving open whether these costs remain small when KV-cache placement and attention topology change on a per-request basis.

    Authors: We acknowledge the importance of quantifying these costs under realistic dynamics. The revised evaluation (Section 5.5) now includes experiments with request-length distributions drawn from production traces and Poisson arrivals at varying rates. The new data show average recompilation cost below 1.8 % of inference time and routing overhead that scales sub-linearly with context-parallel degree changes, remaining under 3 ms even at the highest measured dynamism. revision: yes

Circularity Check

0 steps flagged

No significant circularity: claims rest on experimental measurements, not derivations or self-referential definitions

full rationale

The paper presents a systems contribution for MoE serving via dynamic context parallelism (DCP), an AOT graph engine, and routing backend. All performance claims (1.88×–3.27× request rates, 1.79×–2.12× P99 latency reduction) are stated as outcomes of experimental evaluation under TPOT SLOs. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central mechanism (decoupling MoE communication from KV placement via per-request DCP) is described as an engineering design choice whose benefits are measured directly rather than derived by construction from prior results or inputs. This is a standard empirical systems paper; the derivation chain is absent, so no reduction to inputs occurs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on domain assumptions about latency scaling and introduces new system components without independent evidence outside the claimed experiments.

axioms (1)
  • domain assumption Attention latency scales with KV cache size while MoE communication latency scales with batch size.
    Invoked in the abstract to explain why binding requests to single instances produces stragglers.
invented entities (2)
  • Dynamic Context Parallelism (DCP) no independent evidence
    purpose: Assigns per-request context-parallel degree based on KV footprint to liquefy KV cache across the cluster.
    Core new mechanism introduced to achieve dual balance of KV occupancy and batch sizes.
  • Ahead-of-time (AOT) graph engine no independent evidence
    purpose: Bridges dynamic context parallelism with static execution.
    New component required to make the dynamic scheme practical.

pith-pipeline@v0.9.0 · 5875 in / 1407 out tokens · 34950 ms · 2026-05-21T01:47:43.311472+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 11 internal anchors

  1. [1]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 117–134.https://www.usenix...

  2. [2]

    2025.No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

    Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, and Esha Choukse. 2025.No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha. arXiv:2409.17264 [cs.LG] https://arxiv.org/abs/2409.17264

  3. [3]

    2026.State of AI: An Empirical 100 Trillion Token Study with OpenRouter

    Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. 2026.State of AI: An Empirical 100 Trillion Token Study with OpenRouter. arXiv:2601.10088 doi:10.48550/arXiv.2601.10088

  4. [4]

    2025.Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding

    Nidhi Bhatia, Ankit More, Ritika Borkar, Tiyasa Mitra, Ramon Matas, Ritchie Zhao, Maximilian Golub, Dheevatsa Mudigere, Brian Phar- ris, and Bita Darvish Rouhani. 2025.Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding. arXiv:2507.07120 doi:10.48550/arXiv.2507.07120

  5. [5]

    Chang Chen, Tiancheng Chen, Jiangfei Duan, Qianchao Zhu, Zerui Wang, Qinghao Hu, Peng Sun, Xiuhong Li, Chao Yang, and Torsten Hoefler. 2025. Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training.ArXiv preprintabs/2509.21841 (2025)

  6. [6]

    Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, and Chao Yang. 2022. TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training. In Advances in Neural Information Processing Systems 35: Annual Con- ference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, ...

  7. [7]

    Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, Yuening Zhu, Qingliang Ou, Jiaqi Liao, Xianglin Chen, Zhiyuan Ai, Yongwei Wu, and Mingxing Zhang. 2025. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In Proceedings of the ACM SIGOPS 31s...

  8. [8]

    Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. 2025. ShareGPT- 4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation. arXiv:2506.18095 [cs.CV]https://arxiv.org/abs/2506.18095

  9. [9]

    LMDeploy Contributors. 2023. LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM.https://github.com/InternLM/lmdeploy

  10. [10]

    Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, and Yifan Xiong. 2023. MSCCLang: Microsoft Collective Communication Language. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2(Vancouver, BC, Canada)(ASPLOS 2023). ACM, New York, NY, USA, 502–514. doi:10....

  11. [11]

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. 2023. Flash-Decoding for long-context inference. PyTorch Blog

  12. [12]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. 2024.DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]https://arxiv.org/abs/2412.19437

  13. [13]

    DeepSeek-AI. 2024. EPLB: Expert Parallelism Load Balancer.https: //github.com/deepseek-ai/EPLB

  14. [14]

    DeepSeek-AI. 2025. DeepGEMM: Clean and Efficient FP8 GEMM Kernels with Fine-Grained Scaling.https://github.com/deepseek-ai/ DeepGEMM. GitHub repository

  15. [15]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zi- fan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. 2025.SWE-Bench Pro: Can AI Agents Solve ...

  16. [16]

    William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch trans- formers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research23, 120 (2022), 1–39

  17. [17]

    Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. 2023. Megablocks: Efficient sparse training with mixture-of-experts.Pro- ceedings of Machine Learning and Systems5 (2023), 288–304

  18. [18]

    Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, and Bin Cui. 2024. Enabling parallelism hot switching for efficient training of large lan- guage models. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 178–194

  19. [19]

    Abhishek Ghosh, Ajay Nayak, Ashish Panwar, and Arkaprava Basu

  20. [20]

    arXiv:2503.19779 [cs.LG]https://arxiv.org/abs/2503.19779

    PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch. arXiv:2503.19779 [cs.LG]https://arxiv.org/abs/2503.19779

  21. [21]

    MoETuner: Optimized mixture of expert serving with balanced expert placement and token routing,

    Seokjin Go and Divya Mahajan. 2025.MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing. arXiv:2502.06643 [cs.LG]https://arxiv.org/abs/2502.06643

  22. [22]

    Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Ad- moni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, James Dinan, Xiaofan Li, Manjunath Gorentla Venkata, and Gil Bloch. 2026. NCCL EP: Towards a Unified Expert Parallel Communication A...

  23. [23]

    Shwai He, Weilin Cai, Jiayi Huang, and Ang Li. 2026. Capacity- Aware Inference: Mitigating the Straggler Effect in Mixture of Experts. arXiv:2503.05066 [cs.LG]https://arxiv.org/abs/2503.05066

  24. [24]

    Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. 2025. DEEPSERVE: serverless large language model serving at scale. In Proceedings of the 2025 USENIX Conferenc...

  25. [25]

    Changho Hwang, Peng Cheng, Roshan Dathathri, Abhinav Jangda, Saeed Maleki, Madan Musuvathi, Olli Saarikivi, Aashaka Shah, Ziyue Yang, Binyang Li, Caio Rocha, Qinghua Zhou, Mahdieh Ghazimir- saeed, Sreevatsa Anantharamu, and Jithin Jose. 2026. MSCCL++: Rethinking GPU Communication Abstractions for AI Inference. In Conference’17, July 2017, Washington, DC, ...

  26. [26]

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2024. System optimizations for enabling training of extreme long sequence transformer models. InProceedings of the 43rd ACM Symposium on Principles of Distributed Computing. 121–130

  27. [27]

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al . 2024. Mixtral of experts.ArXiv preprintabs/2401.04088 (2024)

  28. [28]

    Chenyu Jiang, Zhenkun Cai, Ye Tian, Zhen Jia, Yida Wang, and Chuan Wu. 2025. DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 221–236

  29. [29]

    Shengyu Liu Jiashi Li. 2025. FlashMLA: Efficient Multi-head Latent Attention Kernels.https://github.com/deepseek-ai/FlashMLA

  30. [30]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

  31. [31]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  32. [32]

    InProceedings of the 29th symposium on operating systems principles

    Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

  33. [33]

    Cong Li, Yuzhe Yang, Xuegui Zheng, Qifan Yang, Yijin Guan, Size Zheng, Li-Wen Chang, Shufan Liu, Xin Liu, and Guangyu Sun. 2025. Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism.ArXiv preprintabs/2511.06247 (2025)

  34. [34]

    2026.Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

    Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, and Pengfei Zheng. 2026.Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling. arXiv:2503.04398 [cs.LG]https://arxiv. org/abs/2503.04398

  35. [35]

    Nandor Licker, Kevin Hu, Vladimir Zaytsev, and Lequn Chen

  36. [36]

    fabric-lib: RDMA Point-to-Point Communication for LLM Systems

    RDMA Point-to-Point Communication for LLM Systems. arXiv:2510.27656 [cs.DC]https://arxiv.org/abs/2510.27656

  37. [37]

    Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, and Wei Lin. 2024. Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache.ArXiv preprintabs/2401.02669 (2024)

  38. [38]

    Hao Liu, Matei Zaharia, and Pieter Abbeel. 2024. RingAttention with Blockwise Transformers for Near-Infinite Context. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

  39. [39]

    2025.Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving

    Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, Peng Sun, Zhenhua Han, Tian Tang, Xiaohe Hu, Yanmin Jia, Yan Zhang, He Liu, Mingjun Zhang, Yiqi Zhang, Qiaoling Chen, Shenggan Cheng, Mingyu Gao, Yang You, and Siyuan Feng. 2025.Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving. arXiv:2509.17863 [cs.DC] https://arxiv.org/a...

  40. [40]

    Jinming Ma, Jiefei Chen, Xiuhong Li, Jiangfei Duan, Haojie Duanmu, Xingcheng Zhang, Chao Yang, and Dahua Lin. 2025. Tropical: Enhanc- ing SLO Attainment in Disaggregated LLM Serving via SLO-Aware Multiplexing. In2025 62nd ACM/IEEE Design Automation Conference (DAC). 1–7. doi:10.1109/DAC63849.2025.11132617

  41. [41]

    Ziming Mao, Yihan Zhang, Chihan Cui, Zhen Huang, Kaichao You, Zhongjie Chen, Zhiying Xu, Zhenyu Gu, Scott Shenker, Costin Raiciu, Yang Zhou, and Ion Stoica. 2026. UCCL-EP: Portable Expert-Parallel Communication. arXiv:2512.19849 [cs.DC]https://arxiv.org/abs/2512. 19849

  42. [42]

    Jordan, and Ion Stoica

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 561–57...

  43. [43]

    NVIDIA Corporation. 2026. NVIDIA NCCL Source Code.https:// github.com/NVIDIA/nccl

  44. [44]

    Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Mengdi Wu, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Colin Unger, and Zhihao Jia. 2025. FlexLLM: Token- Level Co-Serving of LLM Inference and Finetuning with SLO Guaran- tees. arXiv:2402.18789 [cs.DC]https://arxiv.org/abs/2402.18789

  45. [45]

    OpenAI. 2026. GPT-5.4 Model.https://developers.openai.com/api/ docs/models/gpt-5.4. Official model documentation for GPT-5.4, including snapshotgpt-5.4-2026-03-05; accessed 2026-04-16

  46. [46]

    OpenRouter. 2026. OpenRouter Rankings: Model Usage Statistics. https://openrouter.ai/rankings. Accessed: February 23, 2026

  47. [47]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132

  48. [48]

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. 2025. A New Era of Intelligence with Gemini 3.https://blog.google/products- and-platforms/products/gemini/gemini-3/. Google Blog, accessed 2026-04-16

  49. [49]

    PyTorch Foundation. 2023. Accelerating Generative AI with PyTorch II: GPT, Fast.https://pytorch.org/blog/accelerating-generative-ai-2/. Accessed: 2026-04-16

  50. [50]

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Moon- cake: Trading More Storage for Less Computation — A KVCache- centric Architecture for Serving LLM Chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25). USENIX Association, Santa Clara, CA, 155–170.https://w...

  51. [51]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  52. [52]

    Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yux- iong He. 2022. DeepSpeed-MoE: Advancing Mixture-of-Experts Infer- ence and Training to Power Next-Generation AI Scale. InInternational Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA (Proceedings o...

  53. [53]

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic scheduling for large lan- guage model serving. In18th USENIX symposium on operating systems design and implementation (OSDI 24). 173–191

  54. [54]

    Xinru Tang, Jingxiang Hou, Dingcheng Jiang, Taiquan Wei, Jiaxin Liu, Jinyi Deng, Huizheng Wang, Qize Yang, Haoran Shang, Chao Li, et al

  55. [55]

    MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference.ArXiv preprintabs/2510.25258 (2025)

  56. [56]

    Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, and Muhan Zhang. 2026. TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (USA)(ASPLOS ’26). Associati...

  57. [57]

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.ArXiv preprintabs/2403.05530 (2024)

  58. [58]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. 2026. Kimi K2. 5: Visual Agentic Intelligence.ArXiv preprintabs/2602.02276 (2026)

  59. [59]

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chen- zhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, ...

  60. [60]

    Kimi K2: Open Agentic Intelligence

    Kimi K2: Open Agentic Intelligence. arXiv:2507.20534 [cs.LG] https://arxiv.org/abs/2507.20534

  61. [61]

    StepFun Team. 2025. Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding. arXiv:2507.19427 [cs.LG]https: //arxiv.org/abs/2507.19427

  62. [62]

    2025.Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference

    Jian Tian, Shuailong Li, Yang Cao, Wenbo Cui, Minghan Zhu, Wenkang Wu, Jianming Zhang, Yanpeng Wang, Zhiwen Xiao, Zhenyu Hou, and Dou Shen. 2025.Staggered Batch Scheduling: Co-optimizing Time-to-First-Token and Throughput for High-Efficiency LLM Inference. arXiv:2512.16134 [cs.DC]https://arxiv.org/abs/2512.16134

  63. [63]

    Ilyas Turimbetov, Mohamed Wahib, and Didem Unat. 2025. A Device- Side Execution Model for Multi-GPU Task Graphs. InProceedings of the 39th ACM International Conference on Supercomputing (ICS ’25). Association for Computing Machinery, New York, NY, USA, 384–396. doi:10.1145/3721145.3730426

  64. [64]

    Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. 2025. Harnessing inter-gpu shared memory for seamless moe communication-computation fusion. InProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Pro- gramming. 170–182

  65. [65]

    Ne Wang, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Ruiting Zhou, and Bo Li. 2025. SP-MoE: Expediting Mixture-of-Experts Training with Optimized Pipelining Planning. InIEEE INFOCOM 2025-IEEE Conference on Computer Communications. IEEE, 1–10

  66. [66]

    Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xue- feng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. 2025. Flexsp: Accelerating large language model training via flexible sequence par- allelism. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Sys- tems, Volume 2...

  67. [67]

    Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. 2024. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 640–654

  68. [68]

    Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, and Xin Jin. 2025. TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving. ArXiv preprintabs/2508.17219 (2025)

  69. [69]

    Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, and Thai Le. 2025. ShareChat: A Dataset of Chatbot Conversations in the Wild.ArXiv preprintabs/2512.17843 (2025)

  70. [70]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report.ArXiv preprintabs/2505.09388 (2025)

  71. [71]

    Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, and Jianyu Huang. 2025. Context par- allelism for scalable million-token inference.Proceedings of Machine Learning and Systems7 (2025)

  72. [72]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Asso- ciation, Carlsbad, CA, 521–538.https://www.usenix.org/conference/ osdi22/presentation/yu

  73. [73]

    Xingkai Yu. [n. d.]. Nano-vLLM: A lightweight vLLM implementation built from scratch.https://github.com/GeeeekExplorer/nano-vllm. GitHub repository

  74. [74]

    Zhexiang Zhang, Ye Wang, Xiangyu Wang, Yumiao Zhao, Jingzhe Jiang, Qizhen Weng, Shaohuai Shi, Yin Chen, and Minchen Yu. 2025. Janus: Disaggregating Attention and Experts for Scalable MoE Inference. arXiv:2512.13525 [cs.DC]https://arxiv.org/abs/2512.13525

  75. [75]

    Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Huazuo Gao, Jiashi Li, Liyue Zhang, Panpan Huang, Shangyan Zhou, Shirong Ma, et al . 2025. Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures. InProceedings of the 52nd Annual International Symposium on Computer Architecture. 1731–1745

  76. [76]

    Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. 2025. DeepEP: an efficient expert-parallel communication library.https://github.com/ deepseek-ai/DeepEP

  77. [77]

    Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. WildChat: 1M ChatGPT Interaction Logs in the Wild. InThe Twelfth International Conference on Learning Representa- tions, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

  78. [78]

    Bojian Zheng, Cody Hao Yu, Jie Wang, Yaoyao Ding, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture(Toronto, ON, Canada)(MICRO ’23). Association for Computing Machinery, New Yo...

  79. [79]

    Xing, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. 2024. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, ...

  80. [80]

    Gonzalez, Clark W

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Efficient Execution of Structured Language Model Programs. InAd- vances in Neural Information Processing Systems 38: Annual Conference on Neural Information Process...

Showing first 80 references.