pith. sign in

arxiv: 2605.21312 · v1 · pith:Y5HJIO25new · submitted 2026-05-20 · 💻 cs.DC · cs.AI· cs.LG

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Pith reviewed 2026-05-21 03:43 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LG
keywords LLM serving simulationdisaggregated inferencediscrete-event simulatorperformance modelingGPU clusterinference optimizationstateful workloads
0
0 comments X

The pith

Frontier simulator models disaggregated LLM serving with under 4% throughput error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Frontier, a discrete-event simulator for modern LLM inference serving that handles disaggregated execution patterns, complex parallelism, runtime optimizations, and stateful workloads such as reasoning and RL rollouts. Existing simulators rely on monolithic-replica abstractions and average-case analytical proxies that produce high errors in latency and throughput predictions and can even reverse optimization conclusions. Frontier instead uses role-specific workers to model co-location, Prefill-Decode Disaggregation, and Attention-FFN Disaggregation while embedding optimizations like CUDA Graphs inside the scheduler-batch-engine loop. If the accuracy claims hold, designers could explore large configuration spaces for production systems without repeated hardware experiments.

Core claim

Frontier features a disaggregated abstraction that models co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers. It incorporates key runtime optimizations such as CUDA Graphs and speculative decoding within the scheduler-batch-engine loop and supports stateful requests for emerging workloads. It provides accurate and generalizable predictions of computation, communication, and memory costs. On a 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%, reducing end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation compared with state-of-the-art tools.}

What carries the argument

disaggregated abstraction that models co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) using role-specific cluster workers inside a discrete-event scheduler-batch-engine loop

If this is right

  • It scales to simulations of over 1K GPUs on commodity CPUs.
  • It enables SLA-dependent Pareto frontier exploration for serving configurations.
  • It supports validation of agentic reasoning scheduling.
  • It allows reconfiguration analysis for RL post-training.
  • It facilitates studies of heterogeneous disaggregated allocation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cost-model structure could be used to predict energy or power draw under the same disaggregated setups without new hardware runs.
  • Accuracy on the reported testbed suggests the simulator might support what-if studies for next-generation accelerators or network fabrics.
  • Production traces with bursty or multi-tenant traffic could serve as an independent check on whether the current cost models need refinement.

Load-bearing premise

The cost models for computation, communication, and memory generalize accurately to diverse workload compositions and serving scenarios beyond the specific testbed configurations used for validation.

What would settle it

Run Frontier predictions on a fresh hardware platform or workload mix (for example, a cluster with different GPU interconnects or a combined agentic-reasoning plus RL-rollout trace) and check whether throughput and latency errors remain below 4% and 7% respectively.

Figures

Figures reproduced from arXiv: 2605.21312 by Hong Xu, Xin Tan, Yangtao Deng, Yibo Zhu, Yicheng Feng, Yimin Jiang.

Figure 1
Figure 1. Figure 1: Measured vLLM TPOT with and without CUDA Graph under differ￾ent workloads (64 requests per work￾load, mean ISL/OSL, tested on 8×A800- SXM GPUs). Left: co-location. Right: PDD. Percentages show reduction. Mode ISL/OSL Padding Inflation Co-location 2048/256 7K 22.6% 256/2048 100K 38.7% 512/512 29K 45.8% 1024/1024 28K 22.6% PDD 2048/256 14K 42.6% 256/2048 111K 42.5% 512/512 37K 57.2% 1024/1024 52K 40.0% [PIT… view at source ↗
Figure 4
Figure 4. Figure 4: Fidelity gaps caused by simplified modeling. Left: Relying on coarse proxies (total token count) fails to capture batch het￾erogeneity, yielding coarse-grained performance estimates. Right: Analytical KV-cache modeling overestimates effective memory bud￾get, leading to a cascade of errors from admission control to over￾optimistic throughput projection. activation transfers become causal edges, and MoE EP i… view at source ↗
Figure 5
Figure 5. Figure 5: Decision drift from fidelity gaps on Llama-3.1-8B over 16 H800 GPUs (co-location). The simulator-selected best config￾uration lies inside the frozen SLA region, but the corresponding vLLM ground-truth point moves outside the SLA because overly optimistic comp-op and KV-cache budget predictions underesti￾mate request latency. Note that we maintain feature parity in this evaluation by disabling vLLM optimiza… view at source ↗
Figure 6
Figure 6. Figure 6: Frontier system architecture. a single domain. Whenever a role 𝑐 hosts both domains, the two sharding products must span the same device set: tpattn · dpattn = tpffn · epffn, 𝑐 ∈ {C, P, D}. (1) The per-replica world size of each cluster role is then 𝑊 𝑐 𝑅 = ( pp · tpattn · dpattn, 𝑐 ∈ {C, P, D, A}, pp · tpffn · epffn, 𝑐 = F, (2) where the two branches agree on C/P/D by Eq. 1. Each role 𝑐 instantiates 𝑁 𝑐 𝑅… view at source ↗
Figure 9
Figure 9. Figure 9: Decode throughput. +CG bars stack eager through￾put with the CUDA Graph gain; top labels show total +CG/eager speedup. Arch. Workload E-vLLM CG-vLLM ΔFrontier Co-loc. Prefill-heavy 294.8k 307.3k -6.9k (2.25%) Decode-heavy 294.8k 420.5k -10.7k (2.55%) SharedGPT 69.6k 84.2k +0.3k (0.40%) Disagg. Prefill-heavy 147.5k 150.1k -1.2k (0.80%) Decode-heavy 147.5k 167.9k +0.6k (0.37%) SharedGPT 34.9k 42.9k -30 (0.07… view at source ↗
Figure 8
Figure 8. Figure 8: Available KV-cache blocks over time under co￾location (Qwen3-30B MoE, SharedGPT trace), with a max gap of 294 blocks (115.6 MB; ΔMB = 0.393Δ𝐵, where Δ𝐵 is the block gap). Mode Parallel vLLM ΔFrontier ΔAnalytical Co-loc. (1,8,1,8) 31k +7 (0.02%) +4.4k (14.10%) (4,2,1,2) 58.0k +1.0k (1.76%) +12.5k (21.38%) Disagg. (2,2,2,4) 27.0k +0.5k (1.89%) +7.6k (27.95%) (1,4,1,4) 12.0k -1 (0.01%) +4.6k (39.73%) [PITH_F… view at source ↗
Figure 11
Figure 11. Figure 11: End-to-end fidelity on a 16-card H800 testbed for co-location and PDD across prefill-heavy, decode-heavy, balanced, and SharedGPT workloads. Panels report TTFT, TPOT, throughput, and E2E makespan for both Llama3.1-8B (dense) and Qwen3-30B MoE; dashed bars indicate simulators that do not support the serving architecture or model family. All data are normalized against the ground truth (vLLM), represented b… view at source ↗
Figure 12
Figure 12. Figure 12: AFD fidelity on Step3-316B (16 H800 GPUs) against the ground truth across prefill-heavy, decode-heavy, and balanced workloads. AFD-TP and AFD-EP report throughput (decode toks/s). of per-iteration cost, KV-cache block capacity, and admission￾watermark preemption intact; small per-operator errors bounded in §5.1 therefore do not amplify through the feedback path into the qualitatively different regimes tha… view at source ↗
Figure 14
Figure 14. Figure 14: Heterogeneous Qwen3-235B-A22B [12] allocation ex￾poses which PDD and AFD role assignments convert hardware discounts into cost efficiency. (i.e., 50 toks/s/user), decode batches shrink and throughput is capped. PDD removes that interference by giving prefill and decode separate clusters, which makes it strong when TTFT is loose and most GPUs can be kept on decode. AFD further separates decode-attention fr… view at source ↗
Figure 15
Figure 15. Figure 15: Phase-aware scheduling for multi-round reasoning. Scenario. Multi-round agentic reasoning workloads no longer behave like a single prompt followed by a single decode stream. Taking a coding agent such as Claude Code [1] as an example, each request progresses through two phases: internal planning and final response generation. During plan￾ning, the agent runs multiple thinking rounds marked by <thinking> t… view at source ↗
Figure 17
Figure 17. Figure 17: H20 operator fidelity CDFs. Curves show Frontier abso￾lute percentage error for attention, linear ops, and MoE on the H20 BF16 and FP8. Prefill-heavy Decode-heavy Hybrid/mixed SharedGPT Workload 0 2 4 6 8 10 Mean relative error (%) H20 Dense 3.6% 6.5% 4.5% 4.4% 2.8% 6.0% 4.0% 5.1% Co-location PDD Prefill-heavy Decode-heavy Hybrid/mixed SharedGPT Workload H20 MoE 3.8% 0.5% 0.9% 2.7% 5.8% 7.0% 0.9% 3.1% [P… view at source ↗
Figure 19
Figure 19. Figure 19: Example of three serving architectures in Frontier. A.3 PDD and AFD Scheduling Workflow Frontier operates as a discrete-event simulator (DES). This subsection expands the control and execution-plane mech￾anisms into a concrete workflow-level algorithmic descrip￾tion. We show the example of three serving architectures of Frontier in [PITH_FULL_IMAGE:figures/full_fig_p016_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Skip-join MLFQ on the agentic trace: p95 aTTFT im￾proves marginally and hidden planning throughput regresses. Blue￾bar annotations show the relative change over vLLM. (matching the <thinking> blocks in the scenario descrip￾tion) followed by one answer-visible round. Round 𝑟 ′ con￾tributes ℓ𝑟,𝑟′ new prompt tokens (after same-request prefix reuse) and 𝑜𝑟,𝑟′ decode tokens. Following the main definition, aTTF… view at source ↗
Figure 21
Figure 21. Figure 21: Frontier online SharedGPT qps64 scheduler comparison. Panel (a) reports normalized macro metrics, and the bar labels show multipliers relative to vLLM. Panels (b)–(d) retain the micro￾scheduling view of batch size, backlog, and prefill/decode mixing over time (s). setting was selected because it sustains a saturated queue￾ing regime while still allowing both schedulers to complete cleanly [PITH_FULL_IMAG… view at source ↗
read the original abstract

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Frontier, a discrete-event simulator for modern LLM inference serving. It introduces disaggregated abstractions for co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD), models role-specific workers, incorporates runtime optimizations such as CUDA Graphs and speculative decoding in the scheduler-batch-engine loop, and supports stateful requests. On a 16-H800 GPU testbed, it reports average throughput error below 4%, reducing end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation relative to prior simulators. It claims scalability to over 1K GPUs and enables new use cases including SLA-dependent Pareto exploration and RL post-training reconfiguration.

Significance. If the reported accuracy holds with proper separation of calibration and validation data, Frontier would be a useful tool for design-space exploration in disaggregated and stateful LLM serving systems, addressing gaps in monolithic abstractions of existing simulators. The concrete hardware error metrics and explicit support for PDD/AFD are positive features. The significance is tempered by the need to confirm that the cost models for computation, communication, and memory are predictive rather than fitted to the reported testbed runs.

major comments (2)
  1. [§5] §5 (Evaluation): The central accuracy claims (throughput error <4%, latency reductions to 6.4% and 2.6%) rest on cost models whose derivation is not explicitly separated from the validation traces on the 16-H800 testbed. Without held-out workloads, cross-validation, or independent calibration data, the low errors risk measuring fit quality rather than generalization, directly affecting the claim of 'accurate and generalizable predictions across diverse serving scenarios.'
  2. [§4.2] §4.2 (Cost Models): The models for computation, communication, and memory are presented as generalizable, yet the manuscript provides no explicit parameter counts, fitting procedure, or sensitivity analysis showing independence from the specific 16-H800 configurations used for the error metrics. This is load-bearing for the disaggregation and optimization claims.
minor comments (2)
  1. [Abstract] The abstract states scalability to over 1K GPUs on commodity CPUs, but the main text should include concrete simulation runtime or memory usage figures for that scale to support the claim.
  2. [Figures/Tables] Figure captions and table legends should clarify whether error bars represent standard deviation across multiple runs or workload variations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments on evaluation methodology and cost model transparency are well-taken and point to areas where additional clarity will strengthen the manuscript. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§5] §5 (Evaluation): The central accuracy claims (throughput error <4%, latency reductions to 6.4% and 2.6%) rest on cost models whose derivation is not explicitly separated from the validation traces on the 16-H800 testbed. Without held-out workloads, cross-validation, or independent calibration data, the low errors risk measuring fit quality rather than generalization, directly affecting the claim of 'accurate and generalizable predictions across diverse serving scenarios.'

    Authors: We agree that explicit separation of derivation from validation is essential to support generalization claims. The cost models combine analytical formulations (FLOPs, bandwidth, memory access patterns) with micro-benchmark measurements collected on smaller-scale hardware prior to the 16-H800 end-to-end runs; the latter serve strictly as validation. To address the concern directly, we will revise §5 to document this separation, add held-out workload results, and include a brief cross-validation summary. These additions will better substantiate the reported accuracy figures as predictive rather than fitted. revision: yes

  2. Referee: [§4.2] §4.2 (Cost Models): The models for computation, communication, and memory are presented as generalizable, yet the manuscript provides no explicit parameter counts, fitting procedure, or sensitivity analysis showing independence from the specific 16-H800 configurations used for the error metrics. This is load-bearing for the disaggregation and optimization claims.

    Authors: We thank the referee for highlighting this gap in presentation. The models are constructed from hardware-derived analytical expressions supplemented by limited empirical calibration on non-overlapping small-scale traces. We will expand §4.2 (and add an appendix if space requires) with explicit parameter counts, the precise fitting procedure, and a sensitivity analysis demonstrating robustness across GPU counts and configurations. This revision will make the independence from the 16-H800 testbed explicit and reinforce the generalizability needed for the disaggregation claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity: accuracy claims rest on external hardware measurements

full rationale

The paper's central results are empirical error metrics (throughput <4%, latency reductions to 6.4%/2.6%) obtained by running the simulator against direct measurements on a physical 16-H800 GPU testbed. These comparisons use held-out execution traces rather than internal equations or parameters fitted to the same validation runs. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation of the cost models or disaggregation abstractions; the simulator's fidelity is presented as an external benchmark outcome, not a closed loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The contribution rests on standard discrete-event modeling assumptions rather than new free parameters or invented physical entities; the simulator itself is the engineered artifact.

axioms (1)
  • domain assumption Discrete-event simulation with role-specific workers can faithfully capture dynamics of co-location, PDD, and AFD in LLM serving
    Invoked as the foundation for modeling scheduler-batch-engine loop and cost predictions.

pith-pipeline@v0.9.0 · 5832 in / 1267 out tokens · 35668 ms · 2026-05-21T03:43:00.953786+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1]

    claude code

    2025. claude code. Website.https://code.claude.com/

  2. [2]

    cloudgpu

    2025. cloudgpu. Website.https://cloudgpu.app/

  3. [3]

    2025. dynamo. Website.https://www.nvidia.com/en-us/ai/dynamo/

  4. [4]

    htsim Network Simulator

    2025. htsim Network Simulator. Website.https://github.com/ Broadcom/csg-htsim

  5. [5]

    huggingface

    2025. huggingface. Website.https://huggingface.co/

  6. [6]

    Llama-3.1-405B-FP8

    2025. Llama-3.1-405B-FP8. Website.https://huggingface.co/meta- llama/Llama-3.1-405B-FP8

  7. [7]

    Llama-3.1-8B

    2025. Llama-3.1-8B. Website.https://huggingface.co/meta-llama/ Llama-3.1-8B

  8. [8]

    Llama-3.3-70B-Instruct

    2025. Llama-3.3-70B-Instruct. Website.https://huggingface.co/meta- llama/Llama-3.3-70B-Instruct

  9. [9]

    NCCL workspace buffer

    2025. NCCL workspace buffer. Website.https://docs.nvidia.com/ deeplearning/nccl/user-guide/docs/usage/bufferreg.html

  10. [10]

    Nvidia CUDA Graph

    2025. Nvidia CUDA Graph. Website.https://docs.nvidia.com/cuda/ cuda-programming-guide/04-special-topics/cuda-graphs.html

  11. [11]

    Nvidia TensorRT-LLM

    2025. Nvidia TensorRT-LLM. Website.https://github.com/NVIDIA/ TensorRT-LLM

  12. [12]

    Qwen3-235B-A22B

    2025. Qwen3-235B-A22B. Website.https://huggingface.co/Qwen/ Qwen3-235B-A22B

  13. [13]

    Qwen3-30B-A3B

    2025. Qwen3-30B-A3B. Website.https://huggingface.co/Qwen/ Qwen3-30B-A3B

  14. [14]

    sglang admission

    2025. sglang admission. Website.https://github.com/sgl-project/ sglang/blob/main/docs/advanced_features/server_arguments.md

  15. [15]

    SharedGPT trace

    2025. SharedGPT trace. Website.https://docs.vllm.ai/en/v0.12.0/ benchmarking/cli/

  16. [16]

    vllm watermark

    2025. vllm watermark. Website.https://docs.vllm.ai/en/v0.9.0/api/ vllm/core/block_manager.html

  17. [17]

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tu- manov. 2024. Vidur: A large-scale simulation framework for llm inference.Proceedings of Machine Learning and Systems6 (2024), 351– 366

  18. [18]

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ram- jee. 2024. Taming {Throughput-Latency} tradeoff in {LLM} inference with {Sarathi-Serve}. In18th USENIX Symposium on Operating Sys- tems Design and Implementation (OSDI 24). 117–134

  19. [19]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Effi- cient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369(2023)

  20. [20]

    Amey Agrawal, Mayank Yadav, Sukrit Kumar, Anirudha Agrawal, Garv Ghai, Souradeep Bera, Elton Pinto, Sirish Gambhira, Mohammad Adain, Kasra Sohrab, Chus Antonanzas, and Alexey Tumanov. 2026. Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving. arXiv:2601.00397 [cs.DC]https://arxiv.org/abs/2601.00397

  21. [21]

    Jaehong Cho, Hyunmin Choi, Guseul Heo, and Jongse Park. 2026. LLM- ServingSim 2.0: A Unified Simulator for Heterogeneous and Disag- gregated LLM Serving Infrastructure.arXiv preprint arXiv:2602.23036 (2026)

  22. [22]

    Fernando J Corbató, Marjorie Merwin-Daggett, and Robert C Daley

  23. [23]

    InProceedings of the May 1-3, 1962, spring joint computer conference

    An experimental time-sharing system. InProceedings of the May 1-3, 1962, spring joint computer conference. 335–344

  24. [24]

    Tri Dao. 2024. Flashattention-2: Faster attention with better paral- lelism and work partitioning. InInternational Conference on Learning Representations, Vol. 2024. 35549–35562. 13

  25. [25]

    Jiangfei Duan, Xiuhong Li, Ping Xu, Xingcheng Zhang, Shengen Yan, Yun Liang, and Dahua Lin. 2024. Proteus: Simulating the performance of distributed DNN training.IEEE Transactions on parallel and dis- tributed systems35, 10 (2024), 1867–1878

  26. [26]

    Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, and Hong Xu

  27. [27]

    Echo: Simulating distributed training at scale.arXiv preprint arXiv:2412.12487(2024)

  28. [28]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforce- ment learning.arXiv preprint arXiv:2501.12948(2025)

  29. [29]

    Mahmoud Khairy, Zhesheng Shen, Tor M Aamodt, and Timothy G Rogers. 2020. Accel-sim: An extensible simulation framework for validated gpu modeling. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 473–486

  30. [30]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  31. [31]

    InProceedings of the 29th symposium on operating systems principles

    Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles. 611–626

  32. [32]

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668 (2020)

  33. [33]

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.LG]https://arxiv.org/abs/2211.17192

  34. [34]

    Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating distributed {MoE} training and inference with lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). 945–959

  35. [35]

    Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Par- avecino. 2024. APEX: An extensible and dynamism-aware simula- tor for automated parallel execution in LLM serving.arXiv preprint arXiv:2411.17651(2024)

  36. [36]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al . 2024. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437(2024)

  37. [37]

    Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, and Bin Cui. 2022. Galvatron: Efficient transformer training over multiple gpus using automatic parallelism.arXiv preprint arXiv:2211.13878(2022)

  38. [38]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. InProceedings of the 27th ACM symposium on operating sys- tems principles. 1–15

  39. [39]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library.Advances in neural informa- tion processing systems32 (2019)

  40. [40]

    Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. Scale-sim: Systolic cnn accelerator simula- tor.arXiv preprint arXiv:1811.02883(2018)

  41. [41]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053(2019)

  42. [42]

    Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, and Abhinav Bhatele. 2023. A hybrid tensor- expert-data parallelism approach to optimize mixture-of-experts train- ing. InProceedings of the 37th International Conference on Supercom- puting. 203–214

  43. [43]

    StepFun, :, Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Chang...

  44. [44]

    Xin Tan, Yicheng Feng, Yu Zhou, Yimin Jiang, Yibo Zhu, and Hong Xu. 2026. OrchestrRL: Dynamic Compute and Network Orchestration for Disaggregated RL.arXiv preprint arXiv:2601.01209(2026)

  45. [45]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents.Frontiers of Computer Science18, 6 (2024), 186345

  46. [46]

    William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Su- darshan Srinivasan, and Tushar Krishna. 2023. Astra-sim2. 0: Model- ing hierarchical networks and disaggregated systems for large-model training at scale. In2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 283–294

  47. [47]

    Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. 2023. Fast distributed inference serving for large language models.arXiv preprint arXiv:2305.05920(2023)

  48. [48]

    Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, Qin Zhou, Xumeng Chen, Ilya Sher- styuk, Haorui Li, Rishi Thakkar, Ben Hamm, Yuanzhe Li, Xue Huang, Wenpeng Wu, Anish Shanbhag, Harry Kim, Chuan Chen, and Junjie 14 Lai. 2026. AIConfigurator: Lightning-Fast Configuration Optimiza- tion for Multi-Framework LLM S...

  49. [49]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629(2022)

  50. [50]

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. 2025. Flashinfer: Efficient and customizable attention engine for llm inference serving.Proceedings of Machine Learning and Systems7 (2025)

  51. [51]

    Zili Zhang, Yinmin Zhong, Chengxu Yang, Chao Jin, Bingyang Wu, Xinming Wei, Yuliang Liu, and Xin Jin. 2026. Heddle: A Dis- tributed Orchestration System for Agentic RL Rollout.arXiv preprint arXiv:2603.28101(2026)

  52. [52]

    Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. 2022. Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578

  53. [53]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Livia Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. 2024. Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems37 (2024), 62557–62583

  54. [54]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xu- anzhe Liu, Xin Jin, and Hao Zhang. 2024. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210

  55. [55]

    In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pp

    Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, et al . 2025. MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism.arXiv preprint arXiv:2504.02263(2025). 15 0 10 20 30 40 Frontier APE (%) 0.0 0.2 0.4 0.6 0.8 1.0CDF p90 H20 BF16 Attention...