ASAP: A Disaggregated and Asynchronous Inference System for MoE Prefill

Han Li; Lele Li; Ming Yan; Qiang Hu; Shuang Chen; Weiwei Chen; Xin Ye; Zhibin Yu

arxiv: 2606.22541 · v1 · pith:F5CLQJJSnew · submitted 2026-06-21 · 💻 cs.DC

ASAP: A Disaggregated and Asynchronous Inference System for MoE Prefill

Weiwei Chen , Shuang Chen , Lele Li , Qiang Hu , Han Li , Xin Ye , Ming Yan , Zhibin Yu This is my paper

Pith reviewed 2026-06-26 09:40 UTC · model grok-4.3

classification 💻 cs.DC

keywords Mixture-of-Experts servingprefill optimizationasynchronous inferencedisaggregated executionexpert parallelismdata parallelismsynchronization barriersSLO throughput

0 comments

The pith

ASAP disaggregates attention and MoE stages into an asynchronous pipeline to remove global synchronization barriers during prefill.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ASAP as a system to accelerate the prefill phase of Mixture-of-Experts models under online serving conditions. Hybrid parallelism currently pairs data-parallel attention stages with expert-parallel MoE stages, but variance in request arrivals and lengths creates DP imbalance that forces global synchronization stalls. These stalls degrade time-to-first-token and overall throughput. ASAP separates the stages and replaces the barriers with a fully asynchronous execution pipeline built from specialized communication primitives plus four coordinated optimizations in scheduling and execution. On CloudMatrix384 hardware the design raises SLO-compliant prefill throughput by 90 percent relative to prior synchronous systems.

Core claim

Disaggregating attention DP groups from expert-parallel MoE stages and replacing their global synchronization barriers with an asynchronous pipeline, achieved through specialized async communication primitives and four coordinated optimizations in request scheduling and model execution, removes the stalls that arise from request variance in online MoE serving.

What carries the argument

The suite of specialized asynchronous communication primitives together with four coordinated optimizations across request scheduling and model execution that enable the disaggregated asynchronous pipeline between attention and MoE stages.

If this is right

SLO-compliant prefill throughput rises by 90 percent on the evaluated hardware.
Time-to-first-token improves because request-length variance no longer forces global stalls.
The hybrid parallelism strategy can be retained without paying the previous synchronization cost.
Online serving of large MoE models becomes feasible at higher request rates without additional hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disaggregation pattern could be applied to the decode phase if similar stage imbalance appears there.
Clusters with heterogeneous interconnects might see larger gains because the async design reduces the frequency of global barriers.
Request schedulers in other hybrid-parallel systems could adopt the four optimizations independently of the communication primitives.

Load-bearing premise

The reduction in synchronization stalls will not be offset by new communication or scheduling overheads introduced by the asynchronous design.

What would settle it

A direct measurement, under the same request traces, of end-to-end prefill latency and throughput when the async primitives are replaced by their synchronous equivalents while keeping all other scheduling changes fixed.

Figures

Figures reproduced from arXiv: 2606.22541 by Han Li, Lele Li, Ming Yan, Qiang Hu, Shuang Chen, Weiwei Chen, Xin Ye, Zhibin Yu.

**Figure 1.** Figure 1: Comparison between current synchronous and ideal asynchronous execution under heterogeneous request batches. Synchronous execution (Figure 1a) requires all request batches (launched on different attention DP groups) to synchronize before the MoE computation. An ideal asynchronous execution pipeline (Figure 1b) eliminates the barrier, and allows faster request batches to progress at their own pace. super-l… view at source ↗

**Figure 4.** Figure 4: Impact of batch size on attention latency under a fixed total sequence length of 32k tokens. 0 25K 50K 75K 100K Sequence Length 0.0 0.5 1.0 1.5 2.0 PDF ×10 4 PDF of Prompt Length CDF of Prompt Length 0.00 0.25 0.50 0.75 1.00 CDF Min: 31 Max: ~100K Avg: ~5K [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 3.** Figure 3: Latency scaling with increasing sequence length. 2.1 LLM Inference Preliminaries Prefill and Decode Phases. In LLM serving, a user request first undergoes the prefill phase to generate the initial token, followed by the decode phase to generate subsequent tokens autoregressively [8, 51]. Inference latency is primarily characterized by Time-to-First-Token (TTFT) for prefill and Time-Per-Output-Token (TPOT)… view at source ↗

**Figure 6.** Figure 6: ASAP Overview. Huawei Cloud 1 traces. While the mean prompt length sits at 5k tokens, the distribution exhibits a massive variance— ranging from a mere 31 tokens to a heavy tail extending up to 100k tokens. As established in Section 2.2.1, excluding the impact of prefix cache, simply matching the total token count across attention DP groups is still fundamentally inadequate. Perfect load balancing would re… view at source ↗

**Figure 7.** Figure 7: Asynchronous communication primitives and the shared buffer design. The buffer structure on each device is detailed in [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Dual-batch interleaving and the triple-stream design for communication-computation overlapping. For each attention DP group, two request batches are co-scheduled to interleave their layerwise attention processing together. All devices employ a triplestream design to further overlap asynchronous communication primitives with the main active computation kernels. As illustrated in [PITH_FULL_IMAGE:figures… view at source ↗

**Figure 10.** Figure 10: The layer-oblivious design of the MoE Super Kernel allows ahead-of-time dispatching, eliminating device idling between kernels. • MoE-limited cases: Conversely, if attention latency of the interleaved batch is shorter than MoE latency, then idling of attention devices will persist. However, with lengthaware batching, attention execution typically surpasses several milliseconds, comfortably overlapping wi… view at source ↗

**Figure 11.** Figure 11: Hardware architecture of the CloudMatrix384 supernode. The supernode comprises 48 nodes, each integrating 8 NPUs. All 384 NPUs are interconnected via a two-level hierarchical UB plane, providing 400 GB/s bandwidth between any NPU pair. 3.5 Putting It All Together Upon arrival, user requests are aggregated via length-aware batching to ensure optimal MoE token density. Once a pair of batches is formed, they… view at source ↗

**Figure 14.** Figure 14: Communication latency with increasing token count. 0.5K 1K 2K 4K 6K 8K 10K 12K 14K 16K 18K 20K 22K 24K 26K 28K 30K 32K Sequence Length 0 1,000 2,000 3,000 4,000 5,000 Latency(ms) Default TTFT Synchronization delay Request queuing delay Aeolus TTFT Aeolus Non-kernel delay Kernel time [PITH_FULL_IMAGE:figures/full_fig_p010_14.png] view at source ↗

**Figure 15.** Figure 15: Latency decomposition at QPS=4. The left and right bars at each sequence length show mean TTFT under Default and ASAP, respectively. For Default, TTFT is decomposed into kernel time, request queuing delay, and synchronization waiting delay. For ASAP, TTFT is decomposed into kernel and non-kernel delay. sorb heterogeneous workloads that traditionally trigger severe synchronization stalls. 5.3 Latency Deco… view at source ↗

**Figure 17.** Figure 17: Comparison of mean TTFT with/without communication-computation overlapping. 0 2 4 6 8 10 12 14 16 18 20 RPS 0 2,000 4,000 Mean Latency (ms) W/O MoE Super Kernel W MoE Super Kernel 2 4 8 300 700 [PITH_FULL_IMAGE:figures/full_fig_p011_17.png] view at source ↗

read the original abstract

Mixture-of-Experts (MoE) models have become the de facto standard for scaling large language models. To maintain computational efficiency, modern MoE serving systems typically employ a hybrid parallelism strategy, combining Data Parallelism (DP) for attention stages with Expert Parallelism (EP) for MoE stages. However, this design necessitates frequent global synchronization barriers between attention DP groups and experts. In online serving, significant variance in request arrival rates and sequence lengths inherently leads to DP imbalance, causing severe synchronization stalls that degrade Time-to-First-Token (TTFT) and system throughput. We present ASAP, an asynchronous inference system specifically designed to accelerate the prefill phase of MoE models. ASAP disaggregates the attention and MoE stages and implements a fully asynchronous execution pipeline. This is achieved through a suite of specialized asynchronous communication primitives and four coordinated optimizations across request scheduling and model execution, which collectively dismantle global synchronization barriers. We implement and evaluate ASAP on CloudMatrix384 super-nodes, demonstrating that it improves SLO-compliant prefill throughput by 90% compared to state-of-the-art synchronous serving solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ASAP claims a 90% prefill throughput lift for MoE via disaggregation and async primitives, but the abstract gives no workload details or overhead breakdown so the net gain is unproven.

read the letter

The main point is that ASAP disaggregates attention DP groups from expert-parallel MoE stages and replaces global sync barriers with async pipelines and four scheduling optimizations, claiming 90% better SLO-compliant prefill throughput on CloudMatrix384 hardware.

What is new is the specific tailoring of async communication primitives to MoE prefill imbalance caused by request variance and length differences. The paper correctly flags how hybrid DP-EP setups create stalls in online serving, and the disaggregation approach is a direct response to that.

The work is clear on the problem and on the high-level architecture. It treats a real production constraint in current MoE systems without overclaiming theoretical novelty.

The soft spot is the evaluation. The 90% figure appears only in the abstract with no description of workloads, baselines, or any split between time saved from removed syncs and time added by new queuing, data movement, or routing coordination. The stress-test concern lands: if the async primitives cost more than the stalls they remove, the reported gain shrinks or disappears. Without that breakdown the central claim cannot be assessed.

This is for systems researchers and engineers who run MoE inference at scale and need concrete ideas for reducing TTFT variance. A reader already familiar with disaggregated serving will see the incremental engineering steps but will still want the missing measurement details.

The paper deserves a serious referee because the problem is practical and the proposed fix is coherent, even if the current evidence is incomplete. I would send it to review with instructions to supply full experimental methodology and overhead accounting.

Referee Report

2 major / 1 minor

Summary. The paper claims that hybrid DP-EP parallelism in MoE serving creates global synchronization barriers due to request variance, degrading TTFT and throughput; ASAP addresses this by disaggregating attention and MoE stages into a fully asynchronous pipeline using specialized communication primitives plus four coordinated optimizations in scheduling and execution, yielding a 90% gain in SLO-compliant prefill throughput versus state-of-the-art synchronous systems when evaluated on CloudMatrix384 super-nodes.

Significance. If the reported throughput improvement is substantiated with complete methodology and overhead accounting, the work would be a meaningful systems contribution to MoE inference, directly targeting a practical bottleneck in online serving of large expert-parallel models and demonstrating the viability of disaggregation plus asynchrony for prefill acceleration.

major comments (2)

[Abstract] Abstract: the central 90% SLO-compliant prefill throughput claim is load-bearing yet presented with no description of workload (request arrival rates, sequence-length variance), MoE model size, exact synchronous baselines, SLO definition, or measurement methodology, preventing assessment of whether the async primitives deliver a net gain over removed sync stalls.
[Evaluation] Evaluation: the manuscript provides no quantitative breakdown (e.g., time saved from barrier removal versus added queuing, communication, or scheduling overhead from the new primitives and disaggregation) under the high-variance conditions identified as the original problem; without this, the assumption that the four optimizations produce strictly lower net cost remains unverified.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly enumerated the four coordinated optimizations rather than referring to them only generically.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of major revision. We address each major comment below and outline the corresponding revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the central 90% SLO-compliant prefill throughput claim is load-bearing yet presented with no description of workload (request arrival rates, sequence-length variance), MoE model size, exact synchronous baselines, SLO definition, or measurement methodology, preventing assessment of whether the async primitives deliver a net gain over removed sync stalls.

Authors: The abstract is written to be concise. Workload parameters (arrival rates, sequence-length distributions), model sizes, baseline configurations, SLO definitions, and measurement methodology are described in the Evaluation section. To improve standalone readability of the abstract, we will add a brief clause summarizing the workload and evaluation setup. revision: yes
Referee: [Evaluation] Evaluation: the manuscript provides no quantitative breakdown (e.g., time saved from barrier removal versus added queuing, communication, or scheduling overhead from the new primitives and disaggregation) under the high-variance conditions identified as the original problem; without this, the assumption that the four optimizations produce strictly lower net cost remains unverified.

Authors: We agree that an explicit decomposition of net benefit (barrier removal savings versus added queuing, communication, and scheduling costs) under high request variance would strengthen the evaluation. We will add this breakdown, including per-component timing measurements on the same high-variance traces used for the main results. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical systems evaluation with direct measurements

full rationale

The paper describes an implementation of ASAP with disaggregation and asynchronous primitives for MoE prefill, then reports measured throughput gains (90% SLO-compliant prefill throughput) on CloudMatrix384 hardware versus synchronous baselines. No equations, fitted parameters, derivations, or self-citations appear in the provided text; the central claim rests on external benchmarking rather than any reduction of outputs to inputs by construction. This is a standard empirical systems result with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described beyond the high-level system components.

pith-pipeline@v0.9.1-grok · 5741 in / 1058 out tokens · 18378 ms · 2026-06-26T09:40:47.138236+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 7 linked inside Pith

[1]

https://www.mindspore.cn/tutorials/experts /en/r2.3.1/operation/op_custom_ascendc.htm l, 2025

Ascend C Custom Operator Development and Usage. https://www.mindspore.cn/tutorials/experts /en/r2.3.1/operation/op_custom_ascendc.htm l, 2025

2025
[2]

https://www.hiascend.com/cann, 2025

Compute Architecture for Neural Networks (CANN). https://www.hiascend.com/cann, 2025

2025
[3]

https://www.hiasce nd.com/document/detail/zh/canncommercial/8 3RC1/API/ascendcopapi/atlasascendc_api_07_ 0102.html, 2025

Data Movement in Ascend C. https://www.hiasce nd.com/document/detail/zh/canncommercial/8 3RC1/API/ascendcopapi/atlasascendc_api_07_ 0102.html, 2025

2025
[4]

https://www

Memory Management of Ascend NPU. https://www. hiascend.com/document/detail/zh/canncommer cial/83RC1/API/appdevgapi/aclpythondevg_01 _0110.html, 2025

2025
[5]

Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachan- dran Ramjee. Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

Pith/arXiv arXiv 2023
[6]

Striped attention: Faster ring attention for causal transformers.arXiv preprint arXiv:2311.09431, 2023

William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers.arXiv preprint arXiv:2311.09431, 2023

arXiv 2023
[7]

Characterizing cloud-native llm inference at bytedance and exposing optimization challenges and opportunities for future ai accelerators

Jingwei Cai, Dehao Kong, Hantao Huang, Zishan Jiang, Zixuan Ma, Qingyu Guo, Zhenxing Zhang, Guiming Shi, Mingyu Gao, and Kaisheng Ma. Characterizing cloud-native llm inference at bytedance and exposing optimization challenges and opportunities for future ai accelerators. In2026 IEEE International Symposium on High Performance Computer Architecture (HPCA),...

2026
[8]

Amali: An analytical model for accurately modeling llm inference on modern gpus

Shiheng Cao, Junmin Wu, Junshi Chen, Hong An, and Zhibin Yu. Amali: An analytical model for accurately modeling llm inference on modern gpus. InProceedings of the 52nd Annual International Symposium on Com- puter Architecture, ISCA ’25, page 1495–1508, New York, NY , USA, 2025. Association for Computing Ma- chinery

2025
[9]

Generating long sequences with sparse trans- formers.arXiv preprint arXiv:1904.10509, 2019

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse trans- formers.arXiv preprint arXiv:1904.10509, 2019. 12

Pith/arXiv arXiv 1904
[10]

Lazy batching: An sla-aware batching system for cloud ma- chine learning inference

Yujeong Choi, Yunseong Kim, and Minsoo Rhu. Lazy batching: An sla-aware batching system for cloud ma- chine learning inference. In2021 IEEE International Symposium on High-Performance Computer Architec- ture (HPCA), pages 493–506, 2021

2021
[11]

Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing, 2024

Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, and Tong Yang. Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing, 2024

2024
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.arXiv preprint arXiv:2205.14135, 2022

Pith/arXiv arXiv 2022
[13]

Deepseek-v3 technical report, 2025

DeepSeek-AI, Aixin Liu, Bei Feng, and etc Bing Xue. Deepseek-v3 technical report, 2025

2025
[14]

Longnet: Scaling transformers to 1,000,000,000 tokens, 2023

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens, 2023

2023
[15]

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

arXiv 2022
[16]

Switch transformers: scaling to trillion parameter models with simple and efficient sparsity.J

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity.J. Mach. Learn. Res., 23(1), January 2022

2022
[17]

Parameter-efficient mixture-of- experts architecture for pre-trained language models

Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, and Ji-Rong Wen. Parameter-efficient mixture-of- experts architecture for pre-trained language models. arXiv preprint arXiv:2203.01104, 2022

arXiv 2022
[18]

Tokenweave: Efficient compute-communication overlap for distributed llm inference.arXiv preprint arXiv:2505.11329, 2025

Raja Gond, Nipun Kwatra, and Ramachandran Ram- jee. Tokenweave: Efficient compute-communication overlap for distributed llm inference.arXiv preprint arXiv:2505.11329, 2025

Pith/arXiv arXiv 2025
[19]

Past-future scheduler for llm serving under sla guar- antees

Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zai- jun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. Past-future scheduler for llm serving under sla guar- antees. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’25, page 798–813, New York, NY , USA, 202...

2025
[20]

Flashdecoding++: Faster large language model infer- ence on gpus.arXiv preprint arXiv:2311.01282, 2024

Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model infer- ence on gpus.arXiv preprint arXiv:2311.01282, 2024

arXiv 2024
[21]

Memserve: Context caching for disaggregated llm serving with elastic mem- ory pool.arXiv preprint arXiv:2406.17565, 2024

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yun- gang Bao, and Ninghui Sun. Memserve: Context caching for disaggregated llm serving with elastic mem- ory pool.arXiv preprint arXiv:2406.17565, 2024

arXiv 2024
[22]

Deepserve: serverless large language model serving at scale

Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. Deepserve: serverless large language model serving at scale. InProceedings of the 2025 USENIX Conference on Us...

2025
[23]

Ad- vancing transformer architecture in long-context large language models: A comprehensive survey, 2024

Yunpeng Huang, Jingwei Xu, Junyu Lai, Zixu Jiang, Taolue Chen, Zenan Li, Yuan Yao, Xiaoxing Ma, Lijuan Yang, Hao Chen, Shupeng Li, and Penghao Zhao. Ad- vancing transformer architecture in long-context large language models: A comprehensive survey, 2024

2024
[24]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- pas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

2020
[25]

Efficient memory man- agement for large language model serving with page- dattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with page- dattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for Computing Machinery

2023
[26]

Lightseq: : Sequence level parallelism for distributed training of long context transformers

Dacheng Li, Rulin Shao, Anze Xie, Eric Xing, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. Lightseq: : Sequence level parallelism for distributed training of long context transformers. InWorkshop on Advancing Neural Network Training: Computa- tional Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023

2023
[27]

Sequence parallelism: Long se- quence training from system perspective.arXiv preprint arXiv:2105.13120, 2022

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yong- bin Li, and Yang You. Sequence parallelism: Long se- quence training from system perspective.arXiv preprint arXiv:2105.13120, 2022

arXiv 2022
[28]

Ub-mesh: A hierarchically localized nd-fullmesh data center network architecture.IEEE Micro, 45(5):20–29, 2025

Heng Liao, Bingyang Liu, Xianping Chen, Zhigang Guo, Chuanning Cheng, Jianbing Wang, Xiangyu Chen, Peng 13 Dong, Rui Meng, Wenjie Liu, Zhe Zhou, Ziyang Zhang, Yuhang Gai, Cunle Qian, Yi Xiong, Zhongwu Cheng, Jing Xia, Yuli Ma, Xi Chen, Wenhua Du, Shizhong Xiao, Chungang Li, Yong Qin, Liudong Xiong, Zhou Yu, Lv Chen, Lei Chen, Buyun Wang, Pei Wu, Junen Gao...

2025
[29]

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache.arXiv preprint arXiv:2401.02669, 2024

Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache.arXiv preprint arXiv:2401.02669, 2024

arXiv 2024
[30]

Deepseek-v3

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, and Chen Dong. Deepseek-v3. 2: Push- ing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

Pith/arXiv arXiv 2025
[31]

Revealing the challenges of attention-ffn disaggregation for modern moe models and hardware systems.arXiv preprint arXiv:2602.09721, 2026

Guowei Liu, Hongming Li, Yaning Guo, Yongxi Lyu, Mo Zhou, Yi Liu, Zhaogeng Li, and Yanpeng Wang. Revealing the challenges of attention-ffn disaggregation for modern moe models and hardware systems.arXiv preprint arXiv:2602.09721, 2026

arXiv 2026
[32]

Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:310.01889, 2023

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:310.01889, 2023

2023
[33]

Expert-as-a-service: Towards efficient, scalable, and robust large-scale moe serving.arXiv preprint arXiv:2509.17863, 2025

Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, and etc Peng Sun. Expert-as-a-service: Towards efficient, scalable, and robust large-scale moe serving.arXiv preprint arXiv:2509.17863, 2025

arXiv 2025
[34]

Moe-gps: Guid- lines for prediction strategy for dynamic expert duplica- tion in moe load balancing, 2025

Haiyue Ma, Zhixu Du, and Yiran Chen. Moe-gps: Guid- lines for prediction strategy for dynamic expert duplica- tion in moe load balancing, 2025

2025
[35]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. arXiv preprint arXiv:2503.07137, 2025

arXiv 2025
[36]

Mar- coni: Prefix caching for the era of hybrid llms

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Mar- coni: Prefix caching for the era of hybrid llms. In M. Za- haria, G. Joshi, and Y . Lin, editors,Proceedings of Ma- chine Learning and Systems, volume 7. MLSys, 2025

2025
[37]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InProceedings of the 51st Annual Interna- tional Symposium on Computer Architecture, ISCA ’24, page 118–132. IEEE Press, 2025

2025
[38]

Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25), pages 155–170, Santa Clara, CA, February 2025. USENIX Association

2025
[39]

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Min- jia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the ...

2022
[40]

Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

Pith/arXiv arXiv 1909
[41]

Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:505.21411, 2025

Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jin- peng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, and Yunhe Wang. Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint ...

2025
[42]

Kimi k2: Open agentic intelligence, 2025

Kimi Team, Yifan Bai, Yiping Bao, and etc Guan- duo Chen. Kimi k2: Open agentic intelligence, 2025

2025
[43]

Thompson

W.J. Thompson. Poisson distributions.Computing in Science and Engineering, 3(3):78–82, 2001

2001
[44]

Step-3 is large yet affordable: Model-system co-design for cost-effective decoding

Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, and Siyu Chen. Step-3 is large yet affordable: Model-system co-design for cost-effective decoding. arXiv preprint arXiv:2507.19427, 2025

arXiv 2025
[45]

Flexsp: Accelerating large language model training via flexible sequence parallelism

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’2...

2025
[46]

{WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, and Chris Cai. {WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 785–801, 2025

2025
[47]

Loongserve: Efficiently serving long-context large language models with elas- tic sequence parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elas- tic sequence parallelism. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Prin- ciples, SOSP ’24, page 640–654, New York, NY , USA,
[48]

Association for Computing Machinery
[49]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. InProceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles, pages 1030–1045, 2025

2025
[50]

Servegen: Workload characteriza- tion and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. Servegen: Workload characteriza- tion and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

Pith/arXiv arXiv 2025
[51]

Qwen3 technical report, 2025

An Yang, Anfeng Li, and etc Baosong Yang. Qwen3 technical report, 2025

2025
[52]

Gonzalez, Clark Bar- rett, and Ying Sheng

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Bar- rett, and Ying Sheng. Sglang: efficient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Infor- mation Processing Systems, NIPS ’24, Red Hoo...

2024
[53]

Bucketserve: Bucket-based dynamic batching for smart and efficient llm inference serving.arXiv preprint arXiv:2507.17120, 2025

Wanyi Zheng, Minxian Xu, Shengye Song, and Kejiang Ye. Bucketserve: Bucket-based dynamic batching for smart and efficient llm inference serving.arXiv preprint arXiv:2507.17120, 2025

arXiv 2025
[54]

{DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

2024
[55]

Sampleattention: Near- lossless acceleration of long context llm inference with adaptive structured sparse attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, and Chao Yang. Sampleattention: Near- lossless acceleration of long context llm inference with adaptive structured sparse attention. In M. Zaharia, G. Joshi, and Y . Lin, editors,Proceedings of Machine Learning and Systems, volume 7. MLSys, 2025

2025
[56]

Megascale-infer: Efficient mixture-of-experts model serving with disag- gregated expert parallelism

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Ce- sar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, and Yang Cheng. Megascale-infer: Efficient mixture-of-experts model serving with disag- gregated expert parallelism. InProceedings of the ACM SIGCOMM 2025 Conference, pages 592–608, 2025

2025
[57]

Serving large language models on huawei cloudmatrix384.arXiv preprint arXiv:2506.12708, 2025

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, and Shirui Lu. Serving large language models on huawei cloudmatrix384.arXiv preprint arXiv:2506.12708, 2025. 15

arXiv 2025

[1] [1]

https://www.mindspore.cn/tutorials/experts /en/r2.3.1/operation/op_custom_ascendc.htm l, 2025

Ascend C Custom Operator Development and Usage. https://www.mindspore.cn/tutorials/experts /en/r2.3.1/operation/op_custom_ascendc.htm l, 2025

2025

[2] [2]

https://www.hiascend.com/cann, 2025

Compute Architecture for Neural Networks (CANN). https://www.hiascend.com/cann, 2025

2025

[3] [3]

https://www.hiasce nd.com/document/detail/zh/canncommercial/8 3RC1/API/ascendcopapi/atlasascendc_api_07_ 0102.html, 2025

Data Movement in Ascend C. https://www.hiasce nd.com/document/detail/zh/canncommercial/8 3RC1/API/ascendcopapi/atlasascendc_api_07_ 0102.html, 2025

2025

[4] [4]

https://www

Memory Management of Ascend NPU. https://www. hiascend.com/document/detail/zh/canncommer cial/83RC1/API/appdevgapi/aclpythondevg_01 _0110.html, 2025

2025

[5] [5]

Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachan- dran Ramjee. Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

Pith/arXiv arXiv 2023

[6] [6]

Striped attention: Faster ring attention for causal transformers.arXiv preprint arXiv:2311.09431, 2023

William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers.arXiv preprint arXiv:2311.09431, 2023

arXiv 2023

[7] [7]

Characterizing cloud-native llm inference at bytedance and exposing optimization challenges and opportunities for future ai accelerators

Jingwei Cai, Dehao Kong, Hantao Huang, Zishan Jiang, Zixuan Ma, Qingyu Guo, Zhenxing Zhang, Guiming Shi, Mingyu Gao, and Kaisheng Ma. Characterizing cloud-native llm inference at bytedance and exposing optimization challenges and opportunities for future ai accelerators. In2026 IEEE International Symposium on High Performance Computer Architecture (HPCA),...

2026

[8] [8]

Amali: An analytical model for accurately modeling llm inference on modern gpus

Shiheng Cao, Junmin Wu, Junshi Chen, Hong An, and Zhibin Yu. Amali: An analytical model for accurately modeling llm inference on modern gpus. InProceedings of the 52nd Annual International Symposium on Com- puter Architecture, ISCA ’25, page 1495–1508, New York, NY , USA, 2025. Association for Computing Ma- chinery

2025

[9] [9]

Generating long sequences with sparse trans- formers.arXiv preprint arXiv:1904.10509, 2019

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse trans- formers.arXiv preprint arXiv:1904.10509, 2019. 12

Pith/arXiv arXiv 1904

[10] [10]

Lazy batching: An sla-aware batching system for cloud ma- chine learning inference

Yujeong Choi, Yunseong Kim, and Minsoo Rhu. Lazy batching: An sla-aware batching system for cloud ma- chine learning inference. In2021 IEEE International Symposium on High-Performance Computer Architec- ture (HPCA), pages 493–506, 2021

2021

[11] [11]

Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing, 2024

Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, and Tong Yang. Prediction is all moe needs: Expert load distribution goes from fluctuating to stabilizing, 2024

2024

[12] [12]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.arXiv preprint arXiv:2205.14135, 2022

Pith/arXiv arXiv 2022

[13] [13]

Deepseek-v3 technical report, 2025

DeepSeek-AI, Aixin Liu, Bei Feng, and etc Bing Xue. Deepseek-v3 technical report, 2025

2025

[14] [14]

Longnet: Scaling transformers to 1,000,000,000 tokens, 2023

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens, 2023

2023

[15] [15]

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

arXiv 2022

[16] [16]

Switch transformers: scaling to trillion parameter models with simple and efficient sparsity.J

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity.J. Mach. Learn. Res., 23(1), January 2022

2022

[17] [17]

Parameter-efficient mixture-of- experts architecture for pre-trained language models

Ze-Feng Gao, Peiyu Liu, Wayne Xin Zhao, Zhong-Yi Lu, and Ji-Rong Wen. Parameter-efficient mixture-of- experts architecture for pre-trained language models. arXiv preprint arXiv:2203.01104, 2022

arXiv 2022

[18] [18]

Tokenweave: Efficient compute-communication overlap for distributed llm inference.arXiv preprint arXiv:2505.11329, 2025

Raja Gond, Nipun Kwatra, and Ramachandran Ram- jee. Tokenweave: Efficient compute-communication overlap for distributed llm inference.arXiv preprint arXiv:2505.11329, 2025

Pith/arXiv arXiv 2025

[19] [19]

Past-future scheduler for llm serving under sla guar- antees

Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zai- jun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. Past-future scheduler for llm serving under sla guar- antees. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’25, page 798–813, New York, NY , USA, 202...

2025

[20] [20]

Flashdecoding++: Faster large language model infer- ence on gpus.arXiv preprint arXiv:2311.01282, 2024

Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model infer- ence on gpus.arXiv preprint arXiv:2311.01282, 2024

arXiv 2024

[21] [21]

Memserve: Context caching for disaggregated llm serving with elastic mem- ory pool.arXiv preprint arXiv:2406.17565, 2024

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yun- gang Bao, and Ninghui Sun. Memserve: Context caching for disaggregated llm serving with elastic mem- ory pool.arXiv preprint arXiv:2406.17565, 2024

arXiv 2024

[22] [22]

Deepserve: serverless large language model serving at scale

Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, and Yizhou Shan. Deepserve: serverless large language model serving at scale. InProceedings of the 2025 USENIX Conference on Us...

2025

[23] [23]

Ad- vancing transformer architecture in long-context large language models: A comprehensive survey, 2024

Yunpeng Huang, Jingwei Xu, Junyu Lai, Zixu Jiang, Taolue Chen, Zenan Li, Yuan Yao, Xiaoxing Ma, Lijuan Yang, Hao Chen, Shupeng Li, and Penghao Zhao. Ad- vancing transformer architecture in long-context large language models: A comprehensive survey, 2024

2024

[24] [24]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- pas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

2020

[25] [25]

Efficient memory man- agement for large language model serving with page- dattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with page- dattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY , USA, 2023. Association for Computing Machinery

2023

[26] [26]

Lightseq: : Sequence level parallelism for distributed training of long context transformers

Dacheng Li, Rulin Shao, Anze Xie, Eric Xing, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. Lightseq: : Sequence level parallelism for distributed training of long context transformers. InWorkshop on Advancing Neural Network Training: Computa- tional Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023), 2023

2023

[27] [27]

Sequence parallelism: Long se- quence training from system perspective.arXiv preprint arXiv:2105.13120, 2022

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yong- bin Li, and Yang You. Sequence parallelism: Long se- quence training from system perspective.arXiv preprint arXiv:2105.13120, 2022

arXiv 2022

[28] [28]

Ub-mesh: A hierarchically localized nd-fullmesh data center network architecture.IEEE Micro, 45(5):20–29, 2025

Heng Liao, Bingyang Liu, Xianping Chen, Zhigang Guo, Chuanning Cheng, Jianbing Wang, Xiangyu Chen, Peng 13 Dong, Rui Meng, Wenjie Liu, Zhe Zhou, Ziyang Zhang, Yuhang Gai, Cunle Qian, Yi Xiong, Zhongwu Cheng, Jing Xia, Yuli Ma, Xi Chen, Wenhua Du, Shizhong Xiao, Chungang Li, Yong Qin, Liudong Xiong, Zhou Yu, Lv Chen, Lei Chen, Buyun Wang, Pei Wu, Junen Gao...

2025

[29] [29]

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache.arXiv preprint arXiv:2401.02669, 2024

Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, and Wei Lin. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache.arXiv preprint arXiv:2401.02669, 2024

arXiv 2024

[30] [30]

Deepseek-v3

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingx- uan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, and Chen Dong. Deepseek-v3. 2: Push- ing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

Pith/arXiv arXiv 2025

[31] [31]

Revealing the challenges of attention-ffn disaggregation for modern moe models and hardware systems.arXiv preprint arXiv:2602.09721, 2026

Guowei Liu, Hongming Li, Yaning Guo, Yongxi Lyu, Mo Zhou, Yi Liu, Zhaogeng Li, and Yanpeng Wang. Revealing the challenges of attention-ffn disaggregation for modern moe models and hardware systems.arXiv preprint arXiv:2602.09721, 2026

arXiv 2026

[32] [32]

Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:310.01889, 2023

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:310.01889, 2023

2023

[33] [33]

Expert-as-a-service: Towards efficient, scalable, and robust large-scale moe serving.arXiv preprint arXiv:2509.17863, 2025

Ziming Liu, Boyu Tian, Guoteng Wang, Zhen Jiang, and etc Peng Sun. Expert-as-a-service: Towards efficient, scalable, and robust large-scale moe serving.arXiv preprint arXiv:2509.17863, 2025

arXiv 2025

[34] [34]

Moe-gps: Guid- lines for prediction strategy for dynamic expert duplica- tion in moe load balancing, 2025

Haiyue Ma, Zhixu Du, and Yiran Chen. Moe-gps: Guid- lines for prediction strategy for dynamic expert duplica- tion in moe load balancing, 2025

2025

[35] [35]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

Siyuan Mu and Sen Lin. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. arXiv preprint arXiv:2503.07137, 2025

arXiv 2025

[36] [36]

Mar- coni: Prefix caching for the era of hybrid llms

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Mar- coni: Prefix caching for the era of hybrid llms. In M. Za- haria, G. Joshi, and Y . Lin, editors,Proceedings of Ma- chine Learning and Systems, volume 7. MLSys, 2025

2025

[37] [37]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InProceedings of the 51st Annual Interna- tional Symposium on Computer Architecture, ISCA ’24, page 118–132. IEEE Press, 2025

2025

[38] [38]

Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: Trading more storage for less computation — a KVCache-centric architecture for serving LLM chatbot. In23rd USENIX Confer- ence on File and Storage Technologies (FAST 25), pages 155–170, Santa Clara, CA, February 2025. USENIX Association

2025

[39] [39]

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Min- jia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors,Proceedings of the ...

2022

[40] [40]

Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2020

Pith/arXiv arXiv 1909

[41] [41]

Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:505.21411, 2025

Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jin- peng Li, Hui Zang, Fei Mi, Xiaojun Meng, Zhicheng Liu, Hanting Chen, Binfan Zheng, Can Chen, Youliang Yan, Ruiming Tang, Peifeng Qin, Xinghao Chen, Dacheng Tao, and Yunhe Wang. Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint ...

2025

[42] [42]

Kimi k2: Open agentic intelligence, 2025

Kimi Team, Yifan Bai, Yiping Bao, and etc Guan- duo Chen. Kimi k2: Open agentic intelligence, 2025

2025

[43] [43]

Thompson

W.J. Thompson. Poisson distributions.Computing in Science and Engineering, 3(3):78–82, 2001

2001

[44] [44]

Step-3 is large yet affordable: Model-system co-design for cost-effective decoding

Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, and Siyu Chen. Step-3 is large yet affordable: Model-system co-design for cost-effective decoding. arXiv preprint arXiv:2507.19427, 2025

arXiv 2025

[45] [45]

Flexsp: Accelerating large language model training via flexible sequence parallelism

Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, and Bin Cui. Flexsp: Accelerating large language model training via flexible sequence parallelism. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’2...

2025

[46] [46]

{WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training

Zheng Wang, Anna Cai, Xinfeng Xie, Zaifeng Pan, Yue Guan, Weiwei Chu, Jie Wang, Shikai Li, Jianyu Huang, and Chris Cai. {WLB-LLM}:{Workload-Balanced} 4d parallelism for large language model training. In19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), pages 785–801, 2025

2025

[47] [47]

Loongserve: Efficiently serving long-context large language models with elas- tic sequence parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elas- tic sequence parallelism. InProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Prin- ciples, SOSP ’24, page 640–654, New York, NY , USA,

[48] [48]

Association for Computing Machinery

[49] [49]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. InProceedings of the ACM SIGOPS 31st Symposium on Operating Sys- tems Principles, pages 1030–1045, 2025

2025

[50] [50]

Servegen: Workload characteriza- tion and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

Yuxing Xiang, Xue Li, Kun Qian, Wenyuan Yu, Ennan Zhai, and Xin Jin. Servegen: Workload characteriza- tion and generation of large language model serving in production.arXiv preprint arXiv:2505.09999, 2025

Pith/arXiv arXiv 2025

[51] [51]

Qwen3 technical report, 2025

An Yang, Anfeng Li, and etc Baosong Yang. Qwen3 technical report, 2025

2025

[52] [52]

Gonzalez, Clark Bar- rett, and Ying Sheng

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Bar- rett, and Ying Sheng. Sglang: efficient execution of structured language model programs. InProceedings of the 38th International Conference on Neural Infor- mation Processing Systems, NIPS ’24, Red Hoo...

2024

[53] [53]

Bucketserve: Bucket-based dynamic batching for smart and efficient llm inference serving.arXiv preprint arXiv:2507.17120, 2025

Wanyi Zheng, Minxian Xu, Shengye Song, and Kejiang Ye. Bucketserve: Bucket-based dynamic batching for smart and efficient llm inference serving.arXiv preprint arXiv:2507.17120, 2025

arXiv 2025

[54] [54]

{DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

2024

[55] [55]

Sampleattention: Near- lossless acceleration of long context llm inference with adaptive structured sparse attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, and Chao Yang. Sampleattention: Near- lossless acceleration of long context llm inference with adaptive structured sparse attention. In M. Zaharia, G. Joshi, and Y . Lin, editors,Proceedings of Machine Learning and Systems, volume 7. MLSys, 2025

2025

[56] [56]

Megascale-infer: Efficient mixture-of-experts model serving with disag- gregated expert parallelism

Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Ce- sar A Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, and Yang Cheng. Megascale-infer: Efficient mixture-of-experts model serving with disag- gregated expert parallelism. InProceedings of the ACM SIGCOMM 2025 Conference, pages 592–608, 2025

2025

[57] [57]

Serving large language models on huawei cloudmatrix384.arXiv preprint arXiv:2506.12708, 2025

Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, and Shirui Lu. Serving large language models on huawei cloudmatrix384.arXiv preprint arXiv:2506.12708, 2025. 15

arXiv 2025