MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

Aurick Qiao; Juncheng Yang; Karthik Ganesan; Olatunji Ruwase; Samyam Rajbhandari; Yue Cheng; Yuxiong He; Zhaoyuan Su

arxiv: 2605.02960 · v2 · pith:SGRHL77Bnew · submitted 2026-05-03 · 💻 cs.LG


Pith reviewed 2026-05-19 17:38 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of experts · prefill serving · asynchronous expert parallelism · large language models · distributed inference · throughput optimization · zero redundancy


The pith

MoE prefill serving eliminates redundant overheads by asynchronously gathering expert weights during compute-bound phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard distributed strategies for MoE models incur unnecessary computation and communication costs during prefill because they couple expert placement with synchronous activation routing, a holdover from decoding workloads. For prefill-only tasks with large batches, the extended compute time per layer allows expert weights to be streamed asynchronously and overlapped with computation, replacing activation AllToAll with weight AllGather. A sympathetic reader would care because this directly improves efficiency for common production tasks like classification and recommendation on massive MoE models without sacrificing accuracy or requiring additional memory reductions.

Core claim

MoE-Prefill uses AsyncEP to gather experts by weight asynchronously rather than routing activations synchronously, fully overlapping the AllGather with the long forward passes of large-batch prefill, while the frontend uses prefix-aware routing and true-FLOPs load tracking to enforce saturation thresholds.
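The frontend behaviour named in the core claim (and spelled out in the Figure 8 caption further down) can be sketched as a small router. This is a minimal illustration, not the paper's scheduler: `GpuState`, `prefix_blocks`, `route`, the 16-token block size, and the linear per-token FLOPs cost are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GpuState:
    """Per-GPU bookkeeping kept by the hypothetical frontend."""
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks
    load_flops: float = 0.0                          # true-FLOPs load of the open batch


def prefix_blocks(tokens, block_size=16):
    """Hash every block-aligned prefix so a match implies all earlier blocks match."""
    return [hash(tuple(tokens[:end]))
            for end in range(block_size, len(tokens) + 1, block_size)]


def matched_blocks(gpu, blocks):
    """Number of leading blocks of the request already cached on this GPU."""
    count = 0
    for block in blocks:
        if block not in gpu.cached_blocks:
            break
        count += 1
    return count


def route(gpus, tokens, flops_per_token, threshold, block_size=16):
    """Return the index of the chosen GPU, or None if every GPU is saturated."""
    blocks = prefix_blocks(tokens, block_size)
    # Stage 3 (overlap-aware balancing): GPUs at or above the threshold are closed.
    open_ids = [i for i, g in enumerate(gpus) if g.load_flops < threshold]
    if not open_ids:
        return None
    # Stage 1 (prefix-aware routing): longest cache match wins, lower load breaks ties.
    best = max(open_ids,
               key=lambda i: (matched_blocks(gpus[i], blocks), -gpus[i].load_flops))
    gpu = gpus[best]
    # Stage 2 (compute-aware tracking): charge only tokens not covered by the cache.
    # Assumption: cost is linear in uncached tokens; attention cost is ignored.
    reused_tokens = matched_blocks(gpu, blocks) * block_size
    gpu.load_flops += (len(tokens) - reused_tokens) * flops_per_token
    gpu.cached_blocks.update(blocks)
    return best
```

In this sketch a request that shares a cached prefix is charged only for its uncached tail, which is the sense in which the load counter tracks "true" FLOPs rather than raw token counts. How the paper derives the threshold from the backend is not shown on this page, so `threshold` is simply a caller-supplied number here.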

What carries the argument

Asynchronous Expert Parallelism (AsyncEP), which streams expert weights in the background to replace per-layer activation AllToAll with overlapped weight AllGather.
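The overlap pattern can be illustrated with a toy pipeline in which the next layer's expert weights are gathered in the background while the current layer computes. This is a sketch under stated assumptions, not the AsyncEP implementation: a worker thread stands in for the communication stream, list concatenation stands in for the weight AllGather, and `gather_expert_weights`, `moe_layer`, and `prefill_forward` are hypothetical names. It also gathers the first layer up front without overlap, a simplification relative to the Figure 7 caption, which mentions replicating first-layer experts.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_expert_weights(shards):
    """Stand-in for a weight AllGather: concatenate every rank's expert shard."""
    return [weight for shard in shards for weight in shard]


def moe_layer(activations, experts):
    """Toy expert compute: token i is scaled by expert (i mod number of experts)."""
    return [x * experts[i % len(experts)] for i, x in enumerate(activations)]


def prefill_forward(activations, sharded_layers):
    """Run all layers, gathering layer i+1's experts while layer i computes.

    Assumes at least one layer. `sharded_layers[i]` holds one shard per rank.
    """
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(gather_expert_weights, sharded_layers[0])
        for layer in range(len(sharded_layers)):
            # Blocks only if the background gather is slower than the compute
            # it was meant to hide behind; that wait is the stall to avoid.
            experts = pending.result()
            if layer + 1 < len(sharded_layers):
                pending = comm.submit(gather_expert_weights, sharded_layers[layer + 1])
            activations = moe_layer(activations, experts)
    return activations
```

Because every rank ends up holding the full set of expert weights for the layer it is computing, tokens never leave their GPU in this scheme, which is why no activation AllToAll appears in the loop.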

If this is right

Throughput improves by 1.35-1.37x over the strongest baseline on real-world workloads for Qwen3-235B-A22B.
Up to 1.59x throughput gain on long-context synthetic workloads.
Per-GPU model FLOPs utilization reaches 29.8-36.2% across four hardware and precision setups.
The approach maintains accuracy with no new bottlenecks introduced by the weight streaming.
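As a rough aid to reading the utilization figure, MFU can be estimated from per-GPU token throughput. The sketch below assumes the common approximation of 2 FLOPs per active parameter per token for a forward pass and ignores attention-score FLOPs; the paper's exact MFU accounting is not reproduced on this page, and `estimate_mfu` plus the throughput and peak-FLOPs values in the usage lines are placeholders, not measurements.

```python
def estimate_mfu(tokens_per_second_per_gpu, active_params, peak_flops_per_gpu):
    """Fraction of peak FLOPs spent on model math (forward pass only).

    Assumption: ~2 FLOPs per active parameter per token. Attention-score
    FLOPs are ignored, so long-context workloads are underestimated.
    """
    achieved_flops = 2.0 * active_params * tokens_per_second_per_gpu
    return achieved_flops / peak_flops_per_gpu


# Placeholder inputs (not taken from the paper): 22e9 active parameters,
# 7,000 tokens/s per GPU, and a peak of 1e15 FLOP/s.
example = estimate_mfu(7_000, 22e9, 1e15)
print(f"MFU = {example:.1%}")
```

For an MoE model the active parameter count, not the total, is the relevant input, since only the routed experts contribute compute for a given token.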

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This separation of prefill parallelism from decode strategies could lead to specialized serving systems that switch modes based on workload type.
Similar asynchronous weight movement might apply to other compute-heavy phases in distributed model serving beyond MoE.
Adoption could reduce reliance on complex tensor and pipeline parallelism for prefill, simplifying deployment on heterogeneous hardware.

Load-bearing premise

The per-layer compute time during large-batch prefill is sufficiently long to allow complete overlapping of the asynchronous expert weight AllGather without introducing stalls or accuracy degradation.

What would settle it

An experiment showing that on the evaluated hardware configurations, the time to perform the weight AllGather exceeds the available compute window for some layers, resulting in no throughput improvement or utilization below 29 percent.
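The premise reduces to a per-layer inequality: compute time must be at least the weight AllGather time. A back-of-envelope check is sketched below. The cost model is an assumption of this note (bandwidth-only transfer in which each rank receives the shards it lacks, and 2 FLOPs per active parameter per token), and every function name and number is hypothetical rather than taken from the paper.

```python
def allgather_seconds(layer_expert_bytes, world_size, link_bytes_per_second):
    """Time for one rank to receive the expert shards it does not already hold."""
    received = layer_expert_bytes * (world_size - 1) / world_size
    return received / link_bytes_per_second


def compute_seconds(tokens_per_gpu, active_params_per_layer, sustained_flops):
    """Per-layer forward time under the 2-FLOPs-per-active-parameter approximation."""
    return 2.0 * active_params_per_layer * tokens_per_gpu / sustained_flops


def min_tokens_to_hide_gather(layer_expert_bytes, world_size, link_bytes_per_second,
                              active_params_per_layer, sustained_flops):
    """Smallest per-GPU batch (in tokens) whose compute covers the gather."""
    comm = allgather_seconds(layer_expert_bytes, world_size, link_bytes_per_second)
    return comm * sustained_flops / (2.0 * active_params_per_layer)


# Placeholder values for illustration only.
comm = allgather_seconds(8e9, 8, 1e11)
comp = compute_seconds(10_000, 1e8, 1e14)
print("gather hidden" if comp >= comm else "gather stalls compute")
print(min_tokens_to_hide_gather(8e9, 8, 1e11, 1e8, 1e14))
```

Under this model, a batch smaller than the returned token count leaves the gather exposed, which is the regime the settling experiment above would need to probe layer by layer.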

Figures

Figures reproduced from arXiv: 2605.02960 by Aurick Qiao, Juncheng Yang, Karthik Ganesan, Olatunji Ruwase, Samyam Rajbhandari, Yue Cheng, Yuxiong He, Zhaoyuan Su.

**Figure 1.** Prefill-only workloads dominate LLM serving input…

**Figure 2.** MoE model size vs. per-GPU HBM capacity across…

**Figure 4.** Expert routing imbalance of Qwen3-30B-A3B on…

**Figure 5.** ZeRO-Prefill system architecture and end-to-end prefill-only serving workflow. The frontend normalizes incoming tasks into prefill-only form and schedules them into saturation-bounded batches with prefix affinity; the backend executes each batch under data-parallel attention and asynchronous expert streaming, returning logits without entering any decoding loop.

**Figure 6.** Conventional synchronous DP+EP with four GPUs.

**Figure 7.** AsyncEP execution models with four GPUs. (a) Each GPU replicates all experts for the first layer and gathers subsequent…

**Figure 8.** ZeRO-Prefill frontend scheduling with four GPUs, realized in three stages: (1) Prefix-aware routing picks the GPU with the longest block-level cache match; (2) Compute-aware tracking updates each GPU’s true-FLOPs load after prefix-sharing credit; (3) Overlap-aware balancing marks a GPU saturated once its load reaches the backend-derived threshold T.

**Figure 9.** End-to-end throughput on the aggregated real-world prefill-only workload.

**Figure 10.** Contribution of ZeRO-Prefill’s two design tiers on the real-world workload. DP+AsyncEP applies the backend of §6 under vLLM’s default scheduler; ZeRO-Prefill additionally applies the frontend of §7. Annotations report the throughput gain of adding the frontend over the backend-only configuration at each parallel degree.

**Figure 11.** Throughput under synthetic workloads with no prefix reuse, across four context regimes on 8…

**Figure 12.** MFU on Qwen3-235B-A22B (H100, FP8) under synthetic no-prefix-reuse workloads, across four context regimes…

Original abstract

Production LLM workloads increasingly serve discriminative tasks, such as classification, recommendation, and verification, whose answers are read from the logits of a single prefill pass with no autoregressive decoding. Serving these prefill-only workloads on mixture-of-experts (MoE) models is bottlenecked not by compute but by the distributed execution required to fit the model: existing parallel strategies (tensor, expert, and pipeline parallelism) trade memory pressure for redundant computation, communication, and synchronization, severely degrading MoE prefill serving efficiency. We observe that these overheads stem from coupling expert placement with synchronous activation routing -- a design inherited from the decoding era. The long, compute-bound forward passes of large-batch prefill open a per-layer window wide enough to stream expert weights in the background, replacing per-layer activation AllToAll with asynchronous weight AllGather fully overlapped with computation. We propose MoE-Prefill, a prefill-only serving system whose backend, AsyncEP (Asynchronous Expert Parallelism), gathers experts by weight rather than routing them by activation, and whose frontend co-enforces a physically-derived saturation threshold through prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, MoE-Prefill delivers 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads and up to 1.59x on long-context synthetic workloads, sustaining 29.8-36.2% per-GPU model FLOPs utilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoE-Prefill swaps activation AllToAll for overlapped async weight AllGather in prefill-only MoE serving and reports clear throughput gains, but the overlap claim needs tighter validation.


The core idea is straightforward: for prefill-only workloads on big MoE models, the long compute phases per layer give enough time to pull expert weights asynchronously instead of doing synchronous activation routing. This replaces the usual AllToAll with an AllGather that runs in the background, and they add prefix-aware routing plus true-FLOPs tracking to keep load balanced up to a saturation point. On Qwen3-235B-A22B they show 1.35-1.37x throughput on real-world workloads and up to 1.59x on long-context synthetic ones, with per-GPU MFU of 29.8-36.2% across four hardware/precision setups. That is the practical payoff they are after for classification-style serving where you skip decoding entirely.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MoE-Prefill, a prefill-only serving system for large MoE models that replaces per-layer activation AllToAll with asynchronous weight AllGather (via AsyncEP) fully overlapped with the long compute-bound forward passes of large-batch prefill. A frontend enforces a saturation threshold via prefix-aware routing and true-FLOPs load tracking. On Qwen3-235B-A22B across four hardware/precision configurations, it reports 1.35-1.37x throughput over the strongest distributed baseline on real-world workloads, up to 1.59x on long-context synthetic workloads, and 29.8-36.2% per-GPU model FLOPs utilization.

Significance. If the asynchronous overlap can be shown to complete without stalls, load imbalance, or accuracy loss, the approach would meaningfully advance efficient serving of prefill-dominant discriminative workloads on MoE models by removing redundancy from traditional tensor/expert/pipeline parallelism. The multi-configuration empirical results on real hardware provide a concrete basis for assessing practical gains in throughput and utilization.

major comments (3)

[§4 and Evaluation] §4 (AsyncEP design) and Evaluation: The central performance claims rest on the assertion that per-layer compute windows are wide enough to fully hide the cost of background expert weight AllGather. No quantitative overlap fractions, per-layer timing breakdowns, communication volume measurements, or memory-footprint data during streaming are provided to confirm absence of stalls or new bottlenecks.
[Evaluation section] Evaluation section: Throughput and MFU numbers (1.35-1.59x, 29.8-36.2%) are reported without sufficient detail on baseline implementations, exact workload definitions, measurement methodology, or statistical variance across runs. This information is required to substantiate the gains over the strongest distributed baseline.
[§3 and Accuracy] §3 (frontend routing) and Accuracy: Prefix-aware routing combined with true-FLOPs tracking is claimed to prevent imbalance, yet no load-imbalance metrics, expert utilization histograms, or end-to-end accuracy verification on the reported workloads are shown. Any drift here would directly undermine the reported MFU and throughput.

minor comments (2)

[§3] Clarify the exact definition and computation of the 'saturation threshold' and 'true-FLOPs' tracking in the frontend description.
[Related Work] Add a brief related-work paragraph contrasting AsyncEP with prior asynchronous communication techniques in MoE inference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address each of the major comments in detail below. We are committed to improving the clarity and completeness of the paper based on this feedback.


Referee: [§4 and Evaluation] §4 (AsyncEP design) and Evaluation: The central performance claims rest on the assertion that per-layer compute windows are wide enough to fully hide the cost of background expert weight AllGather. No quantitative overlap fractions, per-layer timing breakdowns, communication volume measurements, or memory-footprint data during streaming are provided to confirm absence of stalls or new bottlenecks.

Authors: We acknowledge that the manuscript would benefit from more detailed profiling to explicitly demonstrate the overlap. In the revised version, we will add quantitative overlap fractions, per-layer timing breakdowns, and communication volume measurements from our experiments. These will confirm that the AllGather is fully overlapped without introducing stalls or new bottlenecks, as evidenced by the sustained high MFU and throughput gains.
Revision: yes

Referee: [Evaluation section] Evaluation section: Throughput and MFU numbers (1.35-1.59x, 29.8-36.2%) are reported without sufficient detail on baseline implementations, exact workload definitions, measurement methodology, or statistical variance across runs. This information is required to substantiate the gains over the strongest distributed baseline.

Authors: We agree that additional details on the experimental setup are necessary for reproducibility and to substantiate the claims. We will expand the Evaluation section to include precise descriptions of the baseline implementations (e.g., how tensor, expert, and pipeline parallelism are configured in the strongest baseline), exact definitions of the real-world and synthetic workloads, and the measurement methodology (including how throughput and MFU are calculated), and we will report statistical variance across multiple runs.
Revision: yes

Referee: [§3 and Accuracy] §3 (frontend routing) and Accuracy: Prefix-aware routing combined with true-FLOPs tracking is claimed to prevent imbalance, yet no load-imbalance metrics, expert utilization histograms, or end-to-end accuracy verification on the reported workloads are shown. Any drift here would directly undermine the reported MFU and throughput.

Authors: We will include load-imbalance metrics and expert utilization histograms in the revised manuscript to demonstrate the effectiveness of the prefix-aware routing and true-FLOPs tracking in maintaining balance. Regarding accuracy, since the routing decisions affect only the distribution of computation without altering the model parameters or the final computation graph, the end-to-end accuracy remains identical to the baseline. We will add a note or verification on this point for the reported workloads.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical systems paper with hardware-validated throughput claims


The paper presents an engineering system (AsyncEP + prefix-aware routing) whose core claim is that large-batch prefill compute windows suffice to fully overlap asynchronous expert weight AllGather with per-layer computation, replacing activation AllToAll. This is not derived from equations or fitted parameters but stated as an observation about compute-bound prefill phases, then implemented and measured directly on Qwen3-235B-A22B across four hardware/precision setups. Throughput (1.35-1.59x) and MFU (29.8-36.2%) numbers are reported from real runs rather than any self-referential prediction or self-citation chain. No load-bearing mathematical derivation, uniqueness theorem, or ansatz appears; the work is self-contained against external benchmarks of measured performance.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claim rests on the domain assumption that prefill compute phases are long enough for full overlap of weight movement, plus the new system components AsyncEP and prefix-aware routing; no free parameters are explicitly fitted to data and no new physical entities are postulated.

free parameters (1)

saturation threshold: Physically-derived value enforced through prefix-aware routing and true-FLOPs load tracking to avoid overload.

axioms (2)

domain assumption: Existing tensor, expert, and pipeline parallelism trade memory pressure for redundant computation, communication, and synchronization in MoE prefill. Stated as the source of overheads inherited from decoding-era designs.
domain assumption: Per-layer activation AllToAll can be replaced by asynchronous weight AllGather fully overlapped with computation. Core premise enabling zero redundancy in the proposed backend.

invented entities (2)

AsyncEP (no independent evidence). Purpose: backend implementing asynchronous expert parallelism that gathers experts by weight rather than routing activations. New system component introduced to replace synchronous strategies.
MoE-Prefill (no independent evidence). Purpose: overall prefill-only serving system combining the AsyncEP backend with a prefix-aware frontend. New end-to-end system design proposed in the paper.


Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · 16 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt- oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachan- dran Ramjee. Sarathi: Efficient llm inference by piggy- backing decodes with chunked prefills.arXiv preprint arXiv:2308.16369, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Am- mar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient in- ference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, ...

work page 2022
[4]

A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, 2025

work page 2025
[5]

Moe-lightning: High-throughput moe inference on memory-constrained gpus

Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xi- aoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Za- haria, and Ion Stoica. Moe-lightning: High-throughput moe inference on memory-constrained gpus. InPro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–...

work page 2025
[6]

LexGLUE: A benchmark dataset for legal language understanding in English

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bom- marito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal language understanding in English. InProceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022

work page 2022
[7]

Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways.Journal of machine learning research, 24(240):1–113, 2023

work page 2023
[8]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 conference of the north American chapter of the association for com- putational linguistics: Human language technologies, volume 1 (long and short papers)...

work page 2019
[9]

Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344– 16359, 2022

work page 2022
[10]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingx- uan Wang, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

GoE- motions: A dataset of fine-grained emotions

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoE- motions: A dataset of fine-grained emotions. InPro- ceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, 2020

work page 2020
[12]

Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications

Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xi- aoxuan Liu, Yifan Qiao, et al. Prefillonly: An infer- ence engine for prefill-only workloads in large language model applications. InProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 399–414, 2025

work page 2025
[13]

Glam: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning, pages 5547–5569. PMLR, 2022

work page 2022
[14]

Moral stories: Situated reason- ing about norms, intents, actions, and their consequences

Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi. Moral stories: Situated reason- ing about norms, intents, actions, and their consequences. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698– 718, 2021

work page 2021
[15]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learn- ing Research, 23(120):1–39, 2022

work page 2022
[16]

Cascade Inference

FlashInfer. Cascade Inference. https://flashinfer. ai/2024/02/02/cascade-inference.html, 2024. Blog post. Accessed: 2026-04-23

work page 2024
[17]

Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts.Proceedings of Machine Learning and Systems, 5:288–304, 2023

work page 2023
[18]

{Cost-Efficient} large lan- guage model serving for multi-turn conversations with {CachedAttention}

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou 13 Yu, and Pengfei Zuo. {Cost-Efficient} large lan- guage model serving for multi-turn conversations with {CachedAttention}. In2024 USENIX annual technical conference (USENIX ATC 24), pages 111–126, 2024

work page 2024
[19]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Ariel Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, et al. Elastic moe: Unlocking the inference-time scalability of mixture-of-experts.arXiv preprint arXiv:2509.21892, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing

Liwei Guo, Wonkyo Choe, and Felix Xiaozhu Lin. Sti: Turbocharge nlp inference at the edge via elastic pipelin- ing. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 791– 803, 2023

work page 2023
[22]

Fastmoe:

Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. Fastmoe: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262, 2021

work page arXiv 2021
[23]

Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dy- namic pre-trained models. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

work page 2022
[24]

Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. Long document classification from local word glimpses via recurrent attention learning.IEEE Access, 7:40707– 40718, 2019

work page 2019
[25]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[26]

Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Effi- cient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

work page 2019
[27]

Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale.Proceedings of Machine Learning and Systems, 5:269–287, 2023

work page 2023
[28]

Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. Pre- gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031. IEEE, 2024

work page 2024
[29]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture- of-experts training via whole graph computation- communication overlapping.Proceedings of Machine Learning and Systems, 6:74–86, 2024

work page 2024
[32]

Fu, Christopher Ré, and Azalia Mirhoseini

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydra- gen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024

work page arXiv 2024
[33]

Fid- dler: Cpu-gpu orchestration for fast inference of mixture- of-experts models.arXiv preprint arXiv:2402.07033,

Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models.arXiv preprint arXiv:2402.07033, 2024

work page arXiv 2024
[34]

Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6710–6720, 2024

work page 2024
[35]

Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation re- computation in large transformer models.Proceedings of Machine Learning and Systems, 5:341–353, 2023

work page 2023
[36]

Efficient memory manage- ment for large language model serving with pagedatten- tion

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory manage- ment for large language model serving with pagedatten- tion. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 14

work page 2023
[37]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, De- hao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling gi- ant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[38]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023
[39]

Accelerating distributed {MoE} training and inference with lina

Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed {MoE} training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

work page 2023
[40]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch dis- tributed: Experiences on accelerating data parallel train- ing.arXiv preprint arXiv:2006.15704, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[41]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Cachegen: Kv cache compression and streaming for fast large language model serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024
[43]

Learning word vectors for sentiment analysis

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011
[44]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023
[45]

Pipedream: Generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM symposium on operating systems principles, pages 1–15, 2019
[46]

Twitter Financial News Sentiment

Neural Magic. Twitter Financial News Sentiment. https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment, 2022. Hugging Face dataset. Accessed: 2026-04-23
[47]

NVIDIA H100 Tensor Core GPU Architecture Whitepaper

NVIDIA. NVIDIA H100 Tensor Core GPU Architecture Whitepaper. https://www.nvidia.com/en-us/data-center/h100/, 2026. Accessed: 2026-04-23
[48]

TensorRT-LLM

NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM, 2026. GitHub repository. Accessed: 2026-04-23
[49]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...
[50]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024
[51]

Eps-moe: Expert pipeline scheduler for cost-efficient moe inference

Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, and Xunliang Cai. Eps-moe: Expert pipeline scheduler for cost-efficient moe inference. arXiv preprint arXiv:2410.12247, 2024
[52]

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is chatgpt a general-purpose natural language processing task solver? In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 1339–1384, 2023
[53]

Mooncake: A kvcache-centric disaggregated architecture for llm serving

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Transactions on Storage, 2024
[54]

Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022
[55]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020
[56]

Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high performance computing, networking, storage and analysis, pages 1–14, 2021
[57]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017
[58]

Flexgen: High-throughput generative inference of large language models with a single gpu

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023
[59]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
[60]

Elasticmoe: An efficient auto scaling method for mixture-of-experts models

Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. Elasticmoe: An efficient auto scaling method for mixture-of-experts models. arXiv preprint arXiv:2510.02613, 2025
[61]

Text classification via large language models

Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classification via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8990–9005, 2023
[62]

The Toxicity Dataset

Surge AI. The Toxicity Dataset. https://github.com/surge-ai/toxicity, 2022. GitHub repository. Accessed: 2026-04-23
[63]

Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures

Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025
[64]

Moe-infinity: Offloading-efficient moe model serving

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Offloading-efficient moe model serving. arXiv e-prints, pages arXiv–2401, 2024
[65]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[66]

Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference

Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference. In 2024 IEEE International parallel and distributed processing symposium (IPDPS), pages 915–925. IEEE, 2024
[67]

Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition

Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11608–11620, 2024
[68]

Orca: A distributed serving system for Transformer-Based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022
[69]

Recommendation as instruction following: A large language model empowered recommendation approach

Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empowered recommendation approach. ACM Transactions on Information Systems, 43(5):1–37, 2026
[70]

Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv preprint arXiv:2411.16102, 2024
[71]

Sglang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024
[72]

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594, 2024
[73]

DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024.
[74]

Mixture-of-experts with expert choice routing

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022
[75]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Appendix A Scheduling Algorithm Pseudocode

Algorithm 1 summarizes one scheduling round of MoE-Prefill’s frontend (§7), integrating prefix-aware ...

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025

[2]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023

[3]

Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, ...

[4]

A survey on mixture of experts in large language models

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025

[5]

Moe-lightning: High-throughput moe inference on memory-constrained gpus

Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 715–...

[6]

LexGLUE: A benchmark dataset for legal language understanding in English

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, 2022

[7]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of machine learning research, 24(240):1–113, 2023

[8]

Boolq: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers)...

[9]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022

[10]

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[11]

GoEmotions: A dataset of fine-grained emotions

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054, 2020

[12]

Prefillonly: An inference engine for prefill-only workloads in large language model applications

Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, et al. Prefillonly: An inference engine for prefill-only workloads in large language model applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 399–414, 2025

[13]

Glam: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International conference on machine learning, pages 5547–5569. PMLR, 2022

[14]

Moral stories: Situated reasoning about norms, intents, actions, and their consequences

Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi. Moral stories: Situated reasoning about norms, intents, actions, and their consequences. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 698–718, 2021

[15]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

[16]

Cascade Inference

FlashInfer. Cascade Inference. https://flashinfer.ai/2024/02/02/cascade-inference.html, 2024. Blog post. Accessed: 2026-04-23

[17]

Megablocks: Efficient sparse training with mixture-of-experts

Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5:288–304, 2023

[18]

Cost-Efficient large language model serving for multi-turn conversations with CachedAttention

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. Cost-Efficient large language model serving for multi-turn conversations with CachedAttention. In 2024 USENIX annual technical conference (USENIX ATC 24), pages 111–126, 2024

[19]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Ariel Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[20]

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, et al. Elastic moe: Unlocking the inference-time scalability of mixture-of-experts. arXiv preprint arXiv:2509.21892, 2025

[21]

Sti: Turbocharge nlp inference at the edge via elastic pipelining

Liwei Guo, Wonkyo Choe, and Felix Xiaozhu Lin. Sti: Turbocharge nlp inference at the edge via elastic pipelining. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 791–803, 2023

[22]

Fastmoe: A fast mixture-of-expert training system

Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. Fastmoe: A fast mixture-of-expert training system. arXiv preprint arXiv:2103.13262, 2021

[23]

Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models

Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 120–134, 2022

[24]

Long document classification from local word glimpses via recurrent attention learning

Jun He, Liqun Wang, Liu Liu, Jiao Feng, and Hao Wu. Long document classification from local word glimpses via recurrent attention learning. IEEE Access, 7:40707–40718, 2019

[25]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[26]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

[27]

Tutel: Adaptive mixture-of-experts at scale

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 5:269–287, 2023

[28]

Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference

Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 1018–1031. IEEE, 2024

[29]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023

[30]

Mixtral of Experts

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

[31]

Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping

Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. Proceedings of Machine Learning and Systems, 6:74–86, 2024

[32]

Hydragen: High-throughput llm inference with shared prefixes

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydragen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024

[33]

Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models

Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, and Baris Kasikci. Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models. arXiv preprint arXiv:2402.07033, 2024

[34]

Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Xiaozhou Ye, Ye Ouyang, Linghe Kong, and Yunxin Liu. Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6710–6720, 2024

[35]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023

[36]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023.

[37] [37]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, De- hao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling gi- ant models with conditional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[38] [38]

Fast inference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

work page 2023

[39] [39]

Accelerating distributed {MoE} training and inference with lina

Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. Accelerating distributed {MoE} training and inference with lina. In2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 945–959, 2023

work page 2023

[40] [40]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch dis- tributed: Experiences on accelerating data parallel train- ing.arXiv preprint arXiv:2006.15704, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[41] [41]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring at- tention with blockwise transformers for near-infinite context.arXiv preprint arXiv:2310.01889, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[42] [42]

Cachegen: Kv cache compression and streaming for fast large lan- guage model serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large lan- guage model serving. InProceedings of the ACM SIG- COMM 2024 Conference, pages 38–56, 2024

work page 2024

[43] [43]

Learning word vectors for sentiment analysis

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. InProceedings of the 49th annual meeting of the association for computa- tional linguistics: Human language technologies, pages 142–150, 2011

work page 2011

[44] [44]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

work page 2023

[45] [45]

Pipedream: Gen- eralized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Gen- eralized pipeline parallelism for dnn training. InPro- ceedings of the 27th ACM symposium on operating sys- tems principles, pages 1–15, 2019

work page 2019

[46] [46]

Twitter Financial News Sentiment

Neural Magic. Twitter Financial News Sentiment. https://huggingface.co/datasets/zeroshot/ twitter-financial-news-sentiment , 2022. Hug- ging Face dataset. Accessed: 2026-04-23

work page 2022

[47] [47]

NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper

NVIDIA. NVIDIA H100 Tensor Core GPU Archi- tecture Whitepaper. https://www.nvidia.com/en- us/data-center/h100/, 2026. Accessed: 2026-04- 23

work page 2026

[48] [48]

TensorRT-LLM

NVIDIA. TensorRT-LLM. https://github.com/ NVIDIA/TensorRT-LLM, 2026. GitHub repository. Ac- cessed: 2026-04-23

work page 2026

[49] [49]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, et al. Quality: Question answering with long input texts, yes! InProceedings of the 2022 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies...

work page 2022

[50] [50]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

work page 2024

[51] [51]

Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, and Xunliang Cai. Eps- moe: Expert pipeline scheduler for cost-efficient moe inference.arXiv preprint arXiv:2410.12247, 2024

work page arXiv 2024

[52] [52]

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is chat- gpt a general-purpose natural language processing task solver? InProceedings of the 2023 conference on em- pirical methods in natural language processing, pages 1339–1384, 2023

work page 2023

[53] [53]

Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache- centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

work page 2024

[54] [54]

Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale

Samyam Rajbhandari, Conglong Li, Zhewei Yao, Min- jia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. InInternational confer- ence on machine learning, pages 18332–18346. PMLR, 2022

work page 2022

[55] [55]

Zero: Memory optimizations toward train- ing trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward train- ing trillion parameter models. InSC20: international 15 conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

work page 2020

[56]

Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the international conference for high performance computing, networking, storage and analysis, pages 1–14, 2021.

[57]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[58]

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023.

[59]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[60]

Gursimran Singh, Timothy Yu, Haley Li, Cheng Chen, Hanieh Sadri, Qintao Zhang, Yu Zhang, Ying Xiong, Yong Zhang, and Zhenan Fan. Elasticmoe: An efficient auto scaling method for mixture-of-experts models. arXiv preprint arXiv:2510.02613, 2025.

[61]

Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. Text classification via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8990–9005, 2023.

[62]

Surge AI. The Toxicity Dataset. https://github.com/surge-ai/toxicity, 2022. GitHub repository. Accessed: 2026-04-23.

[63]

Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, and John Paul Shen. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 49–61. IEEE, 2025.

[64]

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Offloading-efficient moe model serving. arXiv e-prints, pages arXiv–2401, 2024.

[65]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[66]

Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, and Dhabaleswar K DK Panda. Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference. In 2024 IEEE International parallel and distributed processing symposium (IPDPS), pages 915–925. IEEE, 2024.

[67]

Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11608–11620, 2024.

[68]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022.

[69]

Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empowered recommendation approach. ACM Transactions on Information Systems, 43(5):1–37, 2026.

[70]

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Stoica. Blendserve: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv preprint arXiv:2411.16102, 2024.

[71]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024.

[72]

Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching. arXiv preprint arXiv:2412.03594, 2024.

[73]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024.

[74]

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.

[75]

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

Appendix A Scheduling Algorithm Pseudocode

Algorithm 1 summarizes one scheduling round of MoE-Prefill’s frontend (§7), integrating prefix-aware ...