pith. sign in

arxiv: 2605.28095 · v1 · pith:7K7YW5EXnew · submitted 2026-05-27 · 💻 cs.DC

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

Pith reviewed 2026-06-29 10:04 UTC · model grok-4.3

classification 💻 cs.DC
keywords LLM inferencedata parallelismKV cacheoffline workloadsmemory efficiencydistributed weightsthroughput optimization
0
0 comments X

The pith

SiDP raises offline LLM throughput up to 1.5x by expanding usable KV cache 1.8x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SiDP to address the memory tension in data-parallel LLM inference for offline workloads. Data parallelism normally replicates full model weights on every GPU, which leaves little room for the large KV caches needed to support big batch sizes that fully utilize compute. SiDP instead keeps weights as a distributed pool, with each layer owned by one GPU and accessed on demand by other replicas through two modes. This design frees substantial GPU memory for KV cache while preserving the scheduling independence of data parallelism. Evaluations on multiple GPU types and models show the resulting gains in capacity and throughput over standard data-parallel baselines.

Core claim

SiDP treats weights as a bandwidth-backed shared resource inside a DP group. Instead of storing the full model on every GPU, each layer is owned by a single GPU, and other replicas access its weights on demand via two complementary execution modes: a Weight-as-a-Service (WaS) mode that streams remote weights over NVLink into a small cache in the large-batch regime, and a Compute-as-a-Service (CaS) mode that ships activations to owners in the small-batch tail.

What carries the argument

The Weight-as-a-Service (WaS) and Compute-as-a-Service (CaS) modes that let replicas access distributed layer weights on demand without full replication.

If this is right

  • SiDP increases usable KV capacity by up to 1.8x under the same configurations.
  • This capacity increase converts into up to 1.5x higher end-to-end throughput over vLLM for offline workloads.
  • SiDP preserves data parallelism's independence and scheduling flexibility while reducing per-device weight storage.
  • The gains hold across Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B on H20, H200, and B200 GPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If NVLink bandwidth becomes saturated at larger scales, the WaS mode may need fallback to selective replication of hot layers.
  • The approach could extend to heterogeneous GPU clusters where slower interconnects make the CaS mode more attractive for certain batch sizes.
  • Adapting the ownership and streaming logic to models with heavy mixture-of-experts routing would require tracking which experts are accessed most often.

Load-bearing premise

The communication overhead from streaming weights over NVLink or shipping activations remains low enough not to negate the throughput gains from larger batches.

What would settle it

A direct measurement of end-to-end throughput on the same GPUs and models where weight-streaming or activation-shipping time exceeds the time saved by processing the extra batch size enabled by the larger KV cache.

Figures

Figures reproduced from arXiv: 2605.28095 by Alan Zhao, Cyril Y. He.

Figure 1
Figure 1. Figure 1: (a) Per-iter time (TP=2) with Llama-3.1-70B vs. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of SiDP architecture and runtime workflow within a data-parallel groups. and numerics are unchanged; SiDP only changes where FFN weights reside and how they are accessed. 4.2 Weight as a Service In the WaS mode, we exploit the fact that DP replicas on an NVLink-connected node share a high-bandwidth intercon￾nect but do not actually need ownership of the full model to execute a forward pass. The ke… view at source ↗
Figure 4
Figure 4. Figure 4: Memory layout and per-layer workflows of [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Available KV cache size (in tokens) of vLLM vs. SiDP under different parallelism on H20. vLLM cannot run Llama-3.1-70B and Qwen2.5-72B with TP=1, DP=8 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: End-to-end throughput with different models, parallelisms, and sequence lengths on H20. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: End-to-end throughput with different models, parallelisms, and sequence lengths on H200. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: End-to-end throughput with different models, parallelisms, and sequence lengths on B200. [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prefetch vs. T(B) (TP=2, S = 1K). Fetch-1 and 2 are the time to read entire FFN weights of Llama-3.1-70B on NVLink 4 and NVLink 5. decode iterations become compute-dominated, while the cost of weight prefetch stays bounded and can be hidden within T(B) on all three GPU generations. The experiment is intentionally conservative: the “Fetch” lines reflect streaming all FFN weights once, whereas the runtime on… view at source ↗
Figure 10
Figure 10. Figure 10: Impact of peak shifting with Qwen3-32B on H20, [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Iteration time of WaS/CaS/SiDP/vLLM with Llama￾3.1-70B on H20, TP=2, DP=2. grows to 3.4×. This confirms that avoiding incast on owners is necessary to translate WaS’s HBM savings into DP-scalable throughput on NVLink-connected nodes. Mode switching mitigates tail inefficiency [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 15
Figure 15. Figure 15: Batch/mode distribution of SiDP with Llama-3.1- 70B on H20, DP=8, S=4K. cutting time further to 19 s (another 24% reduction). Finally, “CaS V3” incorporates dummy skipping (§4.3), avoiding any P2P or GEMM work for replicas that only execute dummy iterations and reducing the tail from 19 s to 12 s (∼37%). Taken together, these optimizations improve a straightfor￾ward FSDP-like implementation by 2.8×. This … view at source ↗
Figure 14
Figure 14. Figure 14: End-to-end time of a single request under different [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗
read the original abstract

The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing deployments face a structural tension. Data parallelism (DP) scales throughput well but replicates model weights, leaving limited GPU memory for key-value (KV) cache and constraining batch size. Model parallelism reduces per-device weights, but requires fine-grained synchronization that erodes DP's independence and scheduling flexibility. We present SiDP, a memory-efficient data-parallel paradigm for offline LLM inference that treats weights as a bandwidth-backed shared resource inside a DP group. Instead of storing the full model on every GPU, SiDP organizes weights as a distributed pool: each layer is owned by a single GPU, and other replicas access its weights on demand via two complementary execution modes: a Weight-as-a-Service (WaS) mode that streams remote weights over NVLink into a small cache in the large-batch regime, and a Compute-as-a-Service (CaS) mode that ships activations to owners in the small-batch tail. Evaluated on NVIDIA H20, H200, and B200 GPUs with Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B, SiDP increases usable KV capacity by up to 1.8x under the same configurations, and converts this into up to 1.5x higher end-to-end throughput over baselines (vLLM) for offline workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes SiDP, a memory-efficient data-parallel paradigm for offline LLM inference. Instead of replicating full model weights on every GPU in a DP group, it organizes weights as a distributed pool where each layer is owned by one GPU; other replicas access them on demand via Weight-as-a-Service (WaS) mode (streaming weights over NVLink into a small cache for large batches) or Compute-as-a-Service (CaS) mode (shipping activations to weight owners for small-batch tails). Evaluated on H20/H200/B200 GPUs with Qwen3-32B, Qwen2.5-72B, and Llama-3.1-70B, it claims up to 1.8x usable KV capacity and 1.5x end-to-end throughput over vLLM baselines.

Significance. If the results hold under rigorous controls, SiDP would provide a practical middle ground between pure data parallelism and model parallelism for throughput-oriented offline inference, relaxing the KV-cache bottleneck while preserving DP scheduling flexibility. The empirical evaluation across three GPU generations and multiple model sizes is a positive feature; the approach also demonstrates concrete use of NVLink bandwidth to support weight sharing without full model parallelism.

major comments (1)
  1. [Evaluation] Evaluation section: the reported 1.8x KV-capacity and 1.5x throughput gains are presented without workload details (sequence lengths, batch-size distributions), exact vLLM baseline configurations, error bars or run counts, or direct measurements of NVLink/activation-shipping overhead in WaS and CaS modes. These omissions are load-bearing because the central claim rests on the premise that communication overhead remains low enough not to offset the batch-size gains; without the missing controls it is impossible to verify the numbers.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation. We agree that the current presentation of results lacks several key controls needed to fully substantiate the reported gains, and we will revise the manuscript to address these omissions directly.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 1.8x KV-capacity and 1.5x throughput gains are presented without workload details (sequence lengths, batch-size distributions), exact vLLM baseline configurations, error bars or run counts, or direct measurements of NVLink/activation-shipping overhead in WaS and CaS modes. These omissions are load-bearing because the central claim rests on the premise that communication overhead remains low enough not to offset the batch-size gains; without the missing controls it is impossible to verify the numbers.

    Authors: We agree that these details are essential for verifying the claims. In the revised manuscript we will expand the evaluation section with: (1) explicit workload specifications including sequence length distributions and batch-size histograms for all reported experiments; (2) precise vLLM baseline configurations (version, tensor-parallel degree, scheduling parameters, and memory settings); (3) error bars and the number of runs (minimum 5) for all throughput and capacity measurements; and (4) direct instrumentation results for NVLink bandwidth utilization and activation-shipping latency in both WaS and CaS modes, shown as a function of batch size to demonstrate that communication overhead does not negate the KV-cache capacity benefits. These additions will be placed in a new subsection and accompanying figures/tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluation

full rationale

The paper introduces SiDP as a systems technique (WaS/CaS modes for weight sharing in DP) and supports its claims exclusively via direct hardware measurements of KV capacity and end-to-end throughput on H20/H200/B200 GPUs with listed models. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All performance numbers are reported as observed outcomes under the implemented modes, making the argument self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on hardware assumptions about interconnect performance and workload batch-size distributions rather than fitted parameters or new physical entities.

axioms (2)
  • domain assumption NVLink provides sufficient bandwidth for on-demand weight streaming in the large-batch regime without becoming the dominant bottleneck
    Invoked to justify the WaS mode benefit.
  • domain assumption Workloads exhibit a batch-size distribution with a large-batch body and small-batch tail that the two modes can exploit complementarily
    Required for the overall throughput claim to hold across the full workload.
invented entities (2)
  • Weight-as-a-Service (WaS) mode no independent evidence
    purpose: Streams remote weights over NVLink into a small cache for large batches
    Newly proposed execution mode central to the memory-sharing design.
  • Compute-as-a-Service (CaS) mode no independent evidence
    purpose: Ships activations to weight-owning GPUs for the small-batch tail
    Newly proposed execution mode central to the memory-sharing design.

pith-pipeline@v0.9.1-grok · 5803 in / 1514 out tokens · 48275 ms · 2026-06-29T10:04:50.062135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 22 canonical work pages · 15 internal anchors

  1. [1]

    Taming throughput- latency tradeoff in llm inference with sarathi-serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tu- manov, and Ramachandran Ramjee. Taming throughput- latency tradeoff in llm inference with sarathi-serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association

  2. [2]

    Evaluating large language models trained on code, 2021

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. Evaluating large language models trained on code, 2021

  3. [3]

    Kunserve: Elastic and efficient large language model serving with parameter- centric memory management.arXiv e-prints, pages arXiv–2412, 2024

    Rongxin Cheng, Yifan Peng, Yuxin Lai, Xingda Wei, Rong Chen, and Haibo Chen. Kunserve: Elastic and efficient large language model serving with parameter- centric memory management.arXiv e-prints, pages arXiv–2412, 2024

  4. [4]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  5. [5]

    Improving the end-to-end efficiency of offline inference for multi-llm applications based on sampling and simu- lation.arXiv preprint arXiv:2503.16893, 2025

    Jingzhi Fang, Yanyan Shen, Yue Wang, and Lei Chen. Improving the end-to-end efficiency of offline inference for multi-llm applications based on sampling and simu- lation.arXiv preprint arXiv:2503.16893, 2025

  6. [6]

    Model tells you what to discard: Adaptive KV cache compression for LLMs

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Ji- awei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. In International Conference on Learning Representations (ICLR), 2024. 12

  7. [7]

    Glinthawk: A two-tiered architecture for offline llm inference.arXiv preprint arXiv:2501.11779, 2025

    Pouya Hamadanian and Sadjad Fouladi. Glinthawk: A two-tiered architecture for offline llm inference.arXiv preprint arXiv:2501.11779, 2025

  8. [8]

    Deferred continuous batching in resource-efficient large language model serving

    Yongjun He, Yao Lu, and Gustavo Alonso. Deferred continuous batching in resource-efficient large language model serving. InProceedings of the 4th Workshop on Machine Learning and Systems, pages 98–106, 2024

  9. [9]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Ja- cob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  10. [10]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework.arXiv preprint arXiv:2405.11143, 2024

  11. [11]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781. Association for Computational Linguistics, 2020

  12. [12]

    MT-eval: A multi-turn capabilities evaluation benchmark for large language models

    Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-eval: A multi-turn capabilities evaluation benchmark for large language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, Miami, Florida, USA, 2024. Association for...

  13. [13]

    Efficient memory man- agement for large language model serving with page- dattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with page- dattention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  14. [14]

    John, and Neeraja J

    Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Pra- soon Sinha, Jeeho Ryoo, Lizy K. John, and Neeraja J. Yadwadkar. Oneiros: Kv cache optimization through parameter remapping for multi-tenant LLM serving. In Proceedings of the ACM Symposium on Cloud Comput- ing (SoCC ’25), 2025

  15. [15]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Giannis Daras, Deep Ganguli, Dario Amodei Hernandez, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  16. [16]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 techni- cal report.arXiv preprint arXiv:2412.19437, 2024

  17. [17]

    Deepspeed, 2024

    Microsoft. Deepspeed, 2024. https://github.com /microsoft/DeepSpeed

  18. [18]

    Training language models to follow instructions with human feedback.Advances in Neural Information Pro- cessing Systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Pro- cessing Systems, 35:27730–27744, 2022

  19. [19]

    The carbon footprint of machine learning training will plateau, then shrink.Computer, 55(7):18–28, 2022

    David Patterson, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R So, Maud Texier, and Jeff Dean. The carbon footprint of machine learning training will plateau, then shrink.Computer, 55(7):18–28, 2022

  20. [20]

    Zero: Memory optimizations toward train- ing trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward train- ing trillion parameter models. InSC20: International Conference for High Performance Computing, Network- ing, Storage and Analysis, pages 1–16. IEEE, 2020

  21. [21]

    Zero-offload : Democ- ratizing billion-scale model training

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Am- inabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. Zero-offload : Democ- ratizing billion-scale model training. In2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564, 2021

  22. [22]

    Recipes for building an open-domain chatbot

    Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. Recipes for building an open-domain chatbot. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguis- tics: Main Volume, pages 300–325, 2021

  23. [23]

    Liu, and Christopher D

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada,

  24. [24]

    Association for Computational Linguistics

  25. [25]

    Flexgen: High-throughput generative inference of large language models with a single GPU.CoRR, abs/2303.06865, 2023

    Ying Sheng, Lianmin Li, Lianmin Zheng, et al. Flexgen: High-throughput generative inference of large language models with a single GPU.CoRR, abs/2303.06865, 2023

  26. [26]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter lan- guage models using model parallelism.arXiv preprint arXiv:1909.08053, 2019. 13

  27. [27]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team. Gemma: Open models based on gemini research and technology.CoRR, abs/2403.08295, 2024

  28. [28]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  29. [29]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Zhang, Hao Zhang, Haoyi Xu, Yu- tong Yang, Yuxiao Dong, Ziwei Liu, Yixin Wang, et al. V oyager: An open-ended embodied agent with large lan- guage models.arXiv preprint arXiv:2305.16291, 2023

  30. [30]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Pe- gah Alipoormolabashi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, et al. Self-instruct: Aligning language models with self-generated instructions.arXiv preprint arXiv:2212.10560, 2022

  31. [31]

    Pie: Pooling CPU memory for LLM inference

    Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Sto- ica. Pie: Pooling CPU memory for LLM inference. abs/2411.09317, 2024

  32. [32]

    Qwen2.5 Technical Report

    An Yang et al. Qwen2.5 technical report.CoRR, abs/2412.15115, 2024

  33. [33]

    Qwen3 Technical Report

    An Yang, Baosong Yang, and et al. Zhi- wei Zhang. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  34. [34]

    Orca: A distributed serving system for transformer-based generative mod- els

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soo- jeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative mod- els. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022

  35. [35]

    TD-pipe: Temporally-disaggregated pipeline parallelism architec- ture for high-throughput llm inference.arXiv preprint arXiv:2506.10470, 2025

    Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Jiangsu Du, Zhiguang Chen, and Yutong Lu. TD-pipe: Temporally-disaggregated pipeline parallelism architec- ture for high-throughput llm inference.arXiv preprint arXiv:2506.10470, 2025

  36. [36]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Sho- janazeri, Myle Ott, Sam Shleifer, et al. Pytorch FSDP: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277, 2023

  37. [37]

    BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

    Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, and Ion Sto- ica. Blendserve: Optimizing offline inference for auto- regressive large models with resource-aware batching. arXiv preprint arXiv:2411.16102, 2024

  38. [38]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model pro- grams.arXiv preprint arXiv:2312.07104, 2024

  39. [39]

    BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

    Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, and Gang Peng. Batchllm: Optimizing large batched llm inference with global prefix sharing and throughput-oriented token batching.arXiv preprint arXiv:2412.03594, 2024

  40. [40]

    StreamRL: Scalable, hetero- geneous, and elastic RL for LLMs with disaggregated stream generation.arXiv preprint arXiv:2504.15930, 2025

    Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, and Daxin Jiang. StreamRL: Scalable, hetero- geneous, and elastic RL for LLMs with disaggregated stream generation.arXiv preprint arXiv:2504.15930, 2025

  41. [41]

    RLHFuse: Effi- cient rlhf training for large language models with inter- and intra-stage fusion

    Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, and Xin Jin. RLHFuse: Effi- cient rlhf training for large language models with inter- and intra-stage fusion. InUSENIX Symposium on Net- worked Systems Design and Implementation (NSDI),

  42. [42]

    arXiv:2409.13221. 14