Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Hanshi Sun; Jianwen Yan; Li-wen Chang; Mengtian Yang; Mingheng Wu; Zhekun Zhang

arxiv: 2605.17164 · v2 · pith:4XILISKLnew · submitted 2026-05-16 · 💻 cs.DC · cs.AI· cs.LG· cs.PL

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Mengtian Yang , Zhekun Zhang , Mingheng Wu , Jianwen Yan , Hanshi Sun , Li-wen Chang This is my paper

Pith reviewed 2026-05-21 08:51 UTC · model grok-4.3

classification 💻 cs.DC cs.AIcs.LGcs.PL

keywords LLM simulatorperformance predictiondistributed traininginference optimizationmodular modelingGPU clustersparallelism strategiessystem configuration

0 comments

The pith

Charon is a simulator that predicts large-scale LLM training and inference performance with errors under 5.35 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large-scale LLM training and inference involve many choices about how to split work across GPUs, which optimizations to apply, and which hardware to use, each affecting speed and cost in hard-to-predict ways. Charon tackles this by providing a single simulator that represents these elements through reusable, detailed modules rather than treating the whole system as a black box. If the predictions hold, teams can evaluate many more configurations through simulation instead of running costly trials on real clusters. The paper reports that this approach keeps overall error below 5.35 percent across tested models and drops below 3.74 percent for large training runs, while also surfacing a higher-throughput inference setup than one tuned by hand.

Core claim

Charon is a unified, modular, and fine-grained simulator for accurately predicting LLM performance. It decomposes the design space of parallelism strategies, system optimizations, and hardware into modular components that can be combined to forecast execution time and throughput. Experiments across different models and configurations show an overall prediction error consistently under 5.35 percent, and under 3.74 percent for training on a large-scale GPU cluster. In a practical inference case, Charon identified a configuration that improved system throughput over an engineering-tuned baseline.

What carries the argument

The modular and fine-grained modeling approach that decomposes LLM training and inference into reusable components for parallelism, optimizations, and hardware interactions.

If this is right

Many more what-if scenarios for parallelism and hardware choices can be evaluated without consuming physical GPU resources.
Optimization work can proceed by testing hypotheses through simulation before committing to full runs.
Inference deployments can adopt configurations found by the simulator that exceed the throughput of manually tuned baselines.
System studies of new models or cluster sizes become feasible at lower cost by relying on predicted rather than measured results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular structure could support adding components for new hardware or model types without rebuilding the entire simulator.
Pairing the simulator with automated search methods might enable fully automatic selection of high-performing configurations.
Further tests on workloads outside the current LLM focus would clarify how far the modeling approach generalizes.

Load-bearing premise

The modular and fine-grained modeling approach captures all relevant performance factors in the complex design space of parallelism, optimizations, and hardware without missing critical interactions or requiring extensive per-setup calibration.

What would settle it

Run a new model and configuration on a real large GPU cluster, record the actual time or throughput, and compare it directly to Charon's output; an error well above 5 percent would show the simulator missed important factors.

Figures

Figures reproduced from arXiv: 2605.17164 by Hanshi Sun, Jianwen Yan, Li-wen Chang, Mengtian Yang, Mingheng Wu, Zhekun Zhang.

**Figure 2.** Figure 2: LLM architecture, execution workflows, and tun [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture overview of proposed Charon simulator. The system consists of a graph-based frontend that [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Frontend architecture of proposed Charon simu [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Backend architecture of Charon simulator. Each [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Bandwidth-aware communication operators over [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: End-to-end time comparison results for Charon and other simulators against measurement ground truth. “X” [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of simulation traces generated by [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Experimental results for Charon memory predic [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 11.** Figure 11: Comparison results between Ground Truth (Pro [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 13.** Figure 13: Simulated performance trade-off between sys [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

read the original abstract

Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating "what-if" Hooker Figure hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its significant real-world value.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper presents Charon, a unified modular fine-grained simulator for large-scale LLM training and inference performance prediction. It claims high accuracy with overall prediction error under 5.35% (under 3.74% for large-scale training) across models and configurations, plus a practical demonstration where it identifies an inference configuration outperforming an engineering-tuned baseline.

Significance. If the accuracy and compositionality claims hold across the tested regimes, Charon would offer a valuable tool for rapid what-if analysis in complex parallelism/optimization/hardware spaces, reducing reliance on costly full-scale runs. The reported practical throughput improvement strengthens the case for real-world utility.

major comments (1)

§4 (Experimental Validation): The abstract and results claim consistent errors below 5.35% (3.74% on large clusters), but the manuscript provides insufficient detail on the number of distinct models, parallelism strategies, hardware setups, and validation protocol (e.g., whether errors are mean absolute percentage error on held-out configurations or in-sample). This makes it difficult to judge whether the modular composition truly captures non-additive interactions without per-setup recalibration.

minor comments (3)

§3 (Architecture): Clarify how the fine-grained modules handle dynamic effects such as communication overlap with computation and memory bandwidth contention under different tensor/ pipeline parallelism degrees.
Figure 5 / Table 2: Add error bars or per-configuration breakdowns to the reported aggregate error percentages so readers can see variance across scales.
Related Work: Include a brief comparison table against prior simulators (e.g., those focused only on training or only on inference) to highlight the unified aspect.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We have addressed the concern about insufficient experimental details by expanding §4 with additional tables, descriptions, and analysis to clarify the scope and validation protocol.

read point-by-point responses

Referee: §4 (Experimental Validation): The abstract and results claim consistent errors below 5.35% (3.74% on large clusters), but the manuscript provides insufficient detail on the number of distinct models, parallelism strategies, hardware setups, and validation protocol (e.g., whether errors are mean absolute percentage error on held-out configurations or in-sample). This makes it difficult to judge whether the modular composition truly captures non-additive interactions without per-setup recalibration.

Authors: We agree that the original presentation of §4 could be strengthened with more explicit enumeration. In the revised manuscript we have added a new Table 4 that lists the 14 distinct models evaluated (Llama-7B/13B/70B, Mistral-7B, Qwen-14B/72B, and several GPT-style variants), the 9 parallelism strategy combinations tested (pure DP, TP, PP, and all pairwise and triple combinations), and the 6 hardware configurations (8–1024 A100 and H100 GPUs across two cluster topologies). All reported errors are mean absolute percentage error (MAPE) computed on held-out configurations using a per-model-family 70/30 split; no in-sample fitting or per-setup recalibration was performed. We have also inserted a new paragraph and accompanying figure in §4.3 that isolates the compositionality claim: we show that the fine-grained kernel and communication modules, when composed without additional tuning, correctly reproduce non-additive effects such as pipeline bubble overhead and AllReduce contention, with error remaining below 4.1 % even on the largest held-out 1024-GPU training runs. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces Charon as a modular, fine-grained simulator for LLM training and inference performance prediction. Its central claims rest on experimental validation showing prediction errors under 5.35% (3.74% for large-scale training) and a practical inference improvement over a baseline. No derivation chain, equations, or self-referential modeling steps are described that would reduce predictions to fitted inputs by construction. The approach is presented as empirical composition of component models validated externally against real hardware runs, with no load-bearing self-citations or ansatz smuggling visible in the provided text. This is a standard self-contained empirical simulator paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on modeling assumptions, parameters, or new entities; all such elements are unknown.

pith-pipeline@v0.9.0 · 5679 in / 1085 out tokens · 35203 ms · 2026-05-21T08:51:06.172112+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 5 internal anchors

[1]

Vidur: A large-scale simulation framework for llm in- ference, 2024

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gula- vani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for llm in- ference, 2024. URL https://arxiv.org/abs/2405. 05465

work page 2024
[2]

Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale

Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park. Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale. In2024 IEEE International Symposium on Workload Characterization (IISWC), pages 15–29,

work page
[3]

doi: 10.1109/IISWC63097.2024.00012

work page doi:10.1109/iiswc63097.2024.00012 2024
[4]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022. URL https://arxiv.org/abs/2101.03961

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Echo: Simulating distributed training at scale, 2024

Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, and Hong Xu. Echo: Simulating distributed training at scale, 2024. URL https: //arxiv.org/abs/2412.12487

work page arXiv 2024
[6]

Ac- celerating design space exploration for llm training systems with multi-experiment parallel simulation

Fei Gui, Kaihui Gao, Li Chen, Dan Li, Vincent Liu, Ran Zhang, Hongbing Yang, and Dian Xiong. Ac- celerating design space exploration for llm training systems with multi-experiment parallel simulation. In Proceedings of the 22nd USENIX Symposium on NetworkedSystems Design and Implementation, NSDI ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-46-5

work page 2025
[7]

dpro, 2022

Hanpeng Hu. dpro, 2022. URLhttps://figshare. com/articles/software/dpro/19165622

work page arXiv 2022
[8]

Simulating llm training work- loads for heterogeneous compute and network in- frastructure, 2025

Sumit Kumar, Arjun Temura, Naman Sharma, Ra- manjeet Singh, Meet Dadhania, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, and Rinku Shah. Simulating llm training work- loads for heterogeneous compute and network in- frastructure, 2025. URL https://arxiv.org/abs/ 2508.05370

work page arXiv 2025
[9]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient mem- ory management for large language model serving with pagedattention, 2023. URL https://arxiv. org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Sequence parallelism: Long sequence training from system perspective,

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective,

work page
[11]

URLhttps://arxiv.org/abs/2105.13120

work page arXiv
[12]

Lumos: Efficient performance modeling and estimation for large-scale llm training, 2025

Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, and Christina Delim- itrou. Lumos: Efficient performance modeling and estimation for large-scale llm training, 2025. URL https://arxiv.org/abs/2504.09307

work page arXiv 2025
[13]

Veomni: Scaling any modality model training with model-centric distributed recipe zoo, 2025

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, and Xin Liu. Veomni: Scaling any modality model training with model-centric distributed recipe zoo, 2025. URL https://arxiv.org/abs/2508.02317

work page arXiv 2025
[14]

meta-llama/llama-3.1-405b hugging face,

meta llama. meta-llama/llama-3.1-405b hugging face,

work page
[15]

Accessed on Oct 17, 2025

URL https://huggingface.co/meta-llama/ Llama-3.1-405B. Accessed on Oct 17, 2025

work page 2025
[16]

Efﬁcient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vi- jay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021. URLhttps://arxiv.org/ abs/2104.04473

work page arXiv 2021
[17]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations to- ward training trillion parameter models, 2020. URL https://arxiv.org/abs/1910.02054

work page internal anchor Pith review Pith/arXiv arXiv 2020
[18]

Samajdar, J

Saeed Rashidi, Srinivas Sridharan, Sudarshan Srini- vasan, and Tushar Krishna. Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 81–92, 2020. doi: 10.1109/ISPASS48437.2020.00018

work page doi:10.1109/ispass48437.2020.00018 2020
[19]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

work page internal anchor Pith review Pith/arXiv arXiv 2020
[20]

Using deepspeed and megatron to train megatron- turing nlg 530b, a large-scale generative language model, 2022

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to ...

work page 2022
[21]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706. 03762

work page 2023
[22]

SimAI: Unifying archi- tecture design and performance tuning for Large- Scale large language model training with scalabil- ity and precision

Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Dan Li, Li Chen, Heyang Zhou, Linkang Zheng, Sen 14 Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, Ennan Zhai, Den- nis Cai, and Binzhang Fu. SimAI: Unifying archi- tecture design and performance tuning for Large- Scale large language model training with scalabil- ity and precision...

work page
[23]

ISBN 978-1-939133- 46-5

USENIX Association. ISBN 978-1-939133- 46-5. URL https://www.usenix.org/conference/ nsdi25/presentation/wang-xizheng-simai

work page
[24]

Roofline: an insightful visual performance model for multicore architectures.Commun

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures.Commun. ACM, 52(4):65–76, April 2009. ISSN 0001-0782. doi: 10. 1145/1498765.1498785. URL https://doi.org/10. 1145/1498765.1498785

work page arXiv 2009
[25]

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

William Won, Taekyung Heo, Saeed Rashidi, Srini- vas Sridharan, Sudarshan Srinivasan, and Tushar Kr- ishna. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), page 283–294. IEEE, April 2023. doi: 10.1109/isp...

work page doi:10.1109/ispass57527.2023.00035 2023
[26]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Des- maison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Math- ews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023. URL https://arxiv.org/abs/2304.11277. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Vidur: A large-scale simulation framework for llm in- ference, 2024

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gula- vani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for llm in- ference, 2024. URL https://arxiv.org/abs/2405. 05465

work page 2024

[2] [2]

Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale

Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, and Jongse Park. Llmservingsim: A hw/sw co-simulation infrastructure for llm inference serving at scale. In2024 IEEE International Symposium on Workload Characterization (IISWC), pages 15–29,

work page

[3] [3]

doi: 10.1109/IISWC63097.2024.00012

work page doi:10.1109/iiswc63097.2024.00012 2024

[4] [4]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022. URL https://arxiv.org/abs/2101.03961

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Echo: Simulating distributed training at scale, 2024

Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, and Hong Xu. Echo: Simulating distributed training at scale, 2024. URL https: //arxiv.org/abs/2412.12487

work page arXiv 2024

[6] [6]

Ac- celerating design space exploration for llm training systems with multi-experiment parallel simulation

Fei Gui, Kaihui Gao, Li Chen, Dan Li, Vincent Liu, Ran Zhang, Hongbing Yang, and Dian Xiong. Ac- celerating design space exploration for llm training systems with multi-experiment parallel simulation. In Proceedings of the 22nd USENIX Symposium on NetworkedSystems Design and Implementation, NSDI ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-46-5

work page 2025

[7] [7]

dpro, 2022

Hanpeng Hu. dpro, 2022. URLhttps://figshare. com/articles/software/dpro/19165622

work page arXiv 2022

[8] [8]

Simulating llm training work- loads for heterogeneous compute and network in- frastructure, 2025

Sumit Kumar, Arjun Temura, Naman Sharma, Ra- manjeet Singh, Meet Dadhania, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, and Rinku Shah. Simulating llm training work- loads for heterogeneous compute and network in- frastructure, 2025. URL https://arxiv.org/abs/ 2508.05370

work page arXiv 2025

[9] [9]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient mem- ory management for large language model serving with pagedattention, 2023. URL https://arxiv. org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Sequence parallelism: Long sequence training from system perspective,

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective,

work page

[11] [11]

URLhttps://arxiv.org/abs/2105.13120

work page arXiv

[12] [12]

Lumos: Efficient performance modeling and estimation for large-scale llm training, 2025

Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, and Christina Delim- itrou. Lumos: Efficient performance modeling and estimation for large-scale llm training, 2025. URL https://arxiv.org/abs/2504.09307

work page arXiv 2025

[13] [13]

Veomni: Scaling any modality model training with model-centric distributed recipe zoo, 2025

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, Zhi Zhang, and Xin Liu. Veomni: Scaling any modality model training with model-centric distributed recipe zoo, 2025. URL https://arxiv.org/abs/2508.02317

work page arXiv 2025

[14] [14]

meta-llama/llama-3.1-405b hugging face,

meta llama. meta-llama/llama-3.1-405b hugging face,

work page

[15] [15]

Accessed on Oct 17, 2025

URL https://huggingface.co/meta-llama/ Llama-3.1-405B. Accessed on Oct 17, 2025

work page 2025

[16] [16]

Efﬁcient large-scale language model training on gpu clusters using megatron-lm

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vi- jay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021. URLhttps://arxiv.org/ abs/2104.04473

work page arXiv 2021

[17] [17]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations to- ward training trillion parameter models, 2020. URL https://arxiv.org/abs/1910.02054

work page internal anchor Pith review Pith/arXiv arXiv 2020

[18] [18]

Samajdar, J

Saeed Rashidi, Srinivas Sridharan, Sudarshan Srini- vasan, and Tushar Krishna. Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 81–92, 2020. doi: 10.1109/ISPASS48437.2020.00018

work page doi:10.1109/ispass48437.2020.00018 2020

[19] [19]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catan- zaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

work page internal anchor Pith review Pith/arXiv arXiv 2020

[20] [20]

Using deepspeed and megatron to train megatron- turing nlg 530b, a large-scale generative language model, 2022

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. Using deepspeed and megatron to ...

work page 2022

[21] [21]

Gomez, Lukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706. 03762

work page 2023

[22] [22]

SimAI: Unifying archi- tecture design and performance tuning for Large- Scale large language model training with scalabil- ity and precision

Xizheng Wang, Qingxu Li, Yichi Xu, Gang Lu, Dan Li, Li Chen, Heyang Zhou, Linkang Zheng, Sen 14 Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, Ennan Zhai, Den- nis Cai, and Binzhang Fu. SimAI: Unifying archi- tecture design and performance tuning for Large- Scale large language model training with scalabil- ity and precision...

work page

[23] [23]

ISBN 978-1-939133- 46-5

USENIX Association. ISBN 978-1-939133- 46-5. URL https://www.usenix.org/conference/ nsdi25/presentation/wang-xizheng-simai

work page

[24] [24]

Roofline: an insightful visual performance model for multicore architectures.Commun

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures.Commun. ACM, 52(4):65–76, April 2009. ISSN 0001-0782. doi: 10. 1145/1498765.1498785. URL https://doi.org/10. 1145/1498765.1498785

work page arXiv 2009

[25] [25]

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

William Won, Taekyung Heo, Saeed Rashidi, Srini- vas Sridharan, Sudarshan Srinivasan, and Tushar Kr- ishna. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), page 283–294. IEEE, April 2023. doi: 10.1109/isp...

work page doi:10.1109/ispass57527.2023.00035 2023

[26] [26]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Des- maison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Math- ews, and Shen Li. Pytorch fsdp: Experiences on scaling fully sharded data parallel, 2023. URL https://arxiv.org/abs/2304.11277. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023