ProTrain: Efficient LLM Training via Memory-Aware Techniques

Hanmei Yang; Hui Guan; Jin Zhou; Ramine Roane; Tongping Liu; Xiaoqun Wang; Yao Fu

arXiv: 2406.08334 · v2 · submitted 2024-06-12 · 💻 cs.DC · cs.AI · cs.LG · cs.PF


Pith reviewed 2026-05-23 23:47 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG · cs.PF

keywords LLM training · memory management · automatic configuration · cost models · runtime profiler · training throughput · resource constrained training


The pith

ProTrain automates memory policy configuration for LLM training using cost models from runtime profiling to raise throughput without manual tuning or accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProTrain as a system that automatically tailors memory management policies to specific model architectures and hardware by abstracting strategies into tunable parameters and searching for optimal settings via cost models. A runtime profiler supplies precise estimates of latency, memory usage, and I/O bandwidth to support these high-fidelity models. This matters for scaling LLM training in environments where memory pressure is the main limit and manual knob tuning adds overhead while risking poor utilization. The system leaves the training algorithm unchanged, so accuracy stays the same. Experiments report throughput improvements ranging from 1.43 times to 2.71 times over current state-of-the-art training systems.

Core claim

ProTrain provides automated memory management that reduces complex strategies to a small set of tunable parameters whose optimal values are located through cost-model search. The runtime profiler supplies the latency, memory, and bandwidth measurements needed to construct accurate cost models for any given model and hardware pair. Because the underlying training algorithm is untouched, model accuracy is preserved while hardware utilization improves.
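The search described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the two knobs (`checkpointed_blocks`, `offloaded_blocks`), the additive cost model, and the profile keys are hypothetical and are not taken from the paper. It only shows the general shape of the idea: enumerate a small parameter space, estimate memory and step time for each point, and keep the fastest configuration that fits the memory budget.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    # Hypothetical knobs, named for illustration only.
    checkpointed_blocks: int  # blocks whose activations are recomputed
    offloaded_blocks: int     # blocks whose activations are swapped to CPU


def estimate(cfg, n_blocks, profile):
    """Return (peak_memory_gb, step_time_s) from a toy additive cost model."""
    kept = n_blocks - cfg.checkpointed_blocks - cfg.offloaded_blocks
    memory_gb = profile["base_gb"] + kept * profile["act_gb_per_block"]
    step_time_s = (
        n_blocks * profile["block_time_s"]
        + cfg.checkpointed_blocks * profile["recompute_s"]
        + cfg.offloaded_blocks * profile["swap_s"]
    )
    return memory_gb, step_time_s


def search(n_blocks, profile, budget_gb):
    """Exhaustively pick the fastest configuration that fits the budget.

    Returns (config, step_time_s, memory_gb), or None if nothing fits.
    """
    best = None
    for ckpt, off in product(range(n_blocks + 1), repeat=2):
        if ckpt + off > n_blocks:
            continue  # a block cannot be both recomputed and offloaded
        cfg = Config(ckpt, off)
        memory_gb, step_time_s = estimate(cfg, n_blocks, profile)
        if memory_gb <= budget_gb and (best is None or step_time_s < best[1]):
            best = (cfg, step_time_s, memory_gb)
    return best
```

With only two integer knobs, exhaustive enumeration is cheap; whether the paper's search is exhaustive or guided is not stated in the abstract.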

What carries the argument

The runtime profiler that feeds precise latency, memory, and I/O measurements into cost models used to search over a small set of tunable memory-configuration parameters.
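As a rough picture of what a profiler contributes, the sketch below measures the median latency and peak host-memory use of a single callable using only the Python standard library. It is an assumption-laden stand-in: `profile_call` and `toy_block` are hypothetical names, and a profiler for GPU training would rely on device-side timers and allocator statistics rather than `time.perf_counter` and `tracemalloc`.

```python
import time
import tracemalloc


def profile_call(fn, *args, repeats=5):
    """Measure median wall-clock latency and peak Python heap use of fn."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        latencies.append(time.perf_counter() - start)
    latencies.sort()

    # Memory is traced in a separate run so tracing overhead
    # does not distort the latency samples above.
    tracemalloc.start()
    fn(*args)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "latency_s": latencies[len(latencies) // 2],
        "peak_bytes": peak_bytes,
    }


def toy_block(n):
    # Stand-in for one model block: allocate a list, then reduce it.
    data = [float(i) for i in range(n)]
    return sum(data)
```

Calling `profile_call(toy_block, 10_000)` returns a dictionary whose values could feed a cost model like the one a search would consult.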

If this is right

Training throughput rises between 1.43× and 2.71× relative to existing systems while model accuracy remains unchanged.
Engineering effort for memory tuning drops because configuration search replaces manual adjustment.
The same approach applies to different model sizes and hardware platforms without requiring system expertise.
Hardware utilization improves in memory-constrained settings because policies are chosen to match available resources.
No change to the training loss or optimizer is needed, so existing checkpoints and convergence behavior stay intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be applied to inference workloads where memory bandwidth similarly limits batch size.
Integration with dynamic hardware scaling during a run might further reduce idle time on multi-node clusters.
The parameter abstraction could serve as a template for automating other low-level systems knobs such as communication scheduling.
Lowering the expertise threshold may allow smaller teams to train larger models on the same hardware budget.

Load-bearing premise

The runtime profiler must deliver estimates accurate enough that the resulting cost models reliably identify configurations that outperform manual or default settings across models and hardware.
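One way to make this premise testable is to compare predicted and measured costs for the same set of configurations. The sketch below uses hypothetical helper names and is not the paper's validation method. It reports two things: the per-configuration relative error, and whether the cost model still picks the same winner as the measurements. The second check matters because a model can be numerically off yet preserve the ranking, or be close on average yet choose the wrong configuration.

```python
def relative_errors(predicted, measured):
    """Per-configuration relative error |pred - meas| / meas."""
    return [abs(p - m) / m for p, m in zip(predicted, measured)]


def picks_same_winner(predicted, measured):
    """True if the lowest predicted cost and lowest measured cost coincide."""
    best_pred = min(range(len(predicted)), key=predicted.__getitem__)
    best_meas = min(range(len(measured)), key=measured.__getitem__)
    return best_pred == best_meas
```

Both functions take parallel lists in which index `i` refers to the same configuration, for example predicted and measured step times in seconds.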

What would settle it

A controlled run on a held-out LLM and hardware combination in which ProTrain's selected configuration yields throughput no higher than the best manually tuned baseline system.
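The test above reduces to a single ratio. The sketch below expresses it with hypothetical function names, and any throughput values passed in would come from the controlled run, not from this page: divide the throughput of the automatically selected configuration by the best baseline throughput, and treat a ratio of 1.0 or less as a failed claim.

```python
def speedup_over_best_baseline(candidate_tput, baseline_tputs):
    """Ratio of candidate throughput to the strongest baseline's throughput."""
    return candidate_tput / max(baseline_tputs.values())


def claim_survives(candidate_tput, baseline_tputs):
    """The claim fails if the candidate is no faster than the best baseline."""
    return speedup_over_best_baseline(candidate_tput, baseline_tputs) > 1.0
```

Here `baseline_tputs` maps a baseline name to its measured throughput in the same units as `candidate_tput`, for example tokens per second.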

Figures

Figures reproduced from arXiv: 2406.08334 by Hanmei Yang, Hui Guan, Jin Zhou, Ramine Roane, Tongping Liu, Xiaoqun Wang, Yao Fu.

**Figure 2.** Block-Wise Activation Management Layout and Memory Usage Trend

**Figure 3.** Maximum Training Throughput on four RTX 3090 GPUs (upper) and A100 GPUs

**Figure 4.** Scalability of performance on RTX 3090 GPUs (a) Maximum throughput across different

**Figure 5.** Effectiveness of adaptive memory management on four RTX 3090 GPUs (a) Runtime

**Figure 6.** Scalability of performance on A100 GPUs (a) Maximum throughput across different

**Figure 7.** Comparison of Predicted vs. Actual Runtime and Peak Memory Usage for Various Models

Original abstract

Memory pressure has emerged as a dominant constraint in scaling the training of large language models (LLMs), particularly in resource-constrained environments. While modern frameworks incorporate various memory-saving techniques, they often expose low-level configuration knobs that require manual tuning and specialized system expertise. This not only adds engineering overhead but also risks suboptimal hardware utilization when misconfigured. This paper introduces ProTrain, a novel training system that automatically tailors memory management policies to the model architecture and underlying hardware resources, eliminating the need for manual intervention. The core of ProTrain is its automated memory management that abstracts complex memory management strategies into a few tunable configuration parameters, allowing searches for optimal parameter settings using cost models. ProTrain is equipped with a runtime profiler that provides precise estimates of latency, memory usage, and I/O bandwidth to build high-fidelity cost models. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43× to 2.71× compared to the state-of-the-art training systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ProTrain automates memory config search for LLM training via profiler-driven cost models, but the abstract gives no experimental details so the 1.43–2.71× claims cannot be checked.


The paper's main move is to wrap existing memory-saving tricks (activation checkpointing, offloading, etc.) into a small set of tunable parameters and let a runtime profiler feed cost models that pick the settings automatically. That removes the manual knob-twiddling that DeepSpeed and similar systems still require. The claim that accuracy stays untouched because the training algorithm itself is unchanged is straightforward and correct on its face. The profiler is presented as the key enabler, supplying latency, memory, and bandwidth numbers to build the models. If the full paper shows that those estimates are accurate enough across models and hardware, and that the search actually finds better points than expert tuning, then the automation angle is useful for people running on constrained clusters. The abstract, however, gives zero information on model sizes, hardware, baselines, or how the profiler was validated, so the throughput numbers sit unsupported. The stress-test concern about profiler error is exactly the load-bearing piece; without error bounds or cross-validation results, it is impossible to know whether the reported gains are real or just lucky configs. The work looks like a practical engineering layer on top of known techniques rather than a new algorithm. It is worth a serious referee if the full manuscript supplies the missing experimental controls and profiler accuracy data; otherwise it stays at the level of an unverified system description.

Referee Report

2 major / 0 minor

Summary. The paper introduces ProTrain, a system for LLM training that automates memory management by abstracting strategies into tunable parameters optimized via cost models built from a runtime profiler providing estimates of latency, memory usage, and I/O bandwidth. It claims to eliminate manual tuning while preserving training accuracy and reports throughput gains of 1.43× to 2.71× over state-of-the-art systems.

Significance. If the profiler accuracy and resulting configurations hold across models and hardware, the automation could reduce engineering overhead for memory optimization in constrained environments, improving accessibility of efficient LLM training.

major comments (2)

[Abstract] Abstract: the central claim of 1.43×–2.71× throughput improvements rests on the runtime profiler delivering 'precise estimates' for high-fidelity cost models that yield optimal configurations, yet no validation, error bounds, or search procedure details are provided to substantiate this.
[Experiments] Experiments (implied by abstract claims): no information is given on experimental setup, baselines, model sizes, hardware platforms, or how cost models were validated against actual runs, preventing assessment of whether the reported gains are supported by the data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that strengthening the presentation of profiler validation and experimental details will improve the paper and will make the requested revisions.


Referee: [Abstract] Abstract: the central claim of 1.43×–2.71× throughput improvements rests on the runtime profiler delivering 'precise estimates' for high-fidelity cost models that yield optimal configurations, yet no validation, error bounds, or search procedure details are provided to substantiate this.

Authors: The abstract is necessarily concise. The full manuscript describes the runtime profiler (Section 3) and how it supplies latency, memory, and bandwidth estimates to the cost models, along with the parameter search procedure (Section 4). We acknowledge that explicit error bounds and a dedicated validation subsection are not highlighted in the abstract. We will revise the abstract to reference the validation approach and add a short profiler-accuracy subsection with error bounds and search details in the main text.

revision: yes

Referee: [Experiments] Experiments (implied by abstract claims): no information is given on experimental setup, baselines, model sizes, hardware platforms, or how cost models were validated against actual runs, preventing assessment of whether the reported gains are supported by the data.

Authors: Section 5 of the manuscript presents the experimental results, including the models evaluated, hardware platforms, baselines, and throughput numbers. Cost-model fidelity is shown via the end-to-end gains. To address the concern directly, we will expand Section 5 with an explicit experimental-setup subsection, a hardware/model configuration table, and additional plots comparing predicted versus measured costs.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems paper with no derivations or self-referential claims


The paper describes an engineering system that uses runtime profiling to measure latency, memory, and bandwidth, then builds cost models from those direct measurements to search configuration parameters. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. Throughput gains are reported from experiments rather than derived from the cost models themselves, so the argument does not reduce to its inputs by construction and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering systems paper. No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5738 in / 1197 out tokens · 25203 ms · 2026-05-23T23:47:06.894674+00:00 · methodology


[52]

Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the paper

Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: We clearly include the main claims and contributions in the abstract and introduction. Guidelines: • The answer NA means that the abstract and introduction do not include the claims made in the pape...

work page
[53]

Limitations

Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Due to the page limit, we discuss the limitations in the appendix. Guidelines: • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the pa...

work page
[54]

Guidelines: • The answer NA means that the paper does not include theoretical results

Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [NA] 18 Justification: This paper does not involve any theoretical results. Guidelines: • The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, ...

work page
[55]

For compared baselines, we also detail the configurations in the appendix

Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main ex- perimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We detail the design and i...

work page
[56]

Guidelines: • The answer NA means that paper does not include experiments requiring code

Open access to data and code 19 Question: Does the paper provide open access to the data and code, with sufficient instruc- tions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: although the data and code are not open access, the paper provides compre- hensive descriptions of the me...

Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We specify the evaluation details in the main paper and appendix. Guidelines: • The answer NA means that the ...

Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: The metrics that we measured are stable. Guidelines: • The answer NA means that the paper does not include experiments. • The autho...

Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide the hardware information in the main paper and appendix. Guidelines: • The answer NA means t...

Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines? Answer: [Yes] Justification: We follow the NeurIPS Code of Ethics. Guidelines: • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. • If the authors answer ...

Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [Yes] Justification: We mention this only briefly in the conclusion and implicitly elsewhere. Guidelines: • The answer NA means that there is no societal impact of the work performed. • If the authors answe...

Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper does not have such risks. Guidelines: • The answer NA means that the paper poses no s...

Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We add reference for the owners of assets. Guidelines: • The answer NA means that the paper does...

New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: We do not release new assets. Guidelines: • The answer NA means that the paper does not release new assets. • Researchers should communicate the details of the dataset/code/model as part of their s...

Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research...

Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[2] O. Beaumont, L. Eyraud-Dubois, and A. Shilova, “Efficient combination of rematerialization and offloading for training dnns,” Advances in Neural Information Processing Systems, vol. 34, pp. 23844–23857, 2021.

[3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” ...

[4] C. Chen, X. Li, Q. Zhu, J. Duan, P. Sun, X. Zhang, and C. Yang, “Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 178–191.

[5] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” in International conference on machine learning. PMLR, 2020, pp. 1691–1703.

[6] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” arXiv preprint arXiv:1604.06174, 2016.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.

[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

[9] J. Fang, Z. Zhu, S. Li, H. Su, Y. Yu, J. Zhou, and Y. You, “Parallel training of pre-trained models via chunk-based dynamic memory management,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 1, pp. 304–315, 2022.

[10] Y. Feng, M. Xie, Z. Tian, S. Wang, Y. Lu, and J. Shu, “Mobius: Fine tuning large-scale models on commodity gpu servers,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 489–501.

[11] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang et al., “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021.

[12] S. H. Hashemi, S. Abdu Jyothi, and R. Campbell, “Tictac: Accelerating distributed deep learning with communication scheduling,” Proceedings of Machine Learning and Systems, vol. 1, pp. 418–430, 2019.

[13] J. Herrmann, O. Beaumont, L. Eyraud-Dubois, J. Hermann, A. Joly, and A. Shilova, “Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory,” arXiv preprint arXiv:1911.13214, 2019.

[14] C.-C. Huang, G. Jin, and J. Li, “Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 1341–1355.

[15] H. Huang, J. Fang, H. Liu, S. Li, and Y. You, “Elixir: Train a large language model on a small gpu cluster,” arXiv preprint arXiv:2212.05339, 2022.

[16] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, 2019.

[17] P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, J. Gonzalez, K. Keutzer, and I. Stoica, “Checkmate: Breaking the memory wall with optimal tensor rematerialization,” Proceedings of Machine Learning and Systems, vol. 2, pp. 497–511, 2020.

[18] A. Jangda, J. Huang, G. Liu, A. H. N. Sabet, S. Maleki, Y. Miao, M. Musuvathi, T. Mytkowicz, and O. Saarikivi, “Breaking the computation and communication abstraction barrier in distributed machine learning workloads,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 20...

[19] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.

[20] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.

[21] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” Proceedings of Machine Learning and Systems, vol. 5, 2023.

[22] T. D. Le, H. Imai, Y. Negishi, and K. Kawachiya, “Tflms: Large model support in tensorflow by graph rewriting,” arXiv preprint arXiv:1807.02037, 2018.

[23] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania et al., “Pytorch distributed: Experiences on accelerating data parallel training,” arXiv preprint arXiv:2006.15704, 2020.

[24] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, “Colossal-ai: A unified deep learning system for large-scale parallel training,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.

[25] Y. Li, A. Phanishayee, D. Murray, J. Tarnawski, and N. S. Kim, “Harmony: Overcoming the hurdles of gpu memory capacity to train massive dnn models on commodity servers,” arXiv preprint arXiv:2202.01306, 2022.

[26] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12009–12019.

[27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.

[28] K. Mahajan, C.-H. Chu, S. Sridharan, and A. Akella, “Better together: Jointly optimizing ML collective scheduling and execution planning using SYNDICATE,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 809–824.

[29] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.

[30] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: generalized pipeline parallelism for dnn training,” in Proceedings of the 27th ACM symposium on operating systems principles, 2019, pp. 1–15.

[31] NVIDIA, “Api documentation of apex optimizers,” https://nvidia.github.io/apex/optimizers.html, 2018.

[32] X. Peng, X. Shi, H. Dai, H. Jin, W. Ma, Q. Xiong, F. Yang, and X. Qian, “Capuchin: Tensor-based gpu memory management for deep learning,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 891... doi:10.1145/3373376.3378505

[33] Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, C. Wu, and C. Guo, “A generic communication scheduler for distributed dnn training acceleration,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 16–29.

[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.

[35] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.

[36] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[37] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16.

[38] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.

[39] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.

[40] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload: Democratizing billion-scale model training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564.

[41] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.

[42] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.

[43] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 7464–7473.

[44] X. Sun, W. Wang, S. Qiu, R. Yang, S. Huang, J. Xu, and Z. Wang, “Stronghold: fast and affordable billion-scale deep learning model training,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–17.

[45] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.

[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

[47] L. Wang, J. Ye, Y. Zhao, W. Wu, A. Li, S. L. Song, Z. Xu, and T. Kraska, “Superneurons: Dynamic gpu memory management for training deep neural networks,” in Proceedings of the 23rd ACM SIGPLAN symposium on principles and practice of parallel programming, 2018, pp. 41–53.

[48] Y. Wang, X. Han, W. Zhao, G. Zeng, Z. Liu, and M. Sun, “H3t: Efficient integration of memory optimization and parallelism for large-scale transformer training,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[49] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.

[50] X. Zhao, T. Le Hellard, L. Eyraud-Dubois, J. Gusak, and O. Beaumont, “Rockmate: an efficient, fast, automatic and generic tool for re-materialization in pytorch,” in International Conference on Machine Learning. PMLR, 2023, pp. 42018–42045.

[51] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer et al., “Pytorch fsdp: experiences on scaling fully sharded data parallel,” arXiv preprint arXiv:2304.11277, 2023.

A Implementation Details

ProTrain is implemented in Python, with a total of 6,000 lines of code. It provides very simple APIs...
