torchtune: PyTorch native post-training library

Ariel Kwiatkowski; Evan Smothers; Felipe Mello; Joseph Cummings; Mark Obozov; Maxime Griot; Mircea Mironenco; Nathan Azrak; Philip John Bontrager; Rafi Ayub; Salman Mohammadi

arxiv: 2605.21442 · v1 · pith:P6JXQAXTnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 05:39 UTC · model grok-4.3

keywords LLM post-training · fine-tuning · PyTorch · modular library · distributed training · model adaptation · hackability

The pith

torchtune is a PyTorch-native library for LLM post-training that delivers strong performance and memory efficiency through modularity and direct code access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents torchtune as a library built directly on PyTorch to handle the post-training stage for large language models, including fine-tuning and related workflows. It prioritizes a modular structure with easy access to core components so researchers can modify and extend the code quickly. The authors describe how this shows up in model builders, training recipes, and distributed training features, then compare it to other tools. A sympathetic reader would care because the design trades some ease of use for the ability to iterate on custom methods without being locked into black-box optimizations.

Core claim

We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings.

What carries the argument

Modular model builders, training recipes, and distributed training stack that expose direct PyTorch components for customization and extension.

If this is right

Post-training experiments become reproducible across different research environments using the same modular components.
New model architectures or training methods can be integrated and tested with minimal changes to the base library.
The same codebase supports both rapid prototyping and production-oriented deployment workflows.
Memory and speed results remain competitive with specialized frameworks while preserving full transparency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design could increase contributions from academic groups that need to inspect and alter every stage of the pipeline.
It suggests a broader pattern for other PyTorch-based tools to favor native access over polished high-level APIs.
Future work might test whether the same modularity principles extend cleanly to earlier training stages or new hardware backends.

Load-bearing premise

Users will achieve better research outcomes by directly accessing and modifying core PyTorch code instead of relying on abstracted interfaces or specialized optimizations.

What would settle it

A side-by-side run on the same fine-tuning task and hardware where torchtune uses more memory or achieves lower downstream accuracy than Unsloth or Axolotl.

Figures

Figures reproduced from arXiv: 2605.21442 by Ariel Kwiatkowski, Evan Smothers, Felipe Mello, Joseph Cummings, Mark Obozov, Maxime Griot, Mircea Mironenco, Nathan Azrak, Philip John Bontrager, Rafi Ayub, Salman Mohammadi.

**Figure 1.** Visual abstract. torchtune recipes instantiate model, data, objective, optimizer, logging, and runtime components from YAML, then run them through a shared PyTorch training loop. Experiments are expressed by swapping component paths or runtime policies while preserving the recipe structure.
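The pattern in the Figure 1 caption, components named by path in a config and then built and handed to a shared loop, can be sketched roughly as follows. This is a minimal illustration and not torchtune's implementation: the `instantiate` and `build_components` helpers, the `_component_` key as used here, and the standard-library classes standing in for models and optimizers are all assumptions chosen so the sketch runs without PyTorch or a YAML parser.

```python
import importlib
from typing import Any


def instantiate(spec: dict[str, Any]) -> Any:
    """Build an object from a spec holding a dotted path plus keyword arguments.

    Hypothetical helper: the "_component_" key name is an assumption made
    for this sketch.
    """
    kwargs = dict(spec)  # copy so the caller's config is not mutated
    dotted_path = kwargs.pop("_component_")
    module_name, _, attr_name = dotted_path.rpartition(".")
    factory = getattr(importlib.import_module(module_name), attr_name)
    return factory(**kwargs)


def build_components(config: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Instantiate every named component in a recipe-style config."""
    return {name: instantiate(spec) for name, spec in config.items()}


# Stand-in config; in a YAML file these would be nested mappings.
base_config = {
    "lr": {"_component_": "fractions.Fraction", "numerator": 1, "denominator": 10},
    "timeout": {"_component_": "datetime.timedelta", "seconds": 30},
    "metrics": {"_component_": "collections.Counter"},
}

# A new "experiment" swaps one component path and leaves the rest untouched.
variant_config = {
    **base_config,
    "metrics": {"_component_": "collections.OrderedDict"},
}

base = build_components(base_config)
variant = build_components(variant_config)
```

The point of the design is that the code consuming `base` or `variant` does not change between experiments; only the config entry does.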

**Figure 2.** Timeline schematics for GRPO execution. “sync” denotes a parameter refresh from trainer to generator; “shard” denotes sharded-state transitions associated with distributed training.
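For readers unfamiliar with GRPO, the step that gives the method its name is scoring each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that normalization, under stated assumptions: the function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative choices, not details taken from the paper or from torchtune's recipes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize scalar rewards within one group of sampled generations.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed stabilizer for groups where all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, scored by a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.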

**Figure 3.** Context Parallel memory trace.

Original abstract

Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLM post-training research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

torchtune is a modular PyTorch-native fine-tuning library with solid efficiency benchmarks against Axolotl and Unsloth, though its research flexibility advantage stays unmeasured.

Hi,

torchtune is a PyTorch-native post-training library that aims to make fine-tuning LLMs more modular and transparent. The main thing to know is that it shows competitive performance and memory efficiency in benchmarks against Axolotl and Unsloth, but the flexibility for research iteration is more of a design claim than a measured result.

The paper does a good job describing how the modularity is implemented in the model builders, recipes, and distributed training. This approach keeps everything close to raw PyTorch, which is the core difference from other frameworks that prioritize specialized optimizations or simpler interfaces. The evaluations cover representative post-training settings and provide numbers that support the efficiency side.

Where it falls short is in backing up the rapid research iteration angle. No data is given on things like the time needed to implement a custom extension or the number of lines needed for modifications. That part relies on the reader accepting that direct PyTorch access leads to faster iteration, which may or may not be true depending on the user's experience level.

This kind of paper is useful for people working on LLM adaptation who value control and extensibility over out-of-the-box simplicity. If you're doing custom fine-tuning work, the design details could save time by showing a clean way to structure things.

I would send this to peer review. It's a practical contribution with evidence on the performance claims, and feedback could help refine the flexibility argument.

Cheers,

Referee Report

1 major / 0 minor

Summary. The manuscript introduces torchtune, a PyTorch-native library for LLM post-training. It describes design principles emphasizing modularity, hackability, and direct PyTorch access (in contrast to ease-of-use or specialized optimizations in other tools), details their realization in model builders, training recipes, and the distributed stack, and reports benchmark comparisons against Axolotl and Unsloth that demonstrate strong performance and memory efficiency while claiming sufficient flexibility for rapid research iteration.

Significance. If the empirical results hold under scrutiny, torchtune supplies a practical, extensible foundation for reproducible LLM post-training research. Its native PyTorch orientation and open design could lower barriers to custom experimentation and improve transparency compared with more opaque frameworks, directly supporting the community's need for hackable tools in multistage training pipelines.

major comments (1)

Evaluation section: the central claim requires both strong performance/memory results and flexibility for rapid research iteration. Concrete benchmark numbers versus Axolotl and Unsloth support the former, but the latter rests solely on qualitative description of modularity and direct access; no quantitative evidence (developer time, lines changed, or iteration cycles for a custom extension) is provided, leaving the design-choice justification as the weakest link.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and describe the planned revisions to strengthen the manuscript.

Referee: Evaluation section: the central claim requires both strong performance/memory results and flexibility for rapid research iteration. Concrete benchmark numbers versus Axolotl and Unsloth support the former, but the latter rests solely on qualitative description of modularity and direct access; no quantitative evidence (developer time, lines changed, or iteration cycles for a custom extension) is provided, leaving the design-choice justification as the weakest link.

Authors: We agree that the current justification for flexibility is the weakest link in the evaluation, as it relies on qualitative descriptions of modularity and direct PyTorch access. Objective quantification of developer time or iteration cycles would require controlled user studies that are outside the scope of this library paper. To address the concern, we will revise the Evaluation section to include two concrete extension examples: (1) adding a custom loss function and (2) integrating a new data collator. For each, we will report the number of lines of code added or modified and the files touched, demonstrating the low effort required to extend the library while preserving its structure. These additions will provide the requested quantitative flavor without altering the paper's primary focus on performance and memory results.

revision: yes
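One way the proposed "lines of code added or modified" figure could be computed is by diffing a file before and after an extension. The sketch below is a hypothetical measurement helper built on Python's standard `difflib`; the function name and the sample snippets are invented for illustration and do not come from the paper or from torchtune.

```python
import difflib


def count_changed_lines(before: str, after: str) -> dict[str, int]:
    """Count lines added and removed between two versions of a source file."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the old version
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in the new version
    return {"added": added, "removed": removed}


# Hypothetical before/after versions of a small loss function.
before = """import math
def loss(p, t):
    return (p - t) ** 2
"""
after = """import math
def loss(p, t, w=1.0):
    err = p - t
    return w * err ** 2
"""

stats = count_changed_lines(before, after)
```

A modified line counts once as removed and once as added, which matches how most diff tools report changes; a per-file report would run this over each file touched by the extension.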

Circularity Check

0 steps flagged

No circularity in library design and benchmark presentation

The paper introduces torchtune as a PyTorch-native library, describes its modular design principles and components (model builders, recipes, distributed stack), and reports empirical comparisons for performance and memory efficiency against Axolotl and Unsloth. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations are present. Claims rest on direct implementation descriptions and external benchmark results rather than any chain that reduces to its own inputs by construction. This is a standard non-circular outcome for a systems/software artifact paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library introduction paper rather than a mathematical or empirical scientific claim; no free parameters, axioms, or invented entities are required or introduced.

[1] [1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page

[2] [2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page

[3] [3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016

[10] [12]

Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, Arthur and Rao, Arun and Zhang, Aston and ...

doi:10.48550/arxiv.2407.21783

[12] [14]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems.

[15] [17]

Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[29] [34]

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1145/3458817.3476209

[30] [36]

Efficient Sequence Packing Without Cross-Contamination: Accelerating Large Language Models Without Impacting Performance. 2021.

[31] [37]

William Fedus, Barret Zoph, and Noam Shazeer. Journal of Machine Learning Research.

[32] [38]

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.754

[33] [39]

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] [46]

Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, and 30 others. 2024. https://doi.org/10.1145/3620665.3640366 PyTorch 2: Fast...

[38] [47]

Axolotl maintainers and contributors. 2023. https://github.com/axolotl-ai-cloud/axolotl Axolotl: Open source LLM post-training

[39] [48]

Albert Bou, Matteo Bettini, Sebastian Dittert, Vikash Kumar, Shagun Sodhani, Xiaomeng Yang, Gianni De Fabritiis, and Vincent Moens. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/07bc8125400bf4b140c332010756bd9b-Paper-Conference.pdf TorchRL: A data-driven decision-making library for PyTorch. In International Conference on Learning Represen...

[40] [49]

Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. https://arxiv.org/abs/1604.06174 Training deep nets with sublinear memory cost. CoRR, abs/1604.06174

[41] [50]

Daniel Han, Michael Han, and Unsloth team. 2023. http://github.com/unslothai/unsloth Unsloth

[42] [51]

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR

[43] [52]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115. Curran Associates, Inc.

[44] [53]

William Falcon and The PyTorch Lightning team. 2019. https://doi.org/10.5281/zenodo.3828935 PyTorch Lightning

[45] [54]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

[46] [55]

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. http://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop

[47] [56]

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations

[48] [57]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

[49] [58]

Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, Xin Lv, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. https://doi.org/10.18653/v1/2025.acl-long.140 A survey of post-training scaling in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Li...

[50] [59]

Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. https://api.semanticscholar.org/CorpusID:263608461 Ring attention with blockwise transformers for near-infinite context. ArXiv, abs/2310.01889

[51] [60]

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz. 2022. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft

[52] [61]

Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. https://www.usenix.org/conference/osdi18/presentation/moritz Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Imp...

[53] [62]

Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, and Aleksandar Samardžić. 2025. https://arxiv.org/abs/2507.16099 TorchAO: PyTorch-native training-to-serving model optimization. Preprint, arXiv:2507.16099

[54] [63]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

[55] [64]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, and 2 others. 2019. https://proceedings.neurips.cc/paper_files/paper/2019/...

[56] [65]

PyTorch Community. 2023. https://github.com/pytorch/pytorch/issues/114299 PyTorch FSDP2 RFC. GitHub Issue

[57] [66]

PyTorch Community. 2026. https://docs.pytorch.org/docs/2.12/distributed.tensor.html torch.distributed.tensor - PyTorch 2.12 Documentation

[58] [67]

Qwen Team. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report. Preprint, arXiv:2505.09388

[59] [68]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, ...

[60] [69]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1109/SC41405.2020.00024 ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16

[61] [70]

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, page 3505–3506, New York, NY, U...

[62] [71]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms. Preprint, arXiv:1707.06347

[63] [72]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. Preprint, arXiv:2402.03300

[64] [73]

Philippe Tillet, H. T. Kung, and David Cox. 2019. https://doi.org/10.1145/3315508.3329973 Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, page 10–19, New York, NY, USA. Association for Computing Machinery

[65] [74]

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. TRL: Transformers reinforcement learning. https://github.com/huggingface/trl

[66] [75]

Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, and Philipp Kraehenbuehl. 2025. https://openreview.net/forum?id=E4Fk3YuG56 Cut your losses in large-vocabulary language models. In The Thirteenth International Conference on Learning Representations
