arxiv: 2605.21442 · v1 · pith:P6JXQAXTnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

torchtune: PyTorch native post-training library

Pith reviewed 2026-05-21 05:39 UTC · model grok-4.3

keywords: LLM post-training, fine-tuning, PyTorch, modular library, distributed training, model adaptation, hackability

The pith

torchtune is a PyTorch-native library for LLM post-training that delivers strong performance and memory efficiency through modularity and direct code access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents torchtune as a library built directly on PyTorch to handle the post-training stage for large language models, including fine-tuning and related workflows. It prioritizes a modular structure with easy access to core components so researchers can modify and extend the code quickly. The authors describe how this shows up in model builders, training recipes, and distributed training features, then compare it to other tools. A sympathetic reader would care because the design trades some ease of use for the ability to iterate on custom methods without being locked into black-box optimizations.

Core claim

We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings.

What carries the argument

Modular model builders, training recipes, and distributed training stack that expose direct PyTorch components for customization and extension.
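
The paper's visual abstract describes recipes that instantiate components from YAML and express experiments by swapping component paths. As a rough illustration of that pattern only, the sketch below builds objects from a dict that stands in for a parsed YAML file. The `component` key, the `resolve` and `build` helpers, and the config fields are hypothetical names chosen for this example, not torchtune's schema or API, and standard-library classes stand in for real model or optimizer components.

```python
import importlib
from typing import Any, Dict


def resolve(path: str) -> Any:
    """Import the object named by a dotted path such as 'fractions.Fraction'."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def build(spec: Dict[str, Any]) -> Any:
    """Instantiate spec['component'], passing the remaining keys as kwargs.

    Nested dicts that carry a 'component' key are built first, so a config
    tree turns into an object tree.
    """
    kwargs = {
        key: build(value) if isinstance(value, dict) and "component" in value else value
        for key, value in spec.items()
        if key != "component"
    }
    return resolve(spec["component"])(**kwargs)


# Stand-in for a parsed YAML recipe config (hypothetical fields).
config = {
    "component": "types.SimpleNamespace",
    "epochs": 1,
    "init_dist": {"component": "statistics.NormalDist", "mu": 0.0, "sigma": 0.02},
}
recipe = build(config)

# Swapping one component path changes what is built; build() is untouched.
swapped = dict(config)
swapped["init_dist"] = {"component": "types.SimpleNamespace", "low": -0.02, "high": 0.02}
recipe_swapped = build(swapped)
```

The design point this is meant to show is that the experiment lives in the config while the construction code stays fixed, which is one plausible reading of "preserving the recipe structure".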

If this is right

  • Post-training experiments become reproducible across different research environments using the same modular components.
  • New model architectures or training methods can be integrated and tested with minimal changes to the base library.
  • The same codebase supports both rapid prototyping and production-oriented deployment workflows.
  • Memory and speed results remain competitive with specialized frameworks while preserving full transparency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could increase contributions from academic groups that need to inspect and alter every stage of the pipeline.
  • It suggests a broader pattern for other PyTorch-based tools to favor native access over polished high-level APIs.
  • Future work might test whether the same modularity principles extend cleanly to earlier training stages or new hardware backends.

Load-bearing premise

Users will achieve better research outcomes by directly accessing and modifying core PyTorch code instead of relying on abstracted interfaces or specialized optimizations.

What would settle it

A side-by-side run on the same fine-tuning task and hardware where torchtune uses more memory or achieves lower downstream accuracy than Unsloth or Axolotl.

Figures

Figures reproduced from arXiv: 2605.21442 by Ariel Kwiatkowski, Evan Smothers, Felipe Mello, Joseph Cummings, Mark Obozov, Maxime Griot, Mircea Mironenco, Nathan Azrak, Philip John Bontrager, Rafi Ayub, Salman Mohammadi.

Figure 1
Figure 1: Visual abstract. torchtune recipes instantiate model, data, objective, optimizer, logging, and runtime components from YAML, then run them through a shared PyTorch training loop. Experiments are expressed by swapping component paths or runtime policies while preserving the recipe structure. [Surrounding paper text captured with the caption: "… context parallelism are supported unevenly across frameworks, often requiring different backends or different abstrac…"]
Figure 2
Figure 2: Timeline schematics for GRPO execution. “sync” denotes a parameter refresh from trainer to generator; “shard” denotes sharded-state transitions associated with distributed training. [Surrounding paper text captured with the caption: "… memory/throughput balance. 7 Asynchronous GRPO. GRPO (Shao et al., 2024) optimizes a policy using scalar rewards computed over groups of sampled generations. torchtune provides two distributed full fine-tuning recipes: grpo f…"]
Figure 3
Figure 3: Context Parallel memory trace. [Surrounding heading captured with the caption: "B Supported Models & Datasets"]
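
The text captured with Figure 2 describes GRPO (Shao et al., 2024) as optimizing a policy using scalar rewards computed over groups of sampled generations. A minimal sketch of the group-relative normalization this refers to follows: each reward is centered and scaled by the statistics of its own group. The function name, the use of the population standard deviation, and the `eps` guard are assumptions made for this example; this is not torchtune's recipe code.

```python
from statistics import fmean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one group of sampled generations.

    Each advantage is (reward - group mean) / (group std + eps). The eps term
    (an assumption here) avoids division by zero when all rewards are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled generations, binary rewards from a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sample gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group mean, no separate value model is needed to compute these advantages, which is the property that distinguishes this scheme from PPO-style estimators.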

Original abstract

Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLMs post-training research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces torchtune, a PyTorch-native library for LLM post-training. It describes design principles emphasizing modularity, hackability, and direct PyTorch access (in contrast to ease-of-use or specialized optimizations in other tools), details their realization in model builders, training recipes, and the distributed stack, and reports benchmark comparisons against Axolotl and Unsloth that demonstrate strong performance and memory efficiency while claiming sufficient flexibility for rapid research iteration.

Significance. If the empirical results hold under scrutiny, torchtune supplies a practical, extensible foundation for reproducible LLM post-training research. Its native PyTorch orientation and open design could lower barriers to custom experimentation and improve transparency compared with more opaque frameworks, directly supporting the community's need for hackable tools in multistage training pipelines.

major comments (1)
  1. Evaluation section: the central claim requires both strong performance/memory results and flexibility for rapid research iteration. Concrete benchmark numbers versus Axolotl and Unsloth support the former, but the latter rests solely on qualitative description of modularity and direct access; no quantitative evidence (developer time, lines changed, or iteration cycles for a custom extension) is provided, leaving the design-choice justification as the weakest link.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and describe the planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Evaluation section: the central claim requires both strong performance/memory results and flexibility for rapid research iteration. Concrete benchmark numbers versus Axolotl and Unsloth support the former, but the latter rests solely on qualitative description of modularity and direct access; no quantitative evidence (developer time, lines changed, or iteration cycles for a custom extension) is provided, leaving the design-choice justification as the weakest link.

    Authors: We agree that the current justification for flexibility is the weakest link in the evaluation, as it relies on qualitative descriptions of modularity and direct PyTorch access. Objective quantification of developer time or iteration cycles would require controlled user studies that are outside the scope of this library paper. To address the concern, we will revise the Evaluation section to include two concrete extension examples: (1) adding a custom loss function and (2) integrating a new data collator. For each, we will report the number of lines of code added or modified and the files touched, demonstrating the low effort required to extend the library while preserving its structure. These additions will provide the requested quantitative flavor without altering the paper's primary focus on performance and memory results.

    Revision: yes
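
    The "lines of code added or modified" measure proposed in the rebuttal can be computed mechanically. The sketch below is one possible way to do it with Python's standard `difflib`; the helper name and the two toy snippets (including `cross_entropy` and `my_custom_loss`) are hypothetical and do not come from the paper or from torchtune.

```python
import difflib
from typing import Tuple


def count_changed_lines(before: str, after: str) -> Tuple[int, int]:
    """Return (lines added, lines removed) between two versions of a file."""
    old, new = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = removed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1  # lines present only in the old version
        if tag in ("replace", "insert"):
            added += j2 - j1  # lines present only in the new version
    return added, removed


# Toy "recipe" before and after swapping in a custom loss (hypothetical code).
before = "loss_fn = cross_entropy\nlogits = model(tokens)\nloss = loss_fn(logits, labels)\n"
after = "loss_fn = my_custom_loss\nlogits = model(tokens)\nloss = loss_fn(logits, labels)\n"

added, removed = count_changed_lines(before, after)
```

    A count like this is only a proxy for extension effort: it says nothing about how long it took to find the right line to change, which is the part the referee's request for developer-time evidence is aimed at.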

Circularity Check

0 steps flagged

No circularity in library design and benchmark presentation

The paper introduces torchtune as a PyTorch-native library, describes its modular design principles and components (model builders, recipes, distributed stack), and reports empirical comparisons for performance and memory efficiency against Axolotl and Unsloth. No mathematical derivations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations are present. Claims rest on direct implementation descriptions and external benchmark results rather than any chain that reduces to its own inputs by construction. This is a standard non-circular outcome for a systems/software artifact paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library introduction paper rather than a mathematical or empirical scientific claim; no free parameters, axioms, or invented entities are required or introduced.

pith-pipeline@v0.9.0 · 5754 in / 1128 out tokens · 30907 ms · 2026-05-21T05:39:11.186834+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 10 internal anchors

  1. [1]

    Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms Towards

  2. [2]

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A Fast Learning Algorithm for Deep Belief Nets.

  3. [3]

    Deep learning. 2016.

  4. [4]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. 2023.

  5. [5]

    Axolotl: Open Source LLM Post-Training.

  6. [7]

    Daniel Han, Michael Han and Unsloth team.

  7. [8]

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz.

  8. [10]

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. GitHub repository.

  9. [11]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations Toward Training Trillion Parameter Models.

  10. [12]

    Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and Goyal, Anirudh and Hartshorn, Anthony and Yang, Aobo and Mitra, Archi and Sravankumar, Archie and Korenev, Artem and Hinsvark, Arthur and Rao, Arun and Zhang, Aston and ...

  11. [13]

    Edward J. Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. International

  12. [14]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems.

  13. [15]

    Omry Yadan. 2019.

  14. [16]

    Ring Attention with Blockwise Transformers for Near-Infinite Context. ArXiv.

  15. [17]

    Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

  16. [18]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu an...

  17. [19]

    Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

  18. [20]

    8-bit Optimizers via Block-wise Quantization. 9th International Conference on Learning Representations, ICLR.

  19. [22]

    TorchAO: PyTorch-Native Training-to-Serving Model Optimization. 2025.

  20. [23]

    Cut Your Losses in Large-Vocabulary Language Models. In The Thirteenth International Conference on Learning Representations.

  21. [24]

    Training language models to follow instructions with human feedback

    Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...

  22. [25]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

  23. [26]

    Proximal Policy Optimization Algorithms. 2017.

  24. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.

  25. [28]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs.

  26. [31]

    Training Deep Nets with Sublinear Memory Cost. 2016.

  27. [32]

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging

  28. [33]

    Albert Bou, Matteo Bettini, Sebastian Dittert, Vikash Kumar, Shagun Sodhani, Xiaomeng Yang, Gianni De Fabritiis, and Vincent Moens. TorchRL: A data-driven decision-making library for PyTorch.

  29. [34]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.

  30. [36]

    Efficient Sequence Packing Without Cross-Contamination: Accelerating Large Language Models Without Impacting Performance. 2021.

  31. [37]

    William Fedus, Barret Zoph, and Noam Shazeer. Journal of Machine Learning Research.

  32. [38]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.754

  33. [39]

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  34. [40]

    Distilling the Knowledge in a Neural Network. 2015.

  35. [41]

    2024.

  36. [43]

    April 2026.

  37. [46]

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, and 30 others. 2024. https://doi.org/10.1145/3620665.3640366 Pytorch 2: Fast...

  38. [47]

    Axolotl maintainers and contributors . 2023. https://github.com/axolotl-ai-cloud/axolotl Axolotl: Open source llm post-training

  39. [48]

    Albert Bou, Matteo Bettini, Sebastian Dittert, Vikash Kumar, Shagun Sodhani, Xiaomeng Yang, Gianni De Fabritiis, and Vincent Moens. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/07bc8125400bf4b140c332010756bd9b-Paper-Conference.pdf Torchrl: A data-driven decision-making library for pytorch . In International Conference on Learning Represen...

  40. [49]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. https://arxiv.org/abs/1604.06174 Training deep nets with sublinear memory cost . CoRR, abs/1604.06174

  41. [50]

    Michael Han Daniel Han and Unsloth team. 2023. http://github.com/unslothai/unsloth Unsloth

  42. [51]

    Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 2022. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR

  43. [52]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf Qlora: Efficient finetuning of quantized llms . In Advances in Neural Information Processing Systems, volume 36, pages 10088--10115. Curran Associates, Inc

  44. [53]

    William Falcon and The PyTorch Lightning team . 2019. https://doi.org/10.5281/zenodo.3828935 PyTorch Lightning

  45. [54]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

  46. [55]

    Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. http://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network . In NIPS Deep Learning and Representation Learning Workshop

  47. [56]

    Edward J. Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA : Low - Rank Adaptation of Large Language Models . In International Conference on Learning Representations

  48. [57]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  49. [58]

    Hanyu Lai, Xiao Liu, Junjie Gao, Jiale Cheng, Zehan Qi, Yifan Xu, Shuntian Yao, Dan Zhang, Jinhua Du, Zhenyu Hou, Xin Lv, Minlie Huang, Yuxiao Dong, and Jie Tang. 2025. https://doi.org/10.18653/v1/2025.acl-long.140 A survey of post-training scaling in large language models . In Proceedings of the 63rd Annual Meeting of the Association for Computational Li...

  50. [59]

    Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. https://api.semanticscholar.org/CorpusID:263608461 Ring attention with blockwise transformers for near-infinite context . ArXiv, abs/2310.01889

  51. [60]

    Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, Benjamin Bossan, and Marian Tietz. 2022. PEFT : State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft

  52. [61]

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. https://www.usenix.org/conference/osdi18/presentation/moritz Ray: A distributed framework for emerging AI applications . In 13th USENIX Symposium on Operating Systems Design and Imp...

  53. [62]

    Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, and Aleksandar Samardžić. 2025. https://arxiv.org/abs/2507.16099 Torchao: Pytorch-native training-to-serving model optimization . Preprint, arXiv:2507.16099

  54. [63]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

  55. [64]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, and 2 others. 2019. https://proceedings.neurips.cc/paper_files/paper/2019/...

  56. [65]

    PyTorch Community . 2023. https://github.com/pytorch/pytorch/issues/114299 PyTorch FSDP2 RFC . GitHub Issue

  57. [66]

    PyTorch Community . 2026. https://docs.pytorch.org/docs/2.12/distributed.tensor.html torch.distributed.tensor --- PyTorch 2.12 Documentation

  58. [67]

    Qwen Team . 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  59. [68]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf Direct preference optimization: Your language model is secretly a reward model . In Advances in Neural Information Processing Systems, ...

  60. [69]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1109/SC41405.2020.00024 Zero: Memory optimizations toward training trillion parameter models . In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16

  61. [70]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. https://doi.org/10.1145/3394486.3406703 Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters . In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, page 3505–3506, New York, NY, U...

  62. [71]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms . Preprint, arXiv:1707.06347

  63. [72]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

  64. [73]

    Philippe Tillet, H. T. Kung, and David Cox. 2019. https://doi.org/10.1145/3315508.3329973 Triton: an intermediate language and compiler for tiled neural network computations . In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, page 10–19, New York, NY, USA. Association for Computing Machinery

  65. [74]

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. 2020. Trl: Transformers reinforcement learning. https://github.com/huggingface/trl

  66. [75]

    Erik Wijmans, Brody Huval, Alexander Hertzberg, Vladlen Koltun, and Philipp Kraehenbuehl. 2025. https://openreview.net/forum?id=E4Fk3YuG56 Cut your losses in large-vocabulary language models . In The Thirteenth International Conference on Learning Representations

  67. [76]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

  68. [77]

    Omry Yadan. 2019. https://github.com/facebookresearch/hydra Hydra - a framework for elegantly configuring complex applications . Github

  69. [78]

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. 2023. https://arxiv.org/abs/2304.11277 Pytorch fsdp: Experiences on scaling fully sharded data parallel . Prepr...