Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
Pith reviewed 2026-05-22 21:00 UTC · model grok-4.3
The pith
Throughput measurements show how 3D parallelism configurations and flash attention affect training speed for 7B models on HPC systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Measured throughput data across varied 3D parallelism settings during training of a 7B-parameter model, together with the performance impact of flash attention, constitute the central empirical contribution.
What carries the argument
3D parallelism (the combination of data, tensor, and pipeline parallelism) augmented by the flash attention optimization.
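As a rough illustration of how the three dimensions combine, the sketch below enumerates the ways a fixed number of GPUs can be split into data-, tensor-, and pipeline-parallel degrees whose product equals the GPU count. The function name, the 64-GPU example, and the cap of 4 on the tensor-parallel degree (a hypothetical stand-in for the number of GPUs in one node) are assumptions for illustration, not values taken from the report.

```python
def parallel_layouts(world_size, max_tp=4):
    """Enumerate (dp, tp, pp) triples whose product equals world_size.

    max_tp is a hypothetical cap on the tensor-parallel degree, for
    example the number of GPUs in one node, so that tensor-parallel
    communication stays inside a node.
    """
    layouts = []
    for tp in range(1, max_tp + 1):
        if world_size % tp:
            continue  # tp must divide the total GPU count
        rest = world_size // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                # the remaining factor becomes the data-parallel degree
                layouts.append((rest // pp, tp, pp))
    return layouts


if __name__ == "__main__":
    for dp, tp, pp in parallel_layouts(64):
        print(f"dp={dp:3d} tp={tp} pp={pp:3d}")
```

A real training setup would likely prune this list further, for instance by requiring that the pipeline degree divides the number of transformer layers; that constraint is omitted here to keep the sketch short.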
If this is right
- Particular mixes of the three parallelism dimensions deliver higher tokens processed per second than others.
- Enabling flash attention reduces memory use and raises overall training throughput.
- Systematic profiling during runs identifies the dominant bottlenecks in the training pipeline.
- Careful management of the software environment and job scheduler reduces lost compute time.
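The first two points above can be made concrete with a small sketch. It computes tokens per second from the global batch size, sequence length, and step time, and estimates the size of the full attention score matrix that a standard attention implementation writes to GPU memory and that flash attention avoids materializing. All function names and numbers are placeholders chosen for illustration; none are measurements from the report.

```python
def tokens_per_second(global_batch_size, seq_len, step_time_s):
    """Tokens processed per second, given the time of one optimizer step."""
    return global_batch_size * seq_len / step_time_s


def attention_scores_bytes(micro_batch, num_heads, seq_len, bytes_per_elem=2):
    """Bytes needed to store the seq_len x seq_len score matrix of one layer.

    bytes_per_elem=2 assumes 16-bit floating point (an assumption).
    """
    return micro_batch * num_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical values: 512 sequences of 4096 tokens, 8 s per step.
    print(tokens_per_second(512, 4096, 8.0))   # 262144.0
    # Hypothetical values: micro-batch 1, 32 heads, 4096 tokens.
    print(attention_scores_bytes(1, 32, 4096) / 2**30, "GiB")  # 1.0 GiB
```

The score matrix grows quadratically with sequence length, which is why avoiding it frees memory that can instead go to larger micro-batches or less activation recomputation.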
Where Pith is reading between the lines
- The same measurement approach could be applied to models larger than 7B parameters to test whether the same configuration trends hold.
- Repeating the benchmarks on a different hardware generation would reveal how portable the observed optimal settings are.
- The reported numbers supply a baseline for estimating total compute cost when planning multilingual model training campaigns.
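The cost-planning idea in the last point can be sketched as simple arithmetic: a per-GPU throughput figure converts a token budget into GPU-hours and, for a given allocation, into wall-clock days. The function names and every number below are hypothetical; the sketch also assumes throughput stays constant over the run and ignores restarts and queue time.

```python
def gpu_hours(total_tokens, tokens_per_second_per_gpu):
    """GPU-hours needed to process total_tokens at a fixed per-GPU rate."""
    return total_tokens / tokens_per_second_per_gpu / 3600.0


def wall_clock_days(gpu_hours_total, num_gpus):
    """Wall-clock days if num_gpus run concurrently without interruption."""
    return gpu_hours_total / num_gpus / 24.0


if __name__ == "__main__":
    # Hypothetical planning inputs, not values from the report.
    hours = gpu_hours(total_tokens=1e12, tokens_per_second_per_gpu=2000.0)
    print(f"{hours:,.0f} GPU-hours")
    print(f"{wall_clock_days(hours, num_gpus=512):.1f} days on 512 GPUs")
```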
Load-bearing premise
The throughput values recorded on this hardware and model size will remain representative on other hardware or with models of different sizes.
What would settle it
Repeating the exact same parallelism configurations on a second HPC cluster and obtaining throughput rankings that differ from those reported here.
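One way to check whether rankings differ, sketched under the assumption that both clusters were benchmarked with the same set of configuration labels, is to list the pairs of configurations that the two systems order differently. The configuration names and throughput values are invented for illustration.

```python
from itertools import combinations


def ranking(throughput):
    """Configuration names ordered from fastest to slowest."""
    return sorted(throughput, key=throughput.get, reverse=True)


def discordant_pairs(a, b):
    """Pairs of configurations that clusters a and b order differently.

    Both dicts must share the same keys; ties count as agreement.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(a), 2)
        if (a[x] - a[y]) * (b[x] - b[y]) < 0
    ]


if __name__ == "__main__":
    # Hypothetical tokens/s per configuration on two clusters.
    cluster_a = {"dp16_tp2_pp2": 3.0e5, "dp8_tp4_pp2": 2.0e5, "dp4_tp4_pp4": 1.0e5}
    cluster_b = {"dp16_tp2_pp2": 2.5e5, "dp8_tp4_pp2": 3.5e5, "dp4_tp4_pp4": 1.0e5}
    print(ranking(cluster_a))
    print(ranking(cluster_b))
    print(discordant_pairs(cluster_a, cluster_b))
```

An empty list of discordant pairs would mean the second cluster reproduces the ordering; any entry marks a pair whose relative speed flipped.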
Original abstract
The training of large language models (LLMs) requires substantial computational resources, complex software stacks, and carefully designed workflows to achieve scalability and efficiency. This report presents best practices and insights gained from the OpenGPT-X project, a German initiative focused on developing open, multilingual LLMs optimized for European languages. We detail the use of high-performance computing (HPC) systems, primarily JUWELS Booster at JSC, for training Teuken-7B, a 7-billion-parameter transformer model. The report covers system architecture, training infrastructure, software choices, profiling and benchmarking tools, as well as engineering and operational challenges. It includes measured throughput data of various configurations of 3D parallelism during training and the impact of features such as flash attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents best practices from the OpenGPT-X project for training LLMs on HPC systems, focusing on the training of the Teuken-7B model on JUWELS Booster. It details the infrastructure, software stack, 3D parallelism strategies, flash attention implementation, and includes specific measured throughput numbers for different parallelism configurations and feature impacts.
Significance. The provision of real-world measured data from a production-scale training run is a strength, offering practical insights into scaling LLM training on European HPC resources. This can inform similar projects, especially those emphasizing open and multilingual models. The report's value lies in its concrete examples rather than theoretical derivations.
major comments (2)
- [Section on 3D Parallelism Configurations] The throughput measurements for different 3D parallelism setups are presented without error bars or details on experimental repetitions. This weakens the ability to confidently recommend specific configurations as best practices based on the data.
- [Discussion of Best Practices] The generalization of the reported practices beyond the JUWELS Booster system and 7B model size is not addressed. Since the central claim is to provide best practices, evidence or discussion on how these would apply to other hardware or scales is needed to support the claim.
minor comments (1)
- [Abstract] Consider adding one or two key quantitative results to the abstract to better convey the paper's empirical contributions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results and claims.
- Referee: The throughput measurements for different 3D parallelism setups are presented without error bars or details on experimental repetitions. This weakens the ability to confidently recommend specific configurations as best practices based on the data.
  Authors: We agree that additional details on measurement methodology would improve confidence in the reported numbers. The throughput values were collected during extended stable phases of production training runs on JUWELS Booster, where full repeated trials with statistical error bars are resource-prohibitive. In the revised manuscript we have added a paragraph describing the measurement protocol (duration of each stable window, number of independent configuration tests performed where feasible, and observed run-to-run variability). We also note the practical constraints of production-scale experiments. This provides necessary context while preserving the original data.
  Revision: partial
- Referee: The generalization of the reported practices beyond the JUWELS Booster system and 7B model size is not addressed. Since the central claim is to provide best practices, evidence or discussion on how these would apply to other hardware or scales is needed to support the claim.
  Authors: We acknowledge that the manuscript would benefit from explicit discussion of transferability. While the concrete numbers are tied to JUWELS Booster and the 7B scale, the underlying engineering choices (3D parallelism decomposition, software stack selection, and flash attention integration) rest on general principles of communication-computation overlap and memory hierarchy optimization. In the revised version we have added a dedicated subsection that discusses applicability to other European HPC systems with different interconnects, to larger model sizes, and to alternative accelerator architectures, including both the transferable elements and the system-specific caveats.
  Revision: yes
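The measurement protocol mentioned in the first response (stable windows and run-to-run variability) could be summarized along the following lines. This is a minimal sketch, assuming one throughput sample per stable window; the function name and sample values are hypothetical and do not come from the manuscript.

```python
import statistics


def summarize_throughput(samples):
    """Mean, sample standard deviation, and coefficient of variation.

    samples: tokens/s readings, one per stable measurement window.
    Assumes positive throughput values, so the mean is never zero.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


if __name__ == "__main__":
    # Hypothetical readings from five stable windows of one configuration.
    windows = [2.61e5, 2.58e5, 2.64e5, 2.60e5, 2.57e5]
    stats = summarize_throughput(windows)
    print(f"mean={stats['mean']:.0f} tokens/s, cv={stats['cv']:.2%}")
```

Reporting the coefficient of variation alongside each mean gives a reader a quick sense of whether the gap between two configurations exceeds the noise within one.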
Circularity Check
No circularity detected; purely descriptive empirical report of measurements and project practices
The manuscript is a project report presenting measured throughput numbers for 3D parallelism configurations and flash-attention impact on Teuken-7B running on JUWELS Booster, along with descriptions of system architecture, software choices, and operational challenges. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All content consists of direct reporting of project-specific data and choices with no load-bearing claims that reduce to inputs by construction. This matches the default expectation for non-circular empirical reports; the generalization limitation noted by the reader is a correctness/scope issue, not circularity.