Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project

Andreas Herten; Carolin Penke; Chelsea Maria John; Jan Ebert; Stefan Kesselheim

arxiv: 2504.10013 · v2 · submitted 2025-04-14 · 💻 cs.DC

Pith reviewed 2026-05-22 21:00 UTC · model grok-4.3

keywords: LLM training, HPC systems, 3D parallelism, flash attention, training throughput, transformer model, scalability

The pith

Throughput measurements show how 3D parallelism configurations and flash attention affect training speed for 7B models on HPC systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports measured throughput numbers collected while training a 7-billion-parameter transformer under multiple combinations of 3D parallelism. It also quantifies the change in speed when flash attention is enabled. These numbers matter because they supply concrete data points that can inform configuration choices when similar models are trained on large computing clusters. The report further outlines the supporting software stack, profiling methods, and day-to-day operational issues encountered during the runs.

Core claim

Measured throughput data across varied 3D parallelism settings during training of a 7B-parameter model, together with the performance impact of flash attention, constitute the central empirical contribution.
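As a rough illustration of what a throughput figure of this kind represents, the sketch below derives tokens per second from a global batch size, a sequence length, and a measured iteration time. This is a common convention for reporting LLM training throughput and is assumed here; the report's exact definition may differ. The function names and the example numbers are hypothetical, not values from the report.

```python
def tokens_per_second(global_batch_size: int, seq_len: int,
                      iteration_seconds: float) -> float:
    """Tokens processed per second across the whole job.

    Assumes every sample in the global batch has exactly seq_len tokens.
    """
    return global_batch_size * seq_len / iteration_seconds


def tokens_per_second_per_gpu(global_batch_size: int, seq_len: int,
                              iteration_seconds: float, num_gpus: int) -> float:
    """Same quantity normalized by GPU count, useful for comparing job sizes."""
    total = tokens_per_second(global_batch_size, seq_len, iteration_seconds)
    return total / num_gpus


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
total = tokens_per_second(global_batch_size=1024, seq_len=4096,
                          iteration_seconds=8.0)
per_gpu = tokens_per_second_per_gpu(1024, 4096, 8.0, num_gpus=256)
print(total, per_gpu)
```

Normalizing by GPU count is what makes runs of different sizes comparable; the unnormalized figure mostly reflects how many GPUs were allocated.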

What carries the argument

3D parallelism (the combination of data, tensor, and pipeline parallelism) augmented by the flash attention optimization.
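A minimal sketch of how the three dimensions combine, assuming the usual constraint that the product of the data-, tensor-, and pipeline-parallel degrees equals the number of GPUs. It additionally assumes, as a common practice rather than something this summary states, that tensor parallelism stays within a single node. The function name and the values 8 and 4 in the example are illustrative only.

```python
from typing import Iterator, Tuple


def parallel_layouts(world_size: int,
                     gpus_per_node: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (data, tensor, pipeline) degrees whose product is world_size.

    Assumption: the tensor-parallel degree must divide gpus_per_node so that
    a tensor-parallel group never spans more than one node.
    """
    for tp in range(1, gpus_per_node + 1):
        if world_size % tp != 0 or gpus_per_node % tp != 0:
            continue
        remaining = world_size // tp
        for pp in range(1, remaining + 1):
            if remaining % pp == 0:
                dp = remaining // pp
                yield dp, tp, pp


# Illustrative job: 8 GPUs spread over nodes with 4 GPUs each.
layouts = list(parallel_layouts(world_size=8, gpus_per_node=4))
for dp, tp, pp in layouts:
    print(f"data={dp} tensor={tp} pipeline={pp}")
```

Enumerating candidates this way only defines the search space; which layout is fastest is an empirical question of the kind the measured throughput data addresses.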

If this is right

Particular mixes of the three parallelism dimensions deliver higher tokens processed per second than others.
Enabling flash attention reduces memory use and raises overall training throughput.
Systematic profiling during runs identifies the dominant bottlenecks in the training pipeline.
Careful management of the software environment and job scheduler reduces lost compute time.
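The memory effect named in the second point can be made concrete with a back-of-the-envelope estimate. Standard attention materializes a seq_len x seq_len score matrix for every head, whereas flash attention computes the same result in tiles without storing that matrix. The sketch below only sizes the matrix that is avoided; the shape used is hypothetical and not taken from the report, and real savings depend on the implementation and on what else occupies memory.

```python
def attention_scores_bytes(batch: int, heads: int, seq_len: int,
                           bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold the full attention score matrix of one layer.

    bytes_per_elem=2 assumes 16-bit storage (fp16 or bf16).
    """
    return batch * heads * seq_len * seq_len * bytes_per_elem


# Hypothetical shape: one sequence of 4096 tokens, 32 attention heads.
score_bytes = attention_scores_bytes(batch=1, heads=32, seq_len=4096)
print(f"{score_bytes / 2**30:.1f} GiB for the score matrix of one layer")
```

Because the term grows with the square of the sequence length, doubling the context quadruples this buffer, which is why avoiding it matters more as sequences get longer.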

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same measurement approach could be applied to models larger than 7B parameters to test whether the same configuration trends hold.
Repeating the benchmarks on a different hardware generation would reveal how portable the observed optimal settings are.
The reported numbers supply a baseline for estimating total compute cost when planning multilingual model training campaigns.
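The cost-planning idea in the last point reduces to simple arithmetic once a per-GPU throughput is known. The sketch below is a hypothetical helper that assumes throughput stays constant for the whole run and ignores restarts, checkpointing, and queue time; none of the numbers come from the report.

```python
def gpu_hours(total_tokens: float, tokens_per_second_per_gpu: float) -> float:
    """GPU-hours needed to process total_tokens at a fixed per-GPU throughput."""
    return total_tokens / tokens_per_second_per_gpu / 3600.0


def wall_clock_days(total_gpu_hours: float, num_gpus: int) -> float:
    """Elapsed days if num_gpus run continuously with perfect scaling."""
    return total_gpu_hours / num_gpus / 24.0


# Placeholder planning inputs: 1e12 training tokens, 2000 tokens/s per GPU,
# and a 512-GPU allocation.
budget = gpu_hours(total_tokens=1e12, tokens_per_second_per_gpu=2000.0)
days = wall_clock_days(budget, num_gpus=512)
print(f"{budget:,.0f} GPU-hours, about {days:.1f} days on 512 GPUs")
```

An estimate like this is a lower bound; in practice the operational overheads the report describes would be added on top.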

Load-bearing premise

The throughput values recorded on this hardware and model size will remain representative on other hardware or with models of different sizes.

What would settle it

Repeating the exact same parallelism configurations on a second HPC cluster and obtaining throughput rankings that differ from those reported here.
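One way to frame this check, sketched under assumptions: record throughput for the same set of configurations on both clusters, sort each set, and compare the resulting orderings. The configuration labels and throughput values below are placeholders, and exact-order comparison is a deliberately strict criterion; a rank correlation would be a softer alternative.

```python
from typing import Dict, List, Tuple

Config = Tuple[int, int, int]  # (data, tensor, pipeline) degrees


def ranking(throughput: Dict[Config, float]) -> List[Config]:
    """Configurations ordered from highest to lowest throughput."""
    return sorted(throughput, key=lambda cfg: throughput[cfg], reverse=True)


def rankings_agree(a: Dict[Config, float], b: Dict[Config, float]) -> bool:
    """True if both clusters order the same configurations identically."""
    if set(a) != set(b):
        raise ValueError("both clusters must cover the same configurations")
    return ranking(a) == ranking(b)


# Placeholder measurements in arbitrary throughput units.
cluster_a = {(8, 1, 1): 100.0, (4, 2, 1): 120.0, (2, 2, 2): 90.0}
cluster_b = {(8, 1, 1): 70.0, (4, 2, 1): 95.0, (2, 2, 2): 60.0}
print(rankings_agree(cluster_a, cluster_b))
```

Absolute throughput is expected to differ between machines; the premise is only challenged if the ordering of configurations changes.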

Original abstract

The training of large language models (LLMs) requires substantial computational resources, complex software stacks, and carefully designed workflows to achieve scalability and efficiency. This report presents best practices and insights gained from the OpenGPT-X project, a German initiative focused on developing open, multilingual LLMs optimized for European languages. We detail the use of high-performance computing (HPC) systems, primarily JUWELS Booster at JSC, for training Teuken-7B, a 7-billion-parameter transformer model. The report covers system architecture, training infrastructure, software choices, profiling and benchmarking tools, as well as engineering and operational challenges. It includes measured throughput data of various configurations of 3D parallelism during training and the impact of features such as flash attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents best practices from the OpenGPT-X project for training LLMs on HPC systems, focusing on the training of the Teuken-7B model on JUWELS Booster. It details the infrastructure, software stack, 3D parallelism strategies, flash attention implementation, and includes specific measured throughput numbers for different parallelism configurations and feature impacts.

Significance. The provision of real-world measured data from a production-scale training run is a strength, offering practical insights into scaling LLM training on European HPC resources. This can inform similar projects, especially those emphasizing open and multilingual models. The report's value lies in its concrete examples rather than theoretical derivations.

major comments (2)

[Section on 3D Parallelism Configurations] The throughput measurements for different 3D parallelism setups are presented without error bars or details on experimental repetitions. This weakens the ability to confidently recommend specific configurations as best practices based on the data.
[Discussion of Best Practices] The generalization of the reported practices beyond the JUWELS Booster system and 7B model size is not addressed. Since the central claim is to provide best practices, evidence or discussion on how these would apply to other hardware or scales is needed to support the claim.

minor comments (1)

[Abstract] Consider adding one or two key quantitative results to the abstract to better convey the paper's empirical contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results and claims.

Point-by-point responses

Referee: The throughput measurements for different 3D parallelism setups are presented without error bars or details on experimental repetitions. This weakens the ability to confidently recommend specific configurations as best practices based on the data.

Authors: We agree that additional details on measurement methodology would improve confidence in the reported numbers. The throughput values were collected during extended stable phases of production training runs on JUWELS Booster, where full repeated trials with statistical error bars are resource-prohibitive. In the revised manuscript we have added a paragraph describing the measurement protocol (duration of each stable window, number of independent configuration tests performed where feasible, and observed run-to-run variability). We also note the practical constraints of production-scale experiments. This provides necessary context while preserving the original data.

Revision: partial

Referee: The generalization of the reported practices beyond the JUWELS Booster system and 7B model size is not addressed. Since the central claim is to provide best practices, evidence or discussion on how these would apply to other hardware or scales is needed to support the claim.

Authors: We acknowledge that the manuscript would benefit from explicit discussion of transferability. While the concrete numbers are tied to JUWELS Booster and the 7B scale, the underlying engineering choices (3D parallelism decomposition, software stack selection, and flash attention integration) rest on general principles of communication-computation overlap and memory hierarchy optimization. In the revised version we have added a dedicated subsection that discusses applicability to other European HPC systems with different interconnects, to larger model sizes, and to alternative accelerator architectures, including both the transferable elements and the system-specific caveats.

Revision: yes
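The measurement protocol described in the first response (stable windows and run-to-run variability) could be summarized along the following lines. This is a sketch, assuming per-iteration times are logged during a stable phase; it is not the authors' tooling, and the function name and sample values are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def summarize_window(iteration_seconds: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, relative spread) of a window.

    Relative spread is stdev / mean, a unitless measure of variability.
    Requires at least two samples.
    """
    mean = statistics.mean(iteration_seconds)
    stdev = statistics.stdev(iteration_seconds)
    return mean, stdev, stdev / mean


# Hypothetical iteration times (seconds) from one stable window.
window = [8.1, 7.9, 8.0, 8.2, 7.8]
mean, stdev, spread = summarize_window(window)
print(f"{mean:.2f} s +/- {stdev:.2f} s ({spread:.1%} relative)")
```

Reporting the spread next to each throughput value is a lightweight substitute for full repeated trials when those are too expensive to run.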

Circularity Check

0 steps flagged

No circularity detected; purely descriptive empirical report of measurements and project practices

Full rationale

The manuscript is a project report presenting measured throughput numbers for 3D parallelism configurations and flash-attention impact on Teuken-7B running on JUWELS Booster, along with descriptions of system architecture, software choices, and operational challenges. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All content consists of direct reporting of project-specific data and choices with no load-bearing claims that reduce to inputs by construction. This matches the default expectation for non-circular empirical reports; the generalization limitation noted by the reader is a correctness/scope issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, or invented entities; the work is an applied engineering report on infrastructure choices and measurements.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 23 internal anchors

[1]

JUWELS: Modular Tier-0/1 Supercompute r at the Jülich Supercomputing Centre

Dorian Krause. “JUWELS: Modular Tier-0/1 Supercompute r at the Jülich Supercomputing Centre”. In: Journal of large-scale research facilities 5 (2019), A135. /d.sc/o.sc/i.sc: 10.17815/jlsrf-5-171. /u.sc/r.sc/l.sc: https://jlsrf.org/index.php/lsf/article/view/171

work page doi:10.17815/jlsrf-5-171 2019
[2]

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Mehdi Ali et al. Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs. 2024. arXiv: 2410.03730 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2410.03730

work page arXiv 2024
[3]

Attention Is All You Need

Ashish Vaswani et al. Attention Is All You Need. 2023. arXiv: 1706.03762 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

A mathematician’s introduction to transformers and large l anguage models

Carolin Penke. A mathematician’s introduction to transformers and large l anguage models . JSC Accelerating Devices Lab Blog (online). July 2022. /d.sc/o.sc/i.sc: 10.34732/xdvblg-qsbtyx

work page doi:10.34732/xdvblg-qsbtyx 2022
[5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Un- derstanding. 2019. arXiv: 1810.04805 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2019
[6]

Language Models are Unsupervised Mult itask Learners

Alec Radford et al. “Language Models are Unsupervised Mult itask Learners”. In: OpenAI (2019). Accessed: 2024-11-15./u.sc/r.sc/l.sc: https://cdn.openai.com/better-language-models/language _models_are_unsup

work page 2019
[7]

Language Models are Few-Shot Learners

Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2005.14165

work page internal anchor Pith review Pith/arXiv arXiv 2020
[8]

Training language models to follow instructions with human feedback

Long Ouyang et al. Training language models to follow instructions with human feedback

work page
[9]

Training language models to follow instructions with human feedback

arXiv: 2203.02155 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2203.02155

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler et al. Fine-Tuning Language Models from Human Preferences . 2020. arXiv: 1909.08593 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1909.08593

work page internal anchor Pith review Pith/arXiv arXiv 2020
[11]

Learning to summarize from human feedback

Nisan Stiennon et al. Learning to summarize from human feedback. 2022. arXiv: 2009.01325 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2009.01325. 12 Carolin Penke, Chelsea Maria John, Jan Ebert, Stefan Kessel heim, and Andreas Herten

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al. LLaMA: Open and Eﬃcient Foundation Language Models . 2023. arXiv: 2302.13971 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI et al. DeepSeek-V2: A Strong, Economical, and Eﬃcient Mixture-of -Experts Lan- guage Model. 2024. arXiv: 2405.04434 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2405.04434

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

OpenGPT-X - Training Large Language Mod els on HPC Systems

Carolin Penke et al. “OpenGPT-X - Training Large Language Mod els on HPC Systems”. In: 14th JLESC Workshop, Urbana-Champaign (USA), 28 Sep 2022 - 30 Sep 2022. Sept. 28, 2022. /u.sc/r.sc/l.sc: https://juser.fz-juelich.de/record/910080

work page 2022
[15]

Eﬃcient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Malte Ostendorﬀ and Georg Rehm. Eﬃcient Language Model Training through Cross-Lingual and Progressive Transfer Learning. 2023. arXiv: 2301.09626 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2301.09626

work page arXiv 2023
[18]

Tokenizer Choice For LLM Training: Negligible or Crucial?2024

Mehdi Ali et al. Tokenizer Choice For LLM Training: Negligible or Crucial?2024. arXiv: 2310.08754 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2310.08754

work page arXiv 2024
[19]

OpenGPT-X: Novel Ar chitecture Exploration

Chelsea Maria John and Andreas Herten. “OpenGPT-X: Novel Ar chitecture Exploration”. In: WHPC Workshop at SC23 (WHPC@SC23). Denver, CO: Zenodo, Nov. 2023./d.sc/o.sc/i.sc: 10.5281/zenodo.10116242. /u.sc/r.sc/l.sc: https://doi.org/10.5281/zenodo.10116242

work page doi:10.5281/zenodo.10116242 2023
[20]

Investigating Multilingual Instruction-Tuning: Do Polyglot Mod- els Demand for Multilingual Instructions? 2024

Alexander Arno Weber et al. Investigating Multilingual Instruction-Tuning: Do Polyglot Mod- els Demand for Multilingual Instructions? 2024. arXiv: 2402.13703 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2402.13703

work page arXiv 2024
[21]

ILLUMINER: Instruction-tuned Large Language Models as Few -shot In- tent Classiﬁer and Slot Filler

Paramita Mirza et al. ILLUMINER: Instruction-tuned Large Language Models as Few -shot In- tent Classiﬁer and Slot Filler. 2024. arXiv: 2403.17536 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2403.17536

work page arXiv 2024
[22]

Knowledge-Centric Hallucination Detection

Martin Courtois et al. “Symmetric Dot-Product Attentio n for Eﬃcient Training of BERT Lan- guage Models”. In: Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Com- putational Linguistics, Aug. 2024, pp. 8002–8011. /d.sc/o.sc/i.sc: 10.18653/v1/2024....

work page doi:10.18653/v1/2024 2024
[23]

LLM-Datasets: An Open Framework for Pretraining Datasets of Large Language Models

Malte Ostendorﬀ et al. “LLM-Datasets: An Open Framework for Pretraining Datasets of Large Language Models”. In: First Conference on Language Modeling. 2024. /u.sc/r.sc/l.sc: https://openreview.net/forum?id=5RdIMlGLXL

work page 2024
[24]

Performance and Power: Systematic Evaluation of AI Workloa ds on Accelerators with CARAML

Chelsea Maria John et al. Performance and Power: Systematic Evaluation of AI Workloa ds on Accelerators with CARAML. 2024. arXiv: 2409.12994 [cs.AR]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2409.12994

work page arXiv 2024
[25]

Towards Multilingual LLM Evaluation for European Languages

Klaudia Thellmann et al. Towards Multilingual LLM Evaluation for European Languages. 2024. arXiv: 2410.08928 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2410.08928

work page arXiv 2024
[26]

Data Processing for the OpenGPT-X Model Family

Nicolo’ Brandizzi et al. Data Processing for the OpenGPT-X Model Family. 2024. arXiv: 2410.08800 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2410.08800

work page arXiv 2024
[27]

OpenGPT-X: Leveraging GCS Infrastructur e for European Large Language Models

Jan Ebert et al. “OpenGPT-X: Leveraging GCS Infrastructur e for European Large Language Models”. In: NIC Symposium 2025 Proceedings . Ed. by Christine Peter, Marcus Müller, and Alexander Trautmann. Vol. 52. NIC Series. To appear. Jülich, Germany: Forschungszentrum Jülich GmbH, 2025

work page 2025
[28]

Training Compute-Optimal Large Language Models

Jordan Hoﬀmann et al. Training Compute-Optimal Large Language Models. 2022. arXiv: 2203.15556 [cs.CL] /u.sc/r.sc/l.sc: https://arxiv.org/abs/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Mixtral of Experts

Albert Q. Jiang et al. Mixtral of Experts. 2024. arXiv: 2401.04088 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2401.04088

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoﬀ et al. OLMoE: Open Mixture-of-Experts Language Models . 2025. arXiv: 2409.02060 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2409.02060

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi- Head Checkpoints. 2023. arXiv: 2305.13245 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2305.13245. Best Practices for HPC LLM Training 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. 2022. arXiv: 2205.01068 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2205.01068

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

I am EdgeRunner AI

Yanli Zhao et al. “PyTorch FSDP: Experiences on Scaling Ful ly Sharded Data Parallel”. In: Proc. VLDB Endow.16.12 (Aug. 2023), pp. 3848–3860. /i.sc/s.sc/s.sc/n.sc: 2150-8097. /d.sc/o.sc/i.sc: 10.14778/3611540.3611569. /u.sc/r.sc/l.sc: https://doi.org/10.14778/3611540.3611569

work page doi:10.14778/3611540.3611569 2023
[35]

Memory and Bandwidth are All Your Need f or Fully Sharded Data Parallel

Jiangtao Wang et al. “Memory and Bandwidth are All Your Need f or Fully Sharded Data Parallel”. In: 2nd Workshop on Advancing Neural Network Training: Computational Eﬃciency, Scalability, and Resource Optimization (W ANT@ICML 2024). 2024. /u.sc/r.sc/l.sc: https://openreview.net/forum?id=qqV

work page 2024
[36]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. 2020. arXiv: 1910.02054 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1910.02054

work page internal anchor Pith review Pith/arXiv arXiv 2020
[39]

FlashAttention: Fast and Memory-Eﬃcient Exact Attention w ith IO-A wareness

Tri Dao et al. FlashAttention: Fast and Memory-Eﬃcient Exact Attention w ith IO-A wareness

work page
[40]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arXiv: 2205.14135 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2205.14135

work page internal anchor Pith review Pith/arXiv arXiv
[41]

FlashAttention-2: Faster Attention with Better Paralleli sm and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Paralleli sm and Work Partitioning

work page
[42]

arXiv: 2307.08691 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Research without Re-search: Maximal Update Parametrizati on Yields Accurate Loss Prediction across Scales

Yiqun Yao and Yequan Wang. Research without Re-search: Maximal Update Parametrizati on Yields Accurate Loss Prediction across Scales . 2023. arXiv: 2304.06875 [cs.CL]

work page arXiv 2023
[44]

Eﬃciently Scaling Transformer Inference

Reiner Pope et al. Eﬃciently Scaling Transformer Inference . 2022. arXiv: 2211.05102 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2211.05102

work page arXiv 2022
[45]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

work page
[46]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

arXiv: 2312.00752 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2312.00752

work page internal anchor Pith review Pith/arXiv arXiv
[47]

xlstm: Extended long short-term memory

Maximilian Beck et al. xLSTM: Extended Long Short-Term Memory. 2024. arXiv: 2405.04517 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2405.04517

work page arXiv 2024
[48]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

Mod ular Supercomputing Architecture: From Idea to Production

Estela Suarez, Norbert Eicker, and Thomas Lippert. “Mod ular Supercomputing Architecture: From Idea to Production”. In: May 2019, pp. 223–255. /i.sc/s.sc/b.sc/n.sc: 9781351036863. /d.sc/o.sc/i.sc: 10.1201/9781351036863-9

work page doi:10.1201/9781351036863-9 2019
[50]

JURECA: Data Centric and Booster module s implementing the modular supercomputing architecture at Jülich Supercomputing Centre

Philipp Thörnig. “JURECA: Data Centric and Booster module s implementing the modular supercomputing architecture at Jülich Supercomputing Centre ”. In: J. Large-scale Res. Facil. JLSRF 7.A182 (Oct. 2021). /u.sc/r.sc/l.sc: https://doi.org/10.17815/jlsrf-7-182

work page doi:10.17815/jlsrf-7-182 2021
[51]

PyTorch: An Imperative Style, High-Performance Deep Learn ing Library

Adam Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learn ing Library

work page
[52]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

arXiv: 1912.01703 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1912.01703

work page internal anchor Pith review Pith/arXiv arXiv 1912
[53]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems

Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Soft- ware available from tensorﬂow.org. 2015. /u.sc/r.sc/l.sc: https://www.tensorﬂow.org/

work page 2015
[54]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BigScience Workshop et al. BLOOM: A 176B-Parameter Open-Access Multilingual Languag e Model. 2023. arXiv: 2211.05100 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2211.05100

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Reducing Activation Recomputation in Large Transformer Mo dels

Vijay Korthikanti et al. Reducing Activation Recomputation in Large Transformer Mo dels

work page
[56]

arXiv: 2205.05198 [cs.LG]

work page arXiv
[57]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Oﬁr Press, Noah A. Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation . 2022. arXiv: 2108.12409 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2022
[58]

Adan: Adaptive Nesterov Momentum Algorithm for Faster Opti mizing Deep Models

Xingyu Xie et al. Adan: Adaptive Nesterov Momentum Algorithm for Faster Opti mizing Deep Models. 2023. arXiv: 2208.06677 [cs.LG]

work page arXiv 2023
[59]

The Ultra-Scale Playbook: Training LLMs on GPU Clusters

Nouamane Tazi et al. The Ultra-Scale Playbook: Training LLMs on GPU Clusters . 2025

work page 2025
[60]

Modern Sc ientiﬁc Software Manage- ment Using EasyBuild and Lmod

Markus Geimer, Kenneth Hoste, and Robert McLay. “Modern Sc ientiﬁc Software Manage- ment Using EasyBuild and Lmod”. In: 2014 First International Workshop on HPC User Support Tools. 2014, pp. 41–51. /d.sc/o.sc/i.sc: 10.1109/HUST.2014.8

work page doi:10.1109/hust.2014.8 2014
[61]

Singularity

Singularity Developers. Singularity. 2021. /d.sc/o.sc/i.sc: 10.5281/zenodo.1310023. /u.sc/r.sc/l.sc: https://doi.org/10.5281/zenodo 14 Carolin Penke, Chelsea Maria John, Jan Ebert, Stefan Kessel heim, and Andreas Herten

work page doi:10.5281/zenodo.1310023 2021
[62]

UFTP: high-perfor mance data transfer for UNI- CORE

Bernd Thomas Schuller and Tim Pohlmann. “UFTP: high-perfor mance data transfer for UNI- CORE”. In: July 2011, pp. 135–142

work page 2011
[63]

Yannik Müller et al. LLview. Version v2.3.1-base. July 2024. /d.sc/o.sc/i.sc: 10.5281/zenodo.12706843. /u.sc/r.sc/l.sc: https://doi.org/10.5281/zenodo.12706843

work page doi:10.5281/zenodo.12706843 2024
[64]

Analyzing HPC Monitoring Data With a Vi ew Towards Eﬃcient Re- source Utilization

Samuel Maloney et al. “Analyzing HPC Monitoring Data With a Vi ew Towards Eﬃcient Re- source Utilization”. In: 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 2024, pp. 170–181. /d.sc/o.sc/i.sc: 10.1109/SBAC-PAD63648.2024.00023

work page doi:10.1109/sbac-pad63648.2024.00023 2024
[65]

OpenGPT-X – Training Large Langua ge Models on HPC Systems

Chelsea Maria John et al. “OpenGPT-X – Training Large Langua ge Models on HPC Systems”. In: ISC High Performance 2023, Hamburg (Germany), 21 May 2023 - 2 5 May 2023. May 21,

work page 2023
[66]

/u.sc/r.sc/l.sc: https://juser.fz-juelich.de/record/1007707

/d.sc/o.sc/i.sc: 10.34732/XDVBLG-SVNDMJ. /u.sc/r.sc/l.sc: https://juser.fz-juelich.de/record/1007707

work page doi:10.34732/xdvblg-svndmj
[67]

Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

Andreas Herten et al. “Application-Driven Exascale: The JUPITER Benchmark Suite”. In: SC24: International Conference for High Performance Compu ting, Networking, Storage and Analysis. IEEE, Nov. 2024, pp. 1–45. /d.sc/o.sc/i.sc: 10.1109/sc41406.2024.00038. /u.sc/r.sc/l.sc: http://dx.doi.org/10.1109/SC41406.2024.00 A APPENDIX Example Slurm job script to lau...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00038 2024

[1] [1]

JUWELS: Modular Tier-0/1 Supercompute r at the Jülich Supercomputing Centre

Dorian Krause. “JUWELS: Modular Tier-0/1 Supercompute r at the Jülich Supercomputing Centre”. In: Journal of large-scale research facilities 5 (2019), A135. /d.sc/o.sc/i.sc: 10.17815/jlsrf-5-171. /u.sc/r.sc/l.sc: https://jlsrf.org/index.php/lsf/article/view/171

work page doi:10.17815/jlsrf-5-171 2019

[2] [2]

Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs

Mehdi Ali et al. Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs. 2024. arXiv: 2410.03730 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2410.03730

work page arXiv 2024

[3] [3]

Attention Is All You Need

Ashish Vaswani et al. Attention Is All You Need. 2023. arXiv: 1706.03762 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

A mathematician’s introduction to transformers and large l anguage models

Carolin Penke. A mathematician’s introduction to transformers and large l anguage models . JSC Accelerating Devices Lab Blog (online). July 2022. /d.sc/o.sc/i.sc: 10.34732/xdvblg-qsbtyx

work page doi:10.34732/xdvblg-qsbtyx 2022

[5] [5]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Un- derstanding. 2019. arXiv: 1810.04805 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1810.04805

work page internal anchor Pith review Pith/arXiv arXiv 2019

[6] [6]

Language Models are Unsupervised Mult itask Learners

Alec Radford et al. “Language Models are Unsupervised Mult itask Learners”. In: OpenAI (2019). Accessed: 2024-11-15./u.sc/r.sc/l.sc: https://cdn.openai.com/better-language-models/language _models_are_unsup

work page 2019

[7] [7]

Language Models are Few-Shot Learners

Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv: 2005.14165 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2005.14165

work page internal anchor Pith review Pith/arXiv arXiv 2020

[8] [8]

Training language models to follow instructions with human feedback

Long Ouyang et al. Training language models to follow instructions with human feedback

work page

[9] [9]

Training language models to follow instructions with human feedback

arXiv: 2203.02155 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2203.02155

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler et al. Fine-Tuning Language Models from Human Preferences . 2020. arXiv: 1909.08593 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/1909.08593

work page internal anchor Pith review Pith/arXiv arXiv 2020

[11] [11]

Learning to summarize from human feedback

Nisan Stiennon et al. Learning to summarize from human feedback. 2022. arXiv: 2009.01325 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2009.01325. 12 Carolin Penke, Chelsea Maria John, Jan Ebert, Stefan Kessel heim, and Andreas Herten

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al. LLaMA: Open and Eﬃcient Foundation Language Models . 2023. arXiv: 2302.13971 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI et al. DeepSeek-V2: A Strong, Economical, and Eﬃcient Mixture-of -Experts Lan- guage Model. 2024. arXiv: 2405.04434 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2405.04434

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

OpenGPT-X - Training Large Language Mod els on HPC Systems

Carolin Penke et al. “OpenGPT-X - Training Large Language Mod els on HPC Systems”. In: 14th JLESC Workshop, Urbana-Champaign (USA), 28 Sep 2022 - 30 Sep 2022. Sept. 28, 2022. /u.sc/r.sc/l.sc: https://juser.fz-juelich.de/record/910080

work page 2022

[15] [15]

Eﬃcient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Malte Ostendorﬀ and Georg Rehm. Eﬃcient Language Model Training through Cross-Lingual and Progressive Transfer Learning. 2023. arXiv: 2301.09626 [cs.CL]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2301.09626

work page arXiv 2023

[16] [18]

Tokenizer Choice For LLM Training: Negligible or Crucial?2024

Mehdi Ali et al. Tokenizer Choice For LLM Training: Negligible or Crucial?2024. arXiv: 2310.08754 [cs.LG]. /u.sc/r.sc/l.sc: https://arxiv.org/abs/2310.08754

work page arXiv 2024

[17] [19]

OpenGPT-X: Novel Ar chitecture Exploration

Chelsea Maria John and Andreas Herten. “OpenGPT-X: Novel Ar chitecture Exploration”. In: WHPC Workshop at SC23 (WHPC@SC23). Denver, CO: Zenodo, Nov. 2023./d.sc/o.sc/i.sc: 10.5281/zenodo.10116242. /u.sc/r.sc/l.sc: https://doi.org/10.5281/zenodo.10116242

work page doi:10.5281/zenodo.10116242 2023

[18] [20]

Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions?

Alexander Arno Weber et al. Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? 2024. arXiv: 2402.13703 [cs.CL]. URL: https://arxiv.org/abs/2402.13703

[19] [21]

ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler

Paramita Mirza et al. ILLUMINER: Instruction-tuned Large Language Models as Few-shot Intent Classifier and Slot Filler. 2024. arXiv: 2403.17536 [cs.CL]. URL: https://arxiv.org/abs/2403.17536

[20] [22]

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Martin Courtois et al. “Symmetric Dot-Product Attention for Efficient Training of BERT Language Models”. In: Findings of the Association for Computational Linguistics: ACL 2024. Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 8002–8011. DOI: 10.18653/v1/2024....

[21] [23]

LLM-Datasets: An Open Framework for Pretraining Datasets of Large Language Models

Malte Ostendorff et al. “LLM-Datasets: An Open Framework for Pretraining Datasets of Large Language Models”. In: First Conference on Language Modeling. 2024. URL: https://openreview.net/forum?id=5RdIMlGLXL

[22] [24]

Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML

Chelsea Maria John et al. Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML. 2024. arXiv: 2409.12994 [cs.AR]. URL: https://arxiv.org/abs/2409.12994

[23] [25]

Towards Multilingual LLM Evaluation for European Languages

Klaudia Thellmann et al. Towards Multilingual LLM Evaluation for European Languages. 2024. arXiv: 2410.08928 [cs.CL]. URL: https://arxiv.org/abs/2410.08928

[24] [26]

Data Processing for the OpenGPT-X Model Family

Nicolo’ Brandizzi et al. Data Processing for the OpenGPT-X Model Family. 2024. arXiv: 2410.08800 [cs.CL]. URL: https://arxiv.org/abs/2410.08800

[25] [27]

OpenGPT-X: Leveraging GCS Infrastructure for European Large Language Models

Jan Ebert et al. “OpenGPT-X: Leveraging GCS Infrastructure for European Large Language Models”. In: NIC Symposium 2025 Proceedings. Ed. by Christine Peter, Marcus Müller, and Alexander Trautmann. Vol. 52. NIC Series. To appear. Jülich, Germany: Forschungszentrum Jülich GmbH, 2025

[26] [28]

Training Compute-Optimal Large Language Models

Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. 2022. arXiv: 2203.15556 [cs.CL]. URL: https://arxiv.org/abs/2203.15556

[27] [29]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL]. URL: https://arxiv.org/abs/2104.09864

[28] [30]

Mixtral of Experts

Albert Q. Jiang et al. Mixtral of Experts. 2024. arXiv: 2401.04088 [cs.LG]. URL: https://arxiv.org/abs/2401.04088

[29] [31]

OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoff et al. OLMoE: Open Mixture-of-Experts Language Models. 2025. arXiv: 2409.02060 [cs.CL]. URL: https://arxiv.org/abs/2409.02060

[30] [32]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie et al. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. 2023. arXiv: 2305.13245 [cs.CL]. URL: https://arxiv.org/abs/2305.13245

[31] [33]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. 2022. arXiv: 2205.01068 [cs.CL]. URL: https://arxiv.org/abs/2205.01068

[32] [34]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao et al. “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel”. In: Proc. VLDB Endow. 16.12 (Aug. 2023), pp. 3848–3860. ISSN: 2150-8097. DOI: 10.14778/3611540.3611569. URL: https://doi.org/10.14778/3611540.3611569

[33] [35]

Memory and Bandwidth are All Your Need for Fully Sharded Data Parallel

Jiangtao Wang et al. “Memory and Bandwidth are All Your Need for Fully Sharded Data Parallel”. In: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024). 2024. URL: https://openreview.net/forum?id=qqV

[34] [36]

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. 2020. arXiv: 1910.02054 [cs.LG]. URL: https://arxiv.org/abs/1910.02054

[35] [39]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv: 2205.14135 [cs.LG]. URL: https://arxiv.org/abs/2205.14135

[37] [41]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv: 2307.08691 [cs.LG]

[39] [43]

Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales

Yiqun Yao and Yequan Wang. Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales. 2023. arXiv: 2304.06875 [cs.CL]

[40] [44]

Efficiently Scaling Transformer Inference

Reiner Pope et al. Efficiently Scaling Transformer Inference. 2022. arXiv: 2211.05102 [cs.LG]. URL: https://arxiv.org/abs/2211.05102

[41] [45]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv: 2312.00752 [cs.LG]. URL: https://arxiv.org/abs/2312.00752

[43] [47]

xLSTM: Extended Long Short-Term Memory

Maximilian Beck et al. xLSTM: Extended Long Short-Term Memory. 2024. arXiv: 2405.04517 [cs.LG]. URL: https://arxiv.org/abs/2405.04517

[44] [48]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV]. URL: https://arxiv.org/abs/2010.11929

[45] [49]

Modular Supercomputing Architecture: From Idea to Production

Estela Suarez, Norbert Eicker, and Thomas Lippert. “Modular Supercomputing Architecture: From Idea to Production”. In: May 2019, pp. 223–255. ISBN: 9781351036863. DOI: 10.1201/9781351036863-9

[46] [50]

JURECA: Data Centric and Booster modules implementing the modular supercomputing architecture at Jülich Supercomputing Centre

Philipp Thörnig. “JURECA: Data Centric and Booster modules implementing the modular supercomputing architecture at Jülich Supercomputing Centre”. In: J. Large-scale Res. Facil. JLSRF 7.A182 (Oct. 2021). URL: https://doi.org/10.17815/jlsrf-7-182

[47] [51]

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv: 1912.01703 [cs.LG]. URL: https://arxiv.org/abs/1912.01703

[49] [53]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems

Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: https://www.tensorflow.org/

[50] [54]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BigScience Workshop et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. 2023. arXiv: 2211.05100 [cs.CL]. URL: https://arxiv.org/abs/2211.05100

[51] [55]

Reducing Activation Recomputation in Large Transformer Models

Vijay Korthikanti et al. Reducing Activation Recomputation in Large Transformer Models. arXiv: 2205.05198 [cs.LG]

[53] [57]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A. Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. 2022. arXiv: 2108.12409 [cs.CL]

[54] [58]

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Xingyu Xie et al. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. 2023. arXiv: 2208.06677 [cs.LG]

[55] [59]

The Ultra-Scale Playbook: Training LLMs on GPU Clusters

Nouamane Tazi et al. The Ultra-Scale Playbook: Training LLMs on GPU Clusters. 2025

[56] [60]

Modern Scientific Software Management Using EasyBuild and Lmod

Markus Geimer, Kenneth Hoste, and Robert McLay. “Modern Scientific Software Management Using EasyBuild and Lmod”. In: 2014 First International Workshop on HPC User Support Tools. 2014, pp. 41–51. DOI: 10.1109/HUST.2014.8

[57] [61]

Singularity

Singularity Developers. Singularity. 2021. DOI: 10.5281/zenodo.1310023. URL: https://doi.org/10.5281/zenodo

[58] [62]

UFTP: high-performance data transfer for UNICORE

Bernd Thomas Schuller and Tim Pohlmann. “UFTP: high-performance data transfer for UNICORE”. In: July 2011, pp. 135–142

[59] [63]

Yannik Müller et al. LLview. Version v2.3.1-base. July 2024. DOI: 10.5281/zenodo.12706843. URL: https://doi.org/10.5281/zenodo.12706843

[60] [64]

Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization

Samuel Maloney et al. “Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization”. In: 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). 2024, pp. 170–181. DOI: 10.1109/SBAC-PAD63648.2024.00023

[61] [65]

OpenGPT-X – Training Large Language Models on HPC Systems

Chelsea Maria John et al. “OpenGPT-X – Training Large Language Models on HPC Systems”. In: ISC High Performance 2023, Hamburg (Germany), 21 May 2023 - 25 May 2023. May 21, 2023. DOI: 10.34732/XDVBLG-SVNDMJ. URL: https://juser.fz-juelich.de/record/1007707

[63] [67]

Application-Driven Exascale: The JUPITER Benchmark Suite

Andreas Herten et al. “Application-Driven Exascale: The JUPITER Benchmark Suite”. In: SC24: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Nov. 2024, pp. 1–45. DOI: 10.1109/sc41406.2024.00038. URL: http://dx.doi.org/10.1109/SC41406.2024.00

A APPENDIX

Example Slurm job script to lau...