Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems
Pith reviewed 2026-05-10 14:34 UTC · model grok-4.3
The pith
A hybrid HPC-cloud platform lets supercomputers run complete foundation model lifecycles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents a hybrid cloud-native platform that pairs diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure, all managed by Kubernetes, to bridge traditional HPC batch processing with the service-oriented workflows needed for fine-tuning and inference of foundation models.
What carries the argument
The hybrid service architecture that integrates specialized diskless HPC GPU nodes with virtualized commodity hardware under Kubernetes orchestration to handle mixed batch and interactive AI workloads.
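One way to picture the split between the two hardware classes is a toy placement rule. The sketch below is illustrative only and is not the paper's scheduler: the pool names, the `Workload` type, and the "GPU requests go to the HPC pool" rule are all assumptions introduced here, since the source does not describe an actual placement policy.

```python
from dataclasses import dataclass

# Hypothetical pool names; the paper does not name its node pools.
HPC_GPU_POOL = "hpc-gpu-diskless"
COMMODITY_POOL = "virtualized-commodity"


@dataclass(frozen=True)
class Workload:
    name: str
    gpus: int          # number of GPUs requested
    interactive: bool  # True for long-running services, False for batch jobs


def place(workload: Workload) -> tuple[str, str]:
    """Return (pool, mode) for a workload under an assumed toy rule.

    Assumption: anything requesting GPUs runs on the diskless HPC nodes,
    while GPU-free components (gateways, controllers) run on commodity VMs.
    """
    if workload.gpus < 0:
        raise ValueError("gpus must be non-negative")
    pool = HPC_GPU_POOL if workload.gpus > 0 else COMMODITY_POOL
    mode = "service" if workload.interactive else "batch"
    return pool, mode


def placement_plan(workloads: list[Workload]) -> dict[str, tuple[str, str]]:
    """Map each workload name to its (pool, mode) placement."""
    return {w.name: place(w) for w in workloads}


plan = placement_plan([
    Workload("fine-tune-job", gpus=8, interactive=False),
    Workload("inference-server", gpus=4, interactive=True),
    Workload("api-gateway", gpus=0, interactive=True),
])
```

In a real Kubernetes deployment this kind of decision would more likely be expressed through node labels, selectors, and taints than through application code; the function form is used here only to make the mixed batch and service placement explicit.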
If this is right
- Supercomputers can host fine-tuning pipelines that use substantial GPU resources yet need more interactive scheduling than pre-training jobs.
- Highly available inference services become feasible inside the same HPC facility that performed the pre-training.
- User productivity rises because researchers no longer need to export models to separate cloud environments for later lifecycle stages.
- Other national facilities gain a concrete blueprint for adding AI-factory capabilities to existing capability-class machines.
Where Pith is reading between the lines
- Sovereign AI programs could keep more of the model lifecycle inside government-funded HPC systems rather than relying on commercial clouds.
- Scientific workflows that combine simulation with AI inference might run end-to-end on the same machine without data movement.
- The architecture may generalize to other mixed workloads such as interactive data analysis alongside traditional batch simulations.
Load-bearing premise
The hybrid diskless HPC plus virtualized commodity setup orchestrated by Kubernetes can be deployed in production without major performance penalties or operational conflicts.
What would settle it
Measure whether production fine-tuning and inference jobs on the hybrid platform show higher latency, lower GPU utilization, or more frequent failures than equivalent jobs on pure batch HPC or pure cloud systems.
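A minimal sketch of how such a comparison could be tabulated is shown below. It assumes per-job records with a latency, a GPU utilization fraction, and a failure flag; the `JobRecord` type, the platform labels, the 5% tolerance, and the sample records are all hypothetical placeholders, not measurements from the paper (which reports none).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class JobRecord:
    platform: str           # hypothetical labels: "hybrid", "batch", "cloud"
    latency_s: float        # end-to-end job or request latency in seconds
    gpu_utilization: float  # mean GPU utilization as a fraction in [0, 1]
    failed: bool


def summarize(records):
    """Aggregate mean latency, mean GPU utilization, and failure rate per platform."""
    by_platform = {}
    for r in records:
        by_platform.setdefault(r.platform, []).append(r)
    return {
        platform: {
            "mean_latency_s": mean(r.latency_s for r in rs),
            "mean_gpu_utilization": mean(r.gpu_utilization for r in rs),
            "failure_rate": sum(r.failed for r in rs) / len(rs),
        }
        for platform, rs in by_platform.items()
    }


def regressions(summary, candidate, baseline, tolerance=0.05):
    """List the metrics on which `candidate` is worse than `baseline`.

    Latency and utilization use a relative tolerance; failure rate uses an
    absolute one. The tolerance value is an arbitrary assumption.
    """
    c, b = summary[candidate], summary[baseline]
    worse = []
    if c["mean_latency_s"] > b["mean_latency_s"] * (1 + tolerance):
        worse.append("latency")
    if c["mean_gpu_utilization"] < b["mean_gpu_utilization"] * (1 - tolerance):
        worse.append("gpu_utilization")
    if c["failure_rate"] > b["failure_rate"] + tolerance:
        worse.append("failure_rate")
    return worse


# Synthetic placeholder records, chosen only to exercise the code.
records = [
    JobRecord("hybrid", 2.0, 0.80, False),
    JobRecord("hybrid", 4.0, 0.60, True),
    JobRecord("batch", 2.0, 0.90, False),
    JobRecord("batch", 2.0, 0.90, False),
]
summary = summarize(records)
flagged = regressions(summary, candidate="hybrid", baseline="batch")
```

A real study would also need matched workloads, repeated runs, and a statistical test rather than a fixed tolerance; the sketch only shows the shape of the comparison.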
Original abstract
Large-scale pre-training of Foundational Models (FM) constitutes a computationally intensive first phase for enabling AI across diverse scientific and societal applications. This first phase has positioned High-Performance Computing (HPC) facilities as indispensable backbones of "Sovereign AI" initiatives. While the massive throughput requirements of FM pre-training align with the traditional capability-oriented mission of HPC, subsequent phases of the AI lifecycle, typically referred to as fine-tuning and inference, introduce operational paradigms that can conflict with established batch-processing environments. Moreover, these phases are not computationally trivial: they often require substantial high-end compute resources while exhibiting hardware utilization patterns that differ significantly from those of pre-training. This paper addresses the architectural and strategic challenges of operationalizing a complete AI lifecycle within a national supercomputing facility. We present a hybrid cloud-native platform being developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure. Orchestrated by Kubernetes, this novel service architecture bridges the gap between HPC batch processing and service-oriented workflows. We report our initial investigations into fine-tuning pipelines and highly available inference services, analyzing the associated trade-offs while improving user productivity. Our findings offer a blueprint for enabling supercomputers to integrate "AI Factories" services and workflows, supporting AI innovations into end-to-end scientific and industrial use cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a hybrid cloud-native platform developed and deployed at the Swiss National Supercomputing Centre (CSCS) that combines diskless GPU-enabled HPE Cray EX compute nodes with virtualized commodity infrastructure, orchestrated by Kubernetes. This architecture is intended to support the full lifecycle of foundation models on HPC systems by bridging traditional batch processing with service-oriented workflows for fine-tuning and inference; the paper reports initial investigations into pipelines and services along with associated trade-off analyses and positions the work as a blueprint for integrating AI factory services into supercomputers.
Significance. If the hybrid platform can be shown through measurements to operate without major performance penalties or conflicts, the work would provide a practical blueprint for national HPC centers to incorporate dynamic AI workloads, supporting sovereign AI initiatives and end-to-end scientific use cases beyond pre-training.
major comments (2)
- [Abstract] The claims that the platform 'bridges the gap between HPC batch processing and service-oriented workflows' and 'improves user productivity' are not supported by any quantitative benchmarks, GPU utilization data, orchestration latency, throughput deltas, or comparisons against pure-batch or pure-cloud baselines.
- [The reported initial investigations into fine-tuning pipelines and highly available inference services] These remain at the level of design choices and qualitative trade-off discussion; no empirical evidence is provided to demonstrate acceptable overheads or absence of operational conflicts in the hybrid diskless-plus-virtualized setup under Kubernetes, which is load-bearing for the central architectural claim.
minor comments (1)
- The manuscript would benefit from explicit definitions of terms such as 'AI Factories' and a clearer description of how diskless nodes interact with the virtualized commodity layer.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where the current presentation of claims exceeds the empirical content provided. We address each major comment below and describe the revisions we will make to better align the text with the scope of our initial investigations.
Point-by-point responses
- Referee: [Abstract] The claims that the platform 'bridges the gap between HPC batch processing and service-oriented workflows' and 'improves user productivity' are not supported by any quantitative benchmarks, GPU utilization data, orchestration latency, throughput deltas, or comparisons against pure-batch or pure-cloud baselines.
  Authors: We agree that the abstract asserts bridging of workflows and productivity improvements without supporting quantitative evidence. The manuscript reports architectural design and initial qualitative investigations rather than completed benchmark studies. We will revise the abstract to remove or qualify these claims, limiting it to a description of the hybrid platform, the reported design choices, and the positioning as a blueprint for future AI-factory integration.
  Revision: yes
- Referee: [The reported initial investigations into fine-tuning pipelines and highly available inference services] These remain at the level of design choices and qualitative trade-off discussion; no empirical evidence is provided to demonstrate acceptable overheads or absence of operational conflicts in the hybrid diskless-plus-virtualized setup under Kubernetes, which is load-bearing for the central architectural claim.
  Authors: The referee accurately notes that the investigations are presented through design choices and qualitative trade-off analysis rather than measured overheads or conflict data. Because the platform deployment is at an early stage, the manuscript does not contain such empirical results. We will revise the relevant sections to state explicitly that the work is preliminary, to describe the evaluation framework we intend to apply, and to avoid implying validated performance parity with pure-batch or pure-cloud baselines.
  Revision: yes
- Empirical measurements demonstrating acceptable overheads and absence of operational conflicts in the hybrid diskless-plus-virtualized Kubernetes setup on HPE Cray EX nodes
Circularity Check
No circularity: purely descriptive systems architecture report
The paper presents an architectural blueprint and initial qualitative investigations for a hybrid Kubernetes-orchestrated HPC platform supporting the full AI model lifecycle. It contains no equations, derivations, fitted parameters, predictive models, or quantitative claims that could reduce to their own inputs by construction. All load-bearing statements are design choices and trade-off discussions rather than self-referential results, self-citation chains, or renamed empirical patterns. This is a standard non-circular descriptive systems paper.