HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support
Pith reviewed 2026-05-20 22:23 UTC · model grok-4.3
The pith
An 8B LLM adapted for HPC tasks performs like much larger models but uses less memory and runs faster.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that adapting Llama 3.1 8B with QLoRA on an HPC corpus of roughly 9,000 to 24,000 examples, combined with retrieval-augmented generation, yields a practical assistant for Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The adapted model is claimed to approach the performance of larger general-purpose models such as Qwen 2.5 14B while requiring less GPU memory and lower inference latency.
What carries the argument
QLoRA-based lightweight domain adaptation of an 8B Llama model on a curated HPC corpus, integrated with dense retrieval for context-aware responses.
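To make the memory side of this argument concrete, here is a back-of-envelope sketch of weight-only memory for an 8B model stored at 4 bits per parameter (the precision QLoRA uses for the frozen base weights) versus a 14B model held at 16 bits. The 16-bit baseline, the assumption that the adapted model is also served at 4 bits, and the nominal parameter counts are illustrative assumptions; the paper's actual serving configurations are not given here.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Weight-only memory in GiB.

    Ignores KV cache, activations, LoRA adapter weights, and runtime overhead,
    so real GPU usage will be higher than this estimate.
    """
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Nominal parameter counts; precisions are assumptions, not reported settings.
adapted_8b_4bit = weight_memory_gib(8e9, 4)
baseline_14b_16bit = weight_memory_gib(14e9, 16)

print(f"8B at 4-bit:   {adapted_8b_4bit:.1f} GiB")
print(f"14B at 16-bit: {baseline_14b_16bit:.1f} GiB")
```

Under these assumptions the gap is driven as much by quantization as by parameter count, which is one reason exact baseline configurations matter for the comparison.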
Load-bearing premise
That the constructed HPC corpus and the specific evaluation cases on JetStream2 sufficiently represent the diversity of real-world HPC user needs and cluster environments.
What would settle it
Observing a significant performance drop when the model is tested on HPC queries from a previously unseen university cluster or with novel troubleshooting scenarios not covered in the training corpus.
Original abstract
Modern scientific research increasingly depends on High-Performance Computing (HPC) infrastructures, yet many researchers face significant operational barriers when interacting with cluster environments, job schedulers, GPU resources, and parallel computing frameworks. General-purpose large language models (LLMs) provide useful coding assistance but often lack the domain-specific operational knowledge required for reliable HPC support. This paper presents HPC-LLM, a retrieval augmented and domain-adapted assistant designed to support common HPC workflows including Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The proposed framework integrates automated documentation ingestion, dense retrieval, lightweight domain adaptation using QLoRA, and local inference within a modular orchestration pipeline. To support domain adaptation, we construct an HPC-oriented corpus from publicly available university HPC documentation, curated operational examples, and synthetic instruction-answer pairs generated from retrieved HPC content. The resulting dataset contains approximately 9,000 to 24,000 HPC-focused training examples spanning job scheduling, GPU computing, distributed training, storage systems, and cluster administration topics. We fine-tune Llama 3.1 8B using QLoRA and evaluate the resulting model against several open weight baselines under retrieval-augmented settings on JetStream2 infrastructure. Experimental results indicate that the adapted 8B model achieves performance comparable to substantially larger general-purpose models while operating under significantly lower GPU memory requirements and inference latency. In particular, the adapted model approaches the performance of Qwen 2.5 14B while requiring substantially fewer computational resources.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HPC-LLM, a retrieval-augmented and domain-adapted LLM assistant for HPC workflows such as Slurm scheduling, MPI, GPU utilization, and cluster troubleshooting. It builds an HPC corpus of 9,000–24,000 examples from public documentation, operational cases, and synthetic pairs; applies QLoRA adaptation to Llama 3.1 8B; and reports that the resulting model achieves performance comparable to larger general-purpose models (e.g., approaching Qwen 2.5 14B) while using substantially lower GPU memory and latency, evaluated under RAG settings on JetStream2 infrastructure.
Significance. If the central comparability claim holds under properly controlled conditions, the work would offer a practical, resource-efficient path to domain-specific HPC support that lowers barriers for researchers without access to large-scale inference hardware. The modular pipeline combining automated ingestion, dense retrieval, and lightweight adaptation is a concrete contribution to applied LLM deployment in scientific computing.
major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.
- [Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.
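The retrieval-quality measurements requested in the first major comment could be computed along the following lines. This is a minimal sketch of recall@k and binary-relevance nDCG@k, not the paper's evaluation code; the function names, document IDs, and relevance labels are hypothetical, and the ranked list is assumed to contain no duplicate IDs.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """nDCG@k with binary relevance (gain 1 for relevant, 0 otherwise)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 discounts by log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: hypothetical retrieved chunk IDs and gold-relevant IDs.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

recall5 = recall_at_k(ranked, relevant, 5)
ndcg5 = ndcg_at_k(ranked, relevant, 5)
print(recall5, round(ndcg5, 4))
```

Averaging these per-query values over the evaluation set, and reporting them alongside answer quality, would let readers separate retrieval effects from adaptation effects.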
minor comments (2)
- [Abstract] The range 'approximately 9,000 to 24,000' for the training corpus size should be replaced by a single precise figure or a clear breakdown by source.
- [Framework description] Clarify the exact retrieval model, embedding dimension, and top-k value used in the RAG pipeline, as these parameters directly affect the reported inference latency and memory figures.
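To illustrate why the embedding dimension and top-k value matter, the sketch below implements brute-force dense retrieval by cosine similarity and estimates the size of a flat float32 index. The toy three-dimensional vectors, chunk names, and the 10,000-chunk, 768-dimension figures are hypothetical; the paper's actual retriever and vector store settings are exactly what this comment asks the authors to report.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k most similar (chunk_id, score) pairs, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def index_bytes(num_chunks: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a flat index storing one float32 vector per chunk."""
    return num_chunks * dim * bytes_per_value


# Hypothetical chunk embeddings (real embeddings have hundreds of dimensions).
docs = {
    "sbatch_basics": [0.9, 0.1, 0.0],
    "mpi_launch": [0.1, 0.9, 0.0],
    "gpu_quota": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], docs, k=2)
print([doc_id for doc_id, _ in results])
print(index_bytes(10_000, 768), "bytes for 10,000 chunks at dimension 768")
```

A larger k lengthens the prompt passed to the generator, so it affects latency and KV-cache memory in addition to answer quality.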
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the rigor and reproducibility of our claims. We address each major comment point by point below and have made revisions to the manuscript where the concerns are valid.
- Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.
  Authors: We agree that the current manuscript lacks sufficient quantitative detail to fully support the comparability claim. In the revised version, we will expand the Evaluation section to report concrete metrics (e.g., accuracy or success rate on HPC task categories), error bars derived from multiple independent runs, exact baseline configurations including retrieval parameters and prompting strategies, retrieval-quality measurements such as recall@5 and nDCG, and results of statistical significance tests comparing the adapted model to Qwen 2.5 14B. These additions will be placed in both the abstract summary and the main evaluation tables.
  revision: yes
- Referee: [Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.
  Authors: We acknowledge that the absence of explicit data-handling details creates ambiguity regarding potential contamination. We will revise the Data construction and Evaluation sections to describe the train/test split procedure (including the 80/20 ratio and hold-out criteria), decontamination steps (e.g., embedding-based similarity filtering to remove near-duplicates between training examples and evaluation queries), and confirmation that the JetStream2 evaluation queries were collected from live operational logs and user-submitted tickets that were never used in corpus construction or synthetic pair generation. This will be supported by a new subsection on data provenance and leakage prevention.
  revision: yes
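The decontamination filter and 80/20 split mentioned in the second response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: word-level Jaccard overlap stands in for embedding similarity so the example runs without a model, and the 0.8 threshold, function names, and sample questions are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in for cosine similarity between embeddings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def decontaminate(train: Sequence[str],
                  eval_queries: Sequence[str],
                  similarity: Callable[[str, str], float] = jaccard,
                  threshold: float = 0.8) -> List[str]:
    """Drop training examples that are near-duplicates of any evaluation query."""
    return [
        example for example in train
        if all(similarity(example, query) < threshold for query in eval_queries)
    ]


def split_80_20(items: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Deterministic shuffled split; integer arithmetic avoids float rounding."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical examples: the first training item duplicates the evaluation query.
train_examples = [
    "how do I submit a job with sbatch",
    "how to check gpu usage with nvidia-smi",
    "why is my mpi job hanging at startup",
]
held_out_queries = ["How do I submit a job with sbatch"]

clean = decontaminate(train_examples, held_out_queries)
train_part, test_part = split_80_20([f"example-{i}" for i in range(10)])
print(len(clean), len(train_part), len(test_part))
```

Filtering against the evaluation queries before any split or synthetic-pair generation is the ordering that guards against leakage; the reverse order can let near-duplicates through.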
Circularity Check
No circularity: empirical adaptation and external benchmarking
The paper constructs an HPC corpus from public documentation, operational examples, and synthetic pairs, applies QLoRA fine-tuning to Llama 3.1 8B, and reports performance on JetStream2 against independent open-weight baselines such as Qwen 2.5 14B. No equations, predictions, or first-principles derivations are present that reduce reported gains to quantities defined by the paper's own fitted parameters or self-citations. Evaluation uses external infrastructure and models, satisfying the criterion for self-contained results against external benchmarks. No load-bearing steps match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
free parameters (1)
- QLoRA adaptation hyperparameters
axioms (1)
- domain assumption: Publicly available university HPC documentation combined with synthetic instruction pairs is sufficient to capture the operational knowledge needed for reliable cluster support.