HPC-LLM: Practical Domain Adaptation and Retrieval-Augmented Generation for HPC Support

Izzat Alsmadi; Nourin Shahin

arxiv: 2605.16347 · v1 · pith:M4HVZPJJnew · submitted 2026-05-08 · 💻 cs.LG

Pith reviewed 2026-05-20 22:23 UTC · model grok-4.3

keywords: support, adaptation, cluster, computing, domain, model, operational, adapted

The pith

An 8B LLM adapted for HPC support approaches the performance of a 14B general-purpose model while using less GPU memory and responding faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-performance computing clusters present operational challenges for many researchers who need help with job schedulers, parallel frameworks, and resource management. General-purpose LLMs often fall short because they lack specialized knowledge of HPC environments. This paper shows how to build an effective support assistant by ingesting public HPC documentation, creating synthetic training examples, and applying lightweight fine-tuning with QLoRA to an 8B model. The resulting system, paired with retrieval, achieves results close to those of 14B-scale models while requiring far fewer computational resources during use.

Core claim

The central claim is that domain adaptation of Llama 3.1 8B via QLoRA on an HPC corpus of 9,000 to 24,000 examples, when combined with retrieval-augmented generation, yields a practical assistant for Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The adapted model approaches the performance of larger general-purpose models such as Qwen 2.5 14B while requiring less GPU memory and lower inference latency.

What carries the argument

QLoRA-based lightweight domain adaptation of an 8B Llama model on a curated HPC corpus, integrated with dense retrieval for context-aware responses.

Load-bearing premise

That the constructed HPC corpus and the specific evaluation cases on JetStream2 sufficiently represent the diversity of real-world HPC user needs and cluster environments.

What would settle it

Observing a significant performance drop when the model is tested on HPC queries from a previously unseen university cluster or with novel troubleshooting scenarios not covered in the training corpus.

Original abstract

Modern scientific research increasingly depends on High-Performance Computing (HPC) infrastructures, yet many researchers face significant operational barriers when interacting with cluster environments, job schedulers, GPU resources, and parallel computing frameworks. General-purpose large language models (LLMs) provide useful coding assistance but often lack the domain-specific operational knowledge required for reliable HPC support. This paper presents HPC-LLM, a retrieval augmented and domain-adapted assistant designed to support common HPC workflows including Slurm scheduling, MPI execution, GPU utilization, filesystem management, and cluster troubleshooting. The proposed framework integrates automated documentation ingestion, dense retrieval, lightweight domain adaptation using QLoRA, and local inference within a modular orchestration pipeline. To support domain adaptation, we construct an HPC-oriented corpus from publicly available university HPC documentation, curated operational examples, and synthetic instruction-answer pairs generated from retrieved HPC content. The resulting dataset contains approximately 9,000 to 24,000 HPC-focused training examples spanning job scheduling, GPU computing, distributed training, storage systems, and cluster administration topics. We fine-tune Llama 3.1 8B using QLoRA and evaluate the resulting model against several open weight baselines under retrieval-augmented settings on JetStream2 infrastructure. Experimental results indicate that the adapted 8B model achieves performance comparable to substantially larger general-purpose models while operating under significantly lower GPU memory requirements and inference latency. In particular, the adapted model approaches the performance of Qwen 2.5 14B while requiring substantially fewer computational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HPC-LLM applies QLoRA plus RAG to a new university-derived corpus and shows an 8B model nearing 14B performance on JetStream2, but the evaluation leaves the independence of test queries from synthetic data unclear.

The main thing to know is that this paper takes standard QLoRA adaptation and dense retrieval, builds a 9k-24k example HPC corpus from public docs plus synthetic pairs, and reports that the resulting Llama 3.1 8B model matches or approaches larger baselines like Qwen 2.5 14B on JetStream2 while using less memory and running faster under RAG settings. The modular pipeline for ingestion, retrieval, and local inference is described clearly enough to be useful for anyone trying to stand up a domain assistant for Slurm, MPI, or cluster troubleshooting. That practical focus and the real-infrastructure test are the parts that hold up without much fanfare. The work is incremental rather than foundational, but the resource savings and domain coverage make it a reasonable extension of existing techniques.

The soft spot is exactly the one flagged in the stress test: the central comparability claim depends on test queries being independent of the synthetic corpus generation. The abstract gives no details on train/test splits, decontamination, or how external validation queries were chosen, so it is hard to rule out in-distribution effects. If the full methods section shows clean separation and external queries, that concern shrinks; otherwise the numbers are only moderately convincing. No load-bearing math or fitting issues appear, and the citation pattern looks normal for an applied paper.

This is for HPC users who need better LLM support and for applied researchers doing domain adaptation on technical documentation. A reader who wants a working example of lightweight fine-tuning plus retrieval in a real setting will get value from the pipeline and the JetStream2 numbers. It deserves a serious referee because the evaluation uses actual infrastructure and the problem is real, even if the novelty is modest. Send it for review but ask specifically for the data separation protocol and full quantitative tables with baselines.

Referee Report

2 major / 2 minor

Summary. The paper presents HPC-LLM, a retrieval-augmented and domain-adapted LLM assistant for HPC workflows such as Slurm scheduling, MPI, GPU utilization, and cluster troubleshooting. It builds an HPC corpus of 9,000–24,000 examples from public documentation, operational cases, and synthetic pairs; applies QLoRA adaptation to Llama 3.1 8B; and reports that the resulting model achieves performance comparable to larger general-purpose models (e.g., approaching Qwen 2.5 14B) while using substantially lower GPU memory and latency, evaluated under RAG settings on JetStream2 infrastructure.

Significance. If the central comparability claim holds under properly controlled conditions, the work would offer a practical, resource-efficient path to domain-specific HPC support that lowers barriers for researchers without access to large-scale inference hardware. The modular pipeline combining automated ingestion, dense retrieval, and lightweight adaptation is a concrete contribution to applied LLM deployment in scientific computing.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.
[Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.
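
The decontamination step this comment asks for can be illustrated with a small overlap filter. The sketch below is only an assumption about what such a check might look like: it uses word-trigram Jaccard overlap (a standard-library stand-in, not a method the paper describes), and the function names, the 0.5 threshold, and the example strings are all hypothetical.

```python
import re

def ngrams(text: str, n: int = 3) -> set:
    """Return the set of lower-cased word n-grams in a string."""
    words = re.findall(r"[a-z0-9_\-]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def decontaminate(train, eval_queries, threshold=0.5, n=3):
    """Split training examples into (kept, dropped).

    An example is dropped when its n-gram overlap with ANY evaluation
    query reaches the threshold. The threshold is an illustrative
    assumption, not a value taken from the paper.
    """
    eval_grams = [ngrams(q, n) for q in eval_queries]
    kept, dropped = [], []
    for example in train:
        grams = ngrams(example, n)
        if any(jaccard(grams, e) >= threshold for e in eval_grams):
            dropped.append(example)
        else:
            kept.append(example)
    return kept, dropped

# Hypothetical data: the first training item nearly duplicates the eval query.
train_examples = [
    "How do I request two GPUs in a Slurm batch script?",
    "What does the squeue command show?",
]
held_out = ["How do I request two GPUs in a Slurm batch job?"]

kept, dropped = decontaminate(train_examples, held_out)
print("kept:", kept)
print("dropped:", dropped)
```

Reporting how many training examples such a filter removes, and at what threshold, is the kind of detail that would let a reader judge whether the evaluation queries are independent of the training corpus.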

minor comments (2)

[Abstract] The range 'approximately 9,000 to 24,000' for the training corpus size should be replaced by a single precise figure or a clear breakdown by source.
[Framework description] Clarify the exact retrieval model, embedding dimension, and top-k value used in the RAG pipeline, as these parameters directly affect the reported inference latency and memory figures.
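
To make the role of the embedding dimension and top-k concrete, here is a minimal sketch of dense retrieval by cosine similarity. It is not the paper's retriever: the three-dimensional vectors, the document identifiers, and the `top_k` helper are hypothetical, and a real system would obtain its vectors from an embedding model and a vector store.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, index, k=3):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy index: hypothetical 3-dimensional embeddings of documentation chunks.
INDEX = {
    "slurm_sbatch": [0.9, 0.1, 0.0],
    "mpi_launch": [0.1, 0.9, 0.1],
    "gpu_quota": [0.6, 0.0, 0.8],
}

query = [1.0, 0.0, 0.2]  # hypothetical embedding of a user question
print(top_k(query, INDEX, k=2))  # ['slurm_sbatch', 'gpu_quota']
```

Every retrieved chunk is appended to the prompt, so a larger k lengthens the context the model must process, and a larger embedding dimension increases the cost of each similarity computation. This is why the comment ties these parameters to the reported latency and memory figures.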

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving the rigor and reproducibility of our claims. We address each major comment point by point below and have made revisions to the manuscript where the concerns are valid.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim that the adapted 8B model 'approaches the performance of Qwen 2.5 14B' is load-bearing for the paper's central contribution, yet the manuscript provides no quantitative metrics, error bars, exact baseline configurations, retrieval-quality measurements, or statistical significance tests. Without these, the comparability result cannot be assessed or reproduced.

Authors: We agree that the current manuscript lacks sufficient quantitative detail to fully support the comparability claim. In the revised version, we will expand the Evaluation section to report concrete metrics (e.g., accuracy or success rate on HPC task categories), error bars derived from multiple independent runs, exact baseline configurations including retrieval parameters and prompting strategies, retrieval-quality measurements such as recall@5 and nDCG, and results of statistical significance tests comparing the adapted model to Qwen 2.5 14B. These additions will be placed in both the abstract summary and the main evaluation tables.

Revision: yes

Referee: [Data construction / Evaluation] Data construction and Evaluation sections: the corpus combines public documentation with synthetic instruction-answer pairs, but the manuscript does not describe train/test split construction, decontamination procedures, or whether evaluation queries on JetStream2 were drawn from an independent external source. This leaves open the possibility that reported gains reflect data overlap rather than genuine domain adaptation, directly undermining the fairness of the comparison to external baselines.

Authors: We acknowledge that the absence of explicit data-handling details creates ambiguity regarding potential contamination. We will revise the Data construction and Evaluation sections to describe the train/test split procedure (including the 80/20 ratio and hold-out criteria), decontamination steps (e.g., embedding-based similarity filtering to remove near-duplicates between training examples and evaluation queries), and confirmation that the JetStream2 evaluation queries were collected from live operational logs and user-submitted tickets that were never used in corpus construction or synthetic pair generation. This will be supported by a new subsection on data provenance and leakage prevention.

Revision: yes
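
The retrieval-quality measures named in the first response, recall@k and nDCG, have standard definitions that can be sketched in a few lines. The version below assumes binary relevance labels and a ranked list without duplicates; the document ids are hypothetical and no numbers here come from the paper.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    rel = set(relevant)
    # Gain of 1 for each relevant hit, discounted by log2(position + 1),
    # where position is 1-based (rank is 0-based, hence rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in rel
    )
    # Ideal ranking places all relevant documents first.
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranked result list and ground-truth relevant set.
ranked = ["d3", "d1", "d7", "d2", "d9"]
gold = {"d1", "d2"}

print(recall_at_k(ranked, gold, 5))          # 1.0
print(recall_at_k(ranked, gold, 3))          # 0.5
print(round(ndcg_at_k(ranked, gold, 5), 4))  # 0.6509
```

Recall@k only checks whether the relevant chunks appear in the top k, while nDCG also rewards ranking them earlier, so reporting both separates retrieval coverage from ranking quality.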

Circularity Check

0 steps flagged

No circularity: empirical adaptation and external benchmarking

The paper constructs an HPC corpus from public documentation, operational examples, and synthetic pairs, applies QLoRA fine-tuning to Llama 3.1 8B, and reports performance on JetStream2 against independent open-weight baselines such as Qwen 2.5 14B. No equations, predictions, or first-principles derivations are present that reduce reported gains to quantities defined by the paper's own fitted parameters or self-citations. Evaluation uses external infrastructure and models, satisfying the criterion for self-contained results against external benchmarks. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that public HPC documentation plus synthetic pairs yield high-quality training data and that retrieval-augmented inference on JetStream2 reflects real operational value; no new physical entities or mathematical axioms are introduced.

free parameters (1)

QLoRA adaptation hyperparameters
Rank, alpha, and dropout values for the lightweight fine-tuning step are not specified in the abstract and must be chosen to achieve the reported performance.
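
As a worked illustration of why the rank matters, the sketch below counts the trainable parameters that LoRA adds. A rank-r adapter on a weight matrix of shape d_in x d_out adds r * (d_in + d_out) parameters, and the LoRA update is scaled by alpha / r. The layer shapes are the attention projection sizes assumed here for Llama 3.1 8B (hidden size 4096, 32 layers, grouped-query attention with 1024-wide key and value projections). Which modules the authors adapted and which rank they used are not stated in the abstract, so the choices below are assumptions.

```python
# Assumed (d_in, d_out) shapes of the attention projections in one layer.
ATTN_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer layers

def lora_params(rank, shapes=ATTN_SHAPES, layers=NUM_LAYERS):
    """Trainable parameters added by rank-`rank` LoRA adapters.

    Each adapted matrix W (d_in x d_out) gets two factors,
    A (d_in x rank) and B (rank x d_out), so it contributes
    rank * (d_in + d_out) trainable parameters.
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * layers

for r in (8, 16, 64):
    print(f"rank {r:>2}: {lora_params(r):,} trainable parameters")
# rank  8: 6,815,744 trainable parameters
# rank 16: 13,631,488 trainable parameters
# rank 64: 54,525,952 trainable parameters
```

Under these assumptions the adapter size grows linearly with the rank and stays a small fraction of the 8B base model, which is why the unreported rank, alpha, and dropout values are free parameters that affect both training cost and the reported results.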

axioms (1)

Domain assumption: Publicly available university HPC documentation combined with synthetic instruction pairs is sufficient to capture the operational knowledge needed for reliable cluster support.
This premise underpins the construction of the 9,000-24,000 example training set and the claim of effective domain adaptation.

pith-pipeline@v0.9.0 · 5805 in / 1554 out tokens · 44466 ms · 2026-05-20T22:23:35.403833+00:00 · methodology

