Recognition: no theorem link
An End-to-End Framework for Building Large Language Models for Software Operations
Pith reviewed 2026-05-13 06:17 UTC · model grok-4.3
The pith
A specialized LLM for software operations outperforms general models on question answering and root cause analysis after targeted training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpsLLM supports QA and RCA in software operations through a workflow of human-in-the-loop data curation from raw logs into a 15K dataset, supervised fine-tuning to create a base model, and reinforcement learning with a domain process reward model to boost RCA accuracy. On tasks of varying difficulty, it outperforms open-source and closed-source LLMs with accuracy gains of 0.2% to 5.7% for QA and 2.7% to 70.3% for RCA, while showing strong transferability. Three parameter sizes of the model along with the dataset are to be open-sourced.
What carries the argument
Human-in-the-loop curation of operational data paired with a domain process reward model for reinforcement learning optimization.
If this is right
- The framework offers a replicable method for adapting LLMs to other domains with fragmented or low-quality data.
- Open-sourcing the models and dataset facilitates community-driven advancements in operations intelligence.
- The observed transferability indicates potential for the model to handle evolving operational scenarios without extensive retraining.
- Integration of human oversight and reward modeling mitigates risks from insufficient domain-specific learning.
Where Pith is reading between the lines
- Similar human-in-the-loop and reward model techniques could enhance LLM applications in fields like cybersecurity or network management where process knowledge is key.
- The approach might allow smaller models to compete with larger general ones in niche tasks, reducing computational demands for specialized uses.
- If validated on more diverse systems, this could accelerate the adoption of AI-assisted tools in IT operations.
Load-bearing premise
The human-in-the-loop mechanism produces high-quality data from raw operational logs that accurately represents the domain without introducing biases or gaps.
What would settle it
A test of the open-sourced OpsLLM models on fresh operational data from an unseen software environment, comparing RCA and QA performance against the same baseline models used in the paper.
Figures
read the original abstract
In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate highquality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to achieve a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on the tasks with diverse difficulties demonstrate that OpsLLMs effectively learns and aligns with the operational domain knowledge infused, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2%~5.7% on QA tasks and 2.7% ~70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes OpsLLM, a domain-specific LLM for software operations supporting knowledge-based QA and root cause analysis (RCA). It outlines an end-to-end workflow: a human-in-the-loop mechanism to curate a 15K fine-tuning dataset from raw operational logs, supervised fine-tuning (SFT) to create a base model, and reinforcement learning guided by a domain process reward model (DPRM) to optimize RCA accuracy and reliability. The authors claim the resulting 7B/14B/32B models outperform open- and closed-source LLMs with accuracy gains of 0.2%–5.7% on QA tasks and 2.7%–70.3% on RCA tasks across diverse difficulties, exhibit strong transferability, and will be open-sourced along with the dataset.
Significance. If the performance claims hold under rigorous evaluation, the work provides a practical, reproducible framework for domain-adapting LLMs to software operations, addressing data quality and knowledge fragmentation issues. The explicit commitment to open-source three model sizes and the 15K dataset is a clear strength that supports verification and extension by the community. The approach could inform similar efforts in other technical domains requiring reliable reasoning over operational data.
major comments (2)
- [Abstract / Experimental Results] Abstract and Experimental Results section: The central claims of accuracy improvements (0.2%–5.7% QA, 2.7%–70.3% RCA) and strong transferability are presented without any description of experimental design, baseline models, number of runs, statistical significance tests, train/test splits, or error bars. This absence makes the reported gains unverifiable and is load-bearing for the paper's assertion that the framework outperforms existing LLMs.
- [Framework Description / Reinforcement Learning] Framework and Reinforcement Learning stage: No ablation is reported that isolates the DPRM's contribution during RL from the effects of the human-curated 15K SFT dataset. The large RCA gains could be driven primarily by data curation quality rather than the DPRM or full pipeline; an SFT-only baseline on the identical 15K data is required to substantiate the claim that DPRM is key to reliability and transferability.
minor comments (2)
- [Abstract] Inconsistent terminology: 'OpsLLM' and 'OpsLLMs' appear interchangeably in the abstract; adopt a single consistent name throughout.
- The description of 'tasks with diverse difficulties' and the evaluation of 'strong transferability' would benefit from explicit definitions, metrics, and example task breakdowns to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on experimental clarity and ablation studies. We address each major comment below and will revise the manuscript to strengthen verifiability while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The central claims of accuracy improvements (0.2%–5.7% QA, 2.7%–70.3% RCA) and strong transferability are presented without any description of experimental design, baseline models, number of runs, statistical significance tests, train/test splits, or error bars. This absence makes the reported gains unverifiable and is load-bearing for the paper's assertion that the framework outperforms existing LLMs.
Authors: We agree that the abstract and Experimental Results section should be self-contained for verifiability. The full paper describes the baselines (open-source models such as Llama-2-7B/13B/70B and closed-source models such as GPT-3.5/4), the 15K dataset curation process, and evaluation across difficulty levels. To address the gap, we will revise the abstract to briefly note the evaluation protocol and expand the Experimental Results section with: explicit train/test splits (80/20 on curated data with held-out operational logs), number of runs (3 independent runs with different random seeds), error bars on all accuracy figures, and statistical significance tests (paired t-tests with p-values reported for the observed gains). These additions will be incorporated in the next revision. revision: yes
-
Referee: [Framework Description / Reinforcement Learning] Framework and Reinforcement Learning stage: No ablation is reported that isolates the DPRM's contribution during RL from the effects of the human-curated 15K SFT dataset. The large RCA gains could be driven primarily by data curation quality rather than the DPRM or full pipeline; an SFT-only baseline on the identical 15K data is required to substantiate the claim that DPRM is key to reliability and transferability.
Authors: We recognize that an SFT-only ablation on the identical 15K dataset is necessary to isolate the DPRM's incremental value during RL. The current experiments compare the full pipeline against general-purpose LLMs, but do not include this specific control. We will add the requested SFT-only baseline (trained on the same human-curated 15K data without the RL/DPRM stage) and report its performance on QA and RCA tasks, including transferability metrics. This will allow direct quantification of the DPRM's contribution to reliability and will be included in the revised Experimental Results section. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical pipeline of human-in-the-loop data curation from operational logs, supervised fine-tuning on a 15K dataset, and RL with a domain process reward model, followed by accuracy measurements on QA and RCA tasks. All performance claims (0.2%–5.7% QA gains, 2.7%–70.3% RCA gains) are defined via direct comparison to independent external open- and closed-source LLMs rather than any quantity constructed from the model's own fitted parameters or self-referential metrics. No equations, uniqueness theorems, or ansatzes are introduced that reduce to prior self-citations or internal definitions by construction; the evaluation remains externally falsifiable against held-out benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
MCP 1st Birthday. 2025. smoltrace-aiops-tasks. https://huggingface.co/datasets/ MCP-1st-Birthday/smoltrace-aiops-tasks
work page 2025
-
[2]
Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending root-cause and mitigation steps for cloud incidents using large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1737–1749
work page 2023
-
[3]
AIOps. 2025. AIOps Challenges. https://challenge.aiops.cn/home/competition
work page 2025
- [4]
-
[5]
Nils Bars, Moritz Schloegel, Tobias Scharnowski, Nico Schiller, and Thorsten Holz. 2023. Fuzztruction: Using Fault Injection-based Fuzzing to Leverage Im- plicit Domain Knowledge. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 1847–1864. https://www.usenix.org/ conference/usenixsecurity23/presentation/bars
work page 2023
-
[6]
CAMEL-AI. 2025. AIOps (CAMEL-AI). https://huggingface.co/datasets/camel- ai/aiops
work page 2025
-
[7]
camel ai. 2026. aiops-qwen-4b. https://modelscope.cn/models/camel-ai/aiops- qwen-4b. Accessed: 2026-02
work page 2026
- [8]
-
[9]
Yuanliang Chen, Fuchen Ma, Yuanhang Zhou, Zhen Yan, and Yu Jiang. 2025. CAFault: enhance fault injection technique in practical distributed systems via abundant fault-dependent configurations. InProceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference(Boston, MA, USA)(USENIX ATC ’25). USENIX Association, USA, Article 83, 16 pages
work page 2025
-
[10]
Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. 2025. AIOp- sLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. arXiv:2501.06706 [cs.AI] https://arxiv.org/abs/2501.06706
- [11]
-
[12]
Cloudflare. 2025. Cloudflare outage on November 18, 2025. https://blog.cloudflare. com/18-november-2025-outage/. Accessed: 2026-02
work page 2025
- [13]
-
[14]
Mingzhe Du, Luu Anh Tuan, Bin Ji, Xiaobao Wu, Yuhao Qing, Dong Huang, Terry Yue Zhuo, Qian Liu, and See Kiong Ng. 2025. Codearena: A collective evalu- ation platform for llm code generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 502–512
work page 2025
-
[15]
Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, and Saravan Rajmohan. 2024. X-lifecycle learning for cloud incident management using llms. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 417– 428
work page 2024
-
[16]
Ahmed Gongi. 2024. Devops_LLM. https://huggingface.co/datasets/ahmedgongi/ Devops_LLM
work page 2024
-
[17]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [18]
-
[19]
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models.ICLR1, 2 (2022), 3
work page 2022
- [20]
-
[21]
Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, et al. 2024. Xpert: Em- powering incident management with query recommendations via large language models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
work page 2024
-
[22]
Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, et al. 2023. Assess and summarize: Improve outage understanding with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1657–1668
work page 2023
-
[23]
kim dong gyu. 2025. aiops_finetuning_qaset. https://huggingface.co/datasets/ DKCode9/aiops_finetuning_qaset
work page 2025
-
[24]
Chenxiao Liu, Zheyong Xie, Sirui Zhao, Jin Zhou, Tong Xu, Minglei Li, and Enhong Chen. 2024. Speak from heart: an emotion-guided LLM-based multimodal method for emotional dialogue generation. InProceedings of the 2024 International Conference on Multimedia Retrieval. 533–542
work page 2024
- [25]
-
[26]
Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Chen, and Shaowei Wang
-
[27]
InProceedings of the IEEE/ACM 46th International Conference on Software Engineering
Llmparser: An exploratory study on using large language models for log parsing. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
- [28]
-
[29]
MingtaoHu. 2025. AIOps (Hu et al.). https://huggingface.co/datasets/Epiphanyi/ aiops
work page 2025
-
[30]
Eduardo Mosqueira-Rey, Elena Hernández-Pereira, David Alonso-Ríos, José Bobes-Bascarán, and Ángel Fernández-Leal. 2023. Human-in-the-loop machine learning: a state of the art.Artificial Intelligence Review56, 4 (2023), 3005–3054
work page 2023
-
[31]
OpenAI. 2025. ChatGPT. https://chatgpt.com. Accessed: 2026-02
work page 2025
-
[32]
CloudWise OpenSource. 2025. GAIA-DataSet. https://github.com/CloudWise- OpenSource/GAIA-DataSet. Conference’17, July 2017, Washington, DC, USA Trovato et al
work page 2025
-
[33]
Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, et al . 2025. Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis. In Companion Proceedings of the ACM on Web Conference 2025. 422–431
work page 2025
-
[34]
Luan Pham, Hongyu Zhang, Huong Ha, Flora Salim, and Xiuzhen Zhang. 2025. RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data. InCompanion Proceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia)(WWW ’25). Association for Computing Machinery, New York, NY, USA, 777–780
work page 2025
-
[35]
phpcool. 2025. DeepSeek-R1-Distill-SRE-Qwen-32B-INT8. https://modelscope. cn/models/phpcool/DeepSeek-R1-Distill-SRE-Qwen-32B-INT8. Accessed: 2026- 02
work page 2025
-
[36]
phpcool. 2025. DeepSeek-R1-Distill-SRE-Qwen-7B. https://modelscope.cn/ models/phpcool/DeepSeek-R1-Distill-SRE-Qwen-7B. Accessed: 2026-02
work page 2025
-
[37]
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Ro- drigo Fonseca, and Saravan Rajmohan. 2024. Exploring llm-based agents for root cause analysis. InCompanion proceedings of the 32nd ACM international conference on the foundations of software engineering. 208–219
work page 2024
-
[38]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework.arXiv preprint arXiv: 2409.19256(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey.International Journal of Computer Vision130, 6 (2022), 1526– 1565
work page 2022
-
[41]
StabilityLab. 2025. zhiyu. https://modelscope.cn/models/StabilityLab/zhiyu. Accessed: 2026-02
work page 2025
-
[42]
ITBench Team. 2025. ITBench Agent Trajectories: LLM Agent Executions for IT Automation Tasks. https://huggingface.co/datasets/itbench/itbench-trajectories
work page 2025
-
[43]
Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm. github.io/blog/qwen2.5/
work page 2024
-
[44]
Rupa Vemu. 2025. AIOps_training. https://huggingface.co/datasets/rupa99/ AIOps_training
work page 2025
- [45]
-
[46]
Jianxun Wang and Yixiang Chen. 2023. A review on code generation with llms: Application and evaluation. In2023 IEEE International Conference on Medical Artificial Intelligence (MedAI). IEEE, 284–289
work page 2023
-
[47]
Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, and Yongxiang Li
- [48]
-
[49]
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. Internvl3. 5: Ad- vancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä
Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä. 2025. Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning.Proc. ACM Softw. Eng.2, FSE, Article FSE027 (June 2025), 23 pages
work page 2025
-
[51]
Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, and Qi Zhang. 2025. OpenRCA: Can large language models locate the root cause of software failures?. InThe Thirteenth International Conference on Learning Representations
work page 2025
-
[52]
Zihao Yi, Jiarui Ouyang, Zhe Xu, Yuwen Liu, Tianhao Liao, Haohao Luo, and Ying Shen. 2024. A survey on recent advances in llm-based multi-turn dialogue systems.Comput. Surveys(2024)
work page 2024
-
[53]
Guangba Yu, Pengfei Chen, Zilong He, Qiuyu Yan, Yu Luo, Fangyuan Li, and Zibin Zheng. 2024. ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems.Proc. ACM Softw. Eng.1, FSE, Article 2 (July 2024), 23 pages
work page 2024
-
[54]
Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024. Automated root causing of cloud incidents using in-context learning with GPT-4. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 266–277
work page 2024
- [55]
-
[56]
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Aoxiao Zhong, Dengyao Mo, Guiyang Liu, Jinbu Liu, Qingda Lu, Qi Zhou, Jiesh- eng Wu, Quanzheng Li, and Qingsong Wen. 2024. Logparser-llm: Advancing efficient log parsing with large language models. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4559–4570
work page 2024
-
[58]
Li Zhong and Zilong Wang. 2024. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21841–21849
work page 2024
-
[59]
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al
-
[60]
Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024). A Implementation Details A.1 Representation Fusion This appendix provides additional implementation details of the multi-modal representation fusion process described in Section 3.2, which are omitted from the main text fo...
work page internal anchor Pith review arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.