arxiv: 2605.02906 · v2 · submitted 2026-04-06 · 💻 cs.LG

Recognition: no theorem link

An End-to-End Framework for Building Large Language Models for Software Operations

Jingkai He , Pengfei Chen , Chenghui Wu , Shuang Liang , Ye Li , Gou Tan , Xiadao Wen , Chuanfu Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords large language modelssoftware operationsroot cause analysisquestion answeringhuman-in-the-loopreinforcement learningdomain-specific modelsfine-tuning

0 comments

The pith

A specialized LLM for software operations outperforms general models on question answering and root cause analysis after targeted training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpsLLM, an end-to-end framework for developing large language models tailored to software operations that handle both knowledge-based question answering and root cause analysis. A human-in-the-loop mechanism curates a fine-tuning dataset from raw operational logs, followed by supervised fine-tuning and reinforcement learning enhanced by a domain process reward model. This process enables the model to better incorporate operational domain knowledge, leading to superior performance compared to general-purpose LLMs. The authors support this by reporting accuracy improvements and committing to release the models and dataset for broader use.

Core claim

OpsLLM supports QA and RCA in software operations through a workflow of human-in-the-loop data curation from raw logs into a 15K dataset, supervised fine-tuning to create a base model, and reinforcement learning with a domain process reward model to boost RCA accuracy. On tasks of varying difficulty, it outperforms open-source and closed-source LLMs with accuracy gains of 0.2% to 5.7% for QA and 2.7% to 70.3% for RCA, while showing strong transferability. Three parameter sizes of the model along with the dataset are to be open-sourced.

What carries the argument

Human-in-the-loop curation of operational data paired with a domain process reward model for reinforcement learning optimization.

If this is right

The framework offers a replicable method for adapting LLMs to other domains with fragmented or low-quality data.
Open-sourcing the models and dataset facilitates community-driven advancements in operations intelligence.
The observed transferability indicates potential for the model to handle evolving operational scenarios without extensive retraining.
Integration of human oversight and reward modeling mitigates risks from insufficient domain-specific learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar human-in-the-loop and reward model techniques could enhance LLM applications in fields like cybersecurity or network management where process knowledge is key.
The approach might allow smaller models to compete with larger general ones in niche tasks, reducing computational demands for specialized uses.
If validated on more diverse systems, this could accelerate the adoption of AI-assisted tools in IT operations.

Load-bearing premise

The human-in-the-loop mechanism produces high-quality data from raw operational logs that accurately represents the domain without introducing biases or gaps.

What would settle it

A test of the open-sourced OpsLLM models on fresh operational data from an unseen software environment, comparing RCA and QA performance against the same baseline models used in the paper.

Figures

Figures reproduced from arXiv: 2605.02906 by Chenghui Wu, Chuanfu Zhang, Gou Tan, Jingkai He, Pengfei Chen, Shuang Liang, Xiadao Wen, Ye Li.

**Figure 2.** Figure 2: Across different LLMs, more than 80% of responses [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The workflow of model construction and evaluation. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Ops fine-tuning dataset construction workflow. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Workflow of QA pairs to single-choice MCQ. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: (a) shows the improvement of LLaMA-based models [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Testing removing each subset from the fine-tuning [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt templates for the three difficulty levels of RCA. [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of practical RCA reasoning examples across different LLMs. [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

read the original abstract

In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate highquality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to achieve a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on the tasks with diverse difficulties demonstrate that OpsLLMs effectively learns and aligns with the operational domain knowledge infused, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2%~5.7% on QA tasks and 2.7% ~70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OpsLLM lays out a usable workflow for ops-domain LLMs and commits to releasing the models plus the 15K dataset, but the RCA gains are not isolated from the curated data.

read the letter

The main thing here is a concrete pipeline for adapting LLMs to software operations. It covers human-in-the-loop curation of raw logs into a 15K fine-tuning set, supervised fine-tuning to get a base model, and then a domain process reward model during reinforcement learning to target root cause analysis accuracy. They plan to open-source the 7B, 14B, and 32B versions along with the dataset, which removes a common barrier for follow-up work in this area.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes OpsLLM, a domain-specific LLM for software operations supporting knowledge-based QA and root cause analysis (RCA). It outlines an end-to-end workflow: a human-in-the-loop mechanism to curate a 15K fine-tuning dataset from raw operational logs, supervised fine-tuning (SFT) to create a base model, and reinforcement learning guided by a domain process reward model (DPRM) to optimize RCA accuracy and reliability. The authors claim the resulting 7B/14B/32B models outperform open- and closed-source LLMs with accuracy gains of 0.2%–5.7% on QA tasks and 2.7%–70.3% on RCA tasks across diverse difficulties, exhibit strong transferability, and will be open-sourced along with the dataset.

Significance. If the performance claims hold under rigorous evaluation, the work provides a practical, reproducible framework for domain-adapting LLMs to software operations, addressing data quality and knowledge fragmentation issues. The explicit commitment to open-source three model sizes and the 15K dataset is a clear strength that supports verification and extension by the community. The approach could inform similar efforts in other technical domains requiring reliable reasoning over operational data.

major comments (2)

[Abstract / Experimental Results] Abstract and Experimental Results section: The central claims of accuracy improvements (0.2%–5.7% QA, 2.7%–70.3% RCA) and strong transferability are presented without any description of experimental design, baseline models, number of runs, statistical significance tests, train/test splits, or error bars. This absence makes the reported gains unverifiable and is load-bearing for the paper's assertion that the framework outperforms existing LLMs.
[Framework Description / Reinforcement Learning] Framework and Reinforcement Learning stage: No ablation is reported that isolates the DPRM's contribution during RL from the effects of the human-curated 15K SFT dataset. The large RCA gains could be driven primarily by data curation quality rather than the DPRM or full pipeline; an SFT-only baseline on the identical 15K data is required to substantiate the claim that DPRM is key to reliability and transferability.

minor comments (2)

[Abstract] Inconsistent terminology: 'OpsLLM' and 'OpsLLMs' appear interchangeably in the abstract; adopt a single consistent name throughout.
The description of 'tasks with diverse difficulties' and the evaluation of 'strong transferability' would benefit from explicit definitions, metrics, and example task breakdowns to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental clarity and ablation studies. We address each major comment below and will revise the manuscript to strengthen verifiability while preserving the core contributions.

read point-by-point responses

Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The central claims of accuracy improvements (0.2%–5.7% QA, 2.7%–70.3% RCA) and strong transferability are presented without any description of experimental design, baseline models, number of runs, statistical significance tests, train/test splits, or error bars. This absence makes the reported gains unverifiable and is load-bearing for the paper's assertion that the framework outperforms existing LLMs.

Authors: We agree that the abstract and Experimental Results section should be self-contained for verifiability. The full paper describes the baselines (open-source models such as Llama-2-7B/13B/70B and closed-source models such as GPT-3.5/4), the 15K dataset curation process, and evaluation across difficulty levels. To address the gap, we will revise the abstract to briefly note the evaluation protocol and expand the Experimental Results section with: explicit train/test splits (80/20 on curated data with held-out operational logs), number of runs (3 independent runs with different random seeds), error bars on all accuracy figures, and statistical significance tests (paired t-tests with p-values reported for the observed gains). These additions will be incorporated in the next revision. revision: yes
Referee: [Framework Description / Reinforcement Learning] Framework and Reinforcement Learning stage: No ablation is reported that isolates the DPRM's contribution during RL from the effects of the human-curated 15K SFT dataset. The large RCA gains could be driven primarily by data curation quality rather than the DPRM or full pipeline; an SFT-only baseline on the identical 15K data is required to substantiate the claim that DPRM is key to reliability and transferability.

Authors: We recognize that an SFT-only ablation on the identical 15K dataset is necessary to isolate the DPRM's incremental value during RL. The current experiments compare the full pipeline against general-purpose LLMs, but do not include this specific control. We will add the requested SFT-only baseline (trained on the same human-curated 15K data without the RL/DPRM stage) and report its performance on QA and RCA tasks, including transferability metrics. This will allow direct quantification of the DPRM's contribution to reliability and will be included in the revised Experimental Results section. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical pipeline of human-in-the-loop data curation from operational logs, supervised fine-tuning on a 15K dataset, and RL with a domain process reward model, followed by accuracy measurements on QA and RCA tasks. All performance claims (0.2%–5.7% QA gains, 2.7%–70.3% RCA gains) are defined via direct comparison to independent external open- and closed-source LLMs rather than any quantity constructed from the model's own fitted parameters or self-referential metrics. No equations, uniqueness theorems, or ansatzes are introduced that reduce to prior self-citations or internal definitions by construction; the evaluation remains externally falsifiable against held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes standard supervised fine-tuning and reinforcement learning components without introducing new free parameters, unproved axioms, or postulated entities beyond the DPRM, which is presented as a domain-adapted instance of an existing reward-model technique.

pith-pipeline@v0.9.0 · 5589 in / 1424 out tokens · 76620 ms · 2026-05-13T06:17:21.345098+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 6 internal anchors

[1]

MCP 1st Birthday. 2025. smoltrace-aiops-tasks. https://huggingface.co/datasets/ MCP-1st-Birthday/smoltrace-aiops-tasks

work page 2025
[2]

Toufique Ahmed, Supriyo Ghosh, Chetan Bansal, Thomas Zimmermann, Xuchao Zhang, and Saravan Rajmohan. 2023. Recommending root-cause and mitigation steps for cloud incidents using large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1737–1749

work page 2023
[3]

AIOps. 2025. AIOps Challenges. https://challenge.aiops.cn/home/competition

work page 2025
[4]

Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, et al. 2024. Nissist: An incident mitigation copilot based on troubleshooting guides.arXiv preprint arXiv:2402.17531(2024)

work page arXiv 2024
[5]

Nils Bars, Moritz Schloegel, Tobias Scharnowski, Nico Schiller, and Thorsten Holz. 2023. Fuzztruction: Using Fault Injection-based Fuzzing to Leverage Im- plicit Domain Knowledge. In32nd USENIX Security Symposium (USENIX Security 23). USENIX Association, Anaheim, CA, 1847–1864. https://www.usenix.org/ conference/usenixsecurity23/presentation/bars

work page 2023
[6]

CAMEL-AI. 2025. AIOps (CAMEL-AI). https://huggingface.co/datasets/camel- ai/aiops

work page 2025
[7]

camel ai. 2026. aiops-qwen-4b. https://modelscope.cn/models/camel-ai/aiops- qwen-4b. Accessed: 2026-02

work page 2026
[8]

Mingyang Chen, Linzhuang Sun, Tianpeng Li, Haoze Sun, Yijie Zhou, Chenzheng Zhu, Haofen Wang, Jeff Z Pan, Wen Zhang, Huajun Chen, et al . 2025. Learn- ing to reason with search for llms via reinforcement learning.arXiv preprint arXiv:2503.19470(2025)

work page arXiv 2025
[9]

Yuanliang Chen, Fuchen Ma, Yuanhang Zhou, Zhen Yan, and Yu Jiang. 2025. CAFault: enhance fault injection technique in practical distributed systems via abundant fault-dependent configurations. InProceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference(Boston, MA, USA)(USENIX ATC ’25). USENIX Association, USA, Article 83, 16 pages

work page 2025
[10]

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. 2025. AIOp- sLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. arXiv:2501.06706 [cs.AI] https://arxiv.org/abs/2501.06706

work page arXiv 2025
[11]

Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xuedong Gao, Hao Fan, Ming Wen, et al. 2023. Empowering practical root cause analysis by large language models for cloud incidents.arXiv preprint arXiv:2305.1577815 (2023)

work page arXiv 2023
[12]

Cloudflare. 2025. Cloudflare outage on November 18, 2025. https://blog.cloudflare. com/18-november-2025-outage/. Accessed: 2026-02

work page 2025
[13]

Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre FT Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Mor- gado, et al. 2024. Saullm-7b: A pioneering large language model for law.arXiv preprint arXiv:2403.03883(2024)

work page arXiv 2024
[14]

Mingzhe Du, Luu Anh Tuan, Bin Ji, Xiaobao Wu, Yuhao Qing, Dong Huang, Terry Yue Zhuo, Qian Liu, and See Kiong Ng. 2025. Codearena: A collective evalu- ation platform for llm code generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 502–512

work page 2025
[15]

Drishti Goel, Fiza Husain, Aditya Singh, Supriyo Ghosh, Anjaly Parayil, Chetan Bansal, Xuchao Zhang, and Saravan Rajmohan. 2024. X-lifecycle learning for cloud incident management using llms. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 417– 428

work page 2024
[16]

Ahmed Gongi. 2024. Devops_LLM. https://huggingface.co/datasets/ahmedgongi/ Devops_LLM

work page 2024
[17]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, et al . 2023. Owl: A large language model for it operations.arXiv preprint arXiv:2309.09298(2023)

work page arXiv 2023
[19]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models.ICLR1, 2 (2022), 3

work page 2022
[20]

Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Robby T Tan. 2025. Codejudgebench: Benchmarking llm-as-a-judge for coding tasks.arXiv preprint arXiv:2507.10535(2025)

work page arXiv 2025
[21]

Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, et al. 2024. Xpert: Em- powering incident management with query recommendations via large language models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

work page 2024
[22]

Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, et al. 2023. Assess and summarize: Improve outage understanding with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1657–1668

work page 2023
[23]

kim dong gyu. 2025. aiops_finetuning_qaset. https://huggingface.co/datasets/ DKCode9/aiops_finetuning_qaset

work page 2025
[24]

Chenxiao Liu, Zheyong Xie, Sirui Zhao, Jin Zhou, Tong Xu, Minglei Li, and Enhong Chen. 2024. Speak from heart: an emotion-guided LLM-based multimodal method for emotional dialogue generation. InProceedings of the 2024 International Conference on Multimedia Retrieval. 533–542

work page 2024
[25]

Yuhe Liu, Changhua Pei, Longlong Xu, Bohan Chen, Mingze Sun, Zhirui Zhang, Yongqian Sun, Shenglin Zhang, Kun Wang, Haiming Zhang, et al. 2023. Opseval: A comprehensive it operations benchmark suite for large language models.arXiv preprint arXiv:2310.07637(2023)

work page arXiv 2023
[26]

Zeyang Ma, An Ran Chen, Dong Jae Kim, Tse-Hsun Chen, and Shaowei Wang

work page
[27]

InProceedings of the IEEE/ACM 46th International Conference on Software Engineering

Llmparser: An exploratory study on using large language models for log parsing. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

work page
[28]

Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, and Richong Zhang. 2025. Easy Dataset: A Unified and Extensible Frame- work for Synthesizing LLM Fine-Tuning Data from Unstructured Documents. arXiv:2507.04009 [cs.CL] https://arxiv.org/abs/2507.04009

work page arXiv 2025
[29]

MingtaoHu. 2025. AIOps (Hu et al.). https://huggingface.co/datasets/Epiphanyi/ aiops

work page 2025
[30]

Eduardo Mosqueira-Rey, Elena Hernández-Pereira, David Alonso-Ríos, José Bobes-Bascarán, and Ángel Fernández-Leal. 2023. Human-in-the-loop machine learning: a state of the art.Artificial Intelligence Review56, 4 (2023), 3005–3054

work page 2023
[31]

OpenAI. 2025. ChatGPT. https://chatgpt.com. Accessed: 2026-02

work page 2025
[32]

CloudWise OpenSource. 2025. GAIA-DataSet. https://github.com/CloudWise- OpenSource/GAIA-DataSet. Conference’17, July 2017, Washington, DC, USA Trovato et al

work page 2025
[33]

Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, et al . 2025. Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis. In Companion Proceedings of the ACM on Web Conference 2025. 422–431

work page 2025
[34]

Luan Pham, Hongyu Zhang, Huong Ha, Flora Salim, and Xiuzhen Zhang. 2025. RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data. InCompanion Proceedings of the ACM on Web Conference 2025 (Sydney NSW, Australia)(WWW ’25). Association for Computing Machinery, New York, NY, USA, 777–780

work page 2025
[35]

phpcool. 2025. DeepSeek-R1-Distill-SRE-Qwen-32B-INT8. https://modelscope. cn/models/phpcool/DeepSeek-R1-Distill-SRE-Qwen-32B-INT8. Accessed: 2026- 02

work page 2025
[36]

phpcool. 2025. DeepSeek-R1-Distill-SRE-Qwen-7B. https://modelscope.cn/ models/phpcool/DeepSeek-R1-Distill-SRE-Qwen-7B. Accessed: 2026-02

work page 2025
[37]

Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Ro- drigo Fonseca, and Saravan Rajmohan. 2024. Exploring llm-based agents for root cause analysis. InCompanion proceedings of the 32nd ACM international conference on the foundations of software engineering. 208–219

work page 2024
[38]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework.arXiv preprint arXiv: 2409.19256(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum learning: A survey.International Journal of Computer Vision130, 6 (2022), 1526– 1565

work page 2022
[41]

StabilityLab. 2025. zhiyu. https://modelscope.cn/models/StabilityLab/zhiyu. Accessed: 2026-02

work page 2025
[42]

ITBench Team. 2025. ITBench Agent Trajectories: LLM Agent Executions for IT Automation Tasks. https://huggingface.co/datasets/itbench/itbench-trajectories

work page 2025
[43]

Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. https://qwenlm. github.io/blog/qwen2.5/

work page 2024
[44]

Rupa Vemu. 2025. AIOps_training. https://huggingface.co/datasets/rupa99/ AIOps_training

work page 2025
[45]

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. 2024. Interpretable preferences via multi-objective reward modeling and mixture-of- experts.arXiv preprint arXiv:2406.12845(2024)

work page arXiv 2024
[46]

Jianxun Wang and Yixiang Chen. 2023. A review on code generation with llms: Application and evaluation. In2023 IEEE International Conference on Medical Artificial Intelligence (MedAI). IEEE, 284–289

work page 2023
[47]

Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, and Yongxiang Li

work page
[48]

Emotional support with llm-based empathetic dialogue generation.arXiv preprint arXiv:2507.12820(2025)

work page arXiv 2025
[49]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. Internvl3. 5: Ad- vancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä

Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä. 2025. Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning.Proc. ACM Softw. Eng.2, FSE, Article FSE027 (June 2025), 23 pages

work page 2025
[51]

Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, and Qi Zhang. 2025. OpenRCA: Can large language models locate the root cause of software failures?. InThe Thirteenth International Conference on Learning Representations

work page 2025
[52]

Zihao Yi, Jiarui Ouyang, Zhe Xu, Yuwen Liu, Tianhao Liao, Haohao Luo, and Ying Shen. 2024. A survey on recent advances in llm-based multi-turn dialogue systems.Comput. Surveys(2024)

work page 2024
[53]

Guangba Yu, Pengfei Chen, Zilong He, Qiuyu Yan, Yu Luo, Fangyuan Li, and Zibin Zheng. 2024. ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems.Proc. ACM Softw. Eng.1, FSE, Article 2 (July 2024), 23 pages

work page 2024
[54]

Xuchao Zhang, Supriyo Ghosh, Chetan Bansal, Rujia Wang, Minghua Ma, Yu Kang, and Saravan Rajmohan. 2024. Automated root causing of cloud incidents using in-context learning with GPT-4. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 266–277

work page 2024
[55]

Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, et al . 2025. Stabilizing Rein- forcement Learning with LLMs: Formulation and Practices.arXiv preprint arXiv:2512.01374(2025)

work page arXiv 2025
[56]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thaila...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Aoxiao Zhong, Dengyao Mo, Guiyang Liu, Jinbu Liu, Qingda Lu, Qi Zhou, Jiesh- eng Wu, Quanzheng Li, and Qingsong Wen. 2024. Logparser-llm: Advancing efficient log parsing with large language models. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4559–4570

work page 2024
[58]

Li Zhong and Zilong Wang. 2024. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21841–21849

work page 2024
[59]

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al

work page
[60]

Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024). A Implementation Details A.1 Representation Fusion This appendix provides additional implementation details of the multi-modal representation fusion process described in Section 3.2, which are omitted from the main text fo...

work page internal anchor Pith review arXiv 2024