OpenRCA 2.0: From Outcome Labels to Causal Process Supervision

Aoyang Fang; Boxi Yu; Jin'ao Shang; Junjielung Xu; Pinjia He; Qisheng Lu; Rui Wang; Songhan Zhang; Yifan Yang; Yuzhong Zhang

arxiv: 2606.27154 · v1 · pith:766HXIZBnew · submitted 2026-06-25 · 💻 cs.AI

OpenRCA 2.0: From Outcome Labels to Causal Process Supervision

Aoyang Fang , Yifan Yang , Jin'ao Shang , Qisheng Lu , Junjielung Xu , Rui Wang , Songhan Zhang , Yuzhong Zhang

show 2 more authors

Boxi Yu Pinjia He

This is my paper

Pith reviewed 2026-06-26 04:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords root cause analysisLLM agentscausal propagationbenchmark datasetfault injectionstep-wise supervisionoutcome labels

0 comments

The pith

Step-wise causal annotations reveal LLM agents ground correct root-cause services in verified paths in only 61.5 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing root cause analysis datasets provide only outcome labels for the root cause itself, which reduces the task to pattern matching rather than requiring agents to trace how a fault propagates to the observed symptom. The paper introduces the PAVE labeling protocol that uses known interventions from fault injection experiments to reconstruct full causal propagation paths via forward verification. This produces OpenRCA 2.0, a 500-instance benchmark with step-wise causal annotations across systems. Evaluation of 11 frontier LLMs shows exact root-cause set recovery succeeds in 20.7 percent of cases on average. Agents name at least one correct root-cause service in 76 percent of cases but successfully ground that service in a verified causal path in only 61.5 percent, showing that outcome-only evaluation conceals the ungrounded diagnosis failure mode.

Core claim

The PAVE protocol reconstructs causal propagation paths from known fault injection interventions using forward verification from cause to effect rather than backward inference from symptoms. Applying it creates OpenRCA 2.0, the first cross-system RCA benchmark with step-wise causal annotations. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7 percent of cases. Agents identify at least one correct root-cause service in 76.0 percent of cases but ground that service in a verified causal propagation path to the observed symptom in only 61.5 percent. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is required for trustworthy LLM-bas

What carries the argument

The PAVE protocol, which reconstructs causal propagation paths using known interventions from fault injection experiments through forward verification from cause to effect.

If this is right

Exact recovery of the complete root-cause set occurs in roughly one-fifth of cases across current frontier models.
The 14.5 percentage point gap between naming a correct service and verifying its causal path to the symptom persists under relaxed criteria.
Outcome-only benchmarks systematically underestimate the difficulty of producing reliable RCA diagnoses.
Agent training and evaluation must incorporate supervision on intermediate causal propagation steps rather than final outcomes alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If path reconstruction proves harder than service identification, methods that explicitly supervise intermediate reasoning steps during training could narrow the observed grounding gap.
The same distinction between naming a correct entity and tracing its causal role may appear in other multi-step diagnostic domains such as medical or network troubleshooting.
The low exact-match rate suggests that simply scaling model size without changes to supervision signals is unlikely to close the performance shortfall on path-grounded RCA.

Load-bearing premise

The PAVE protocol accurately captures complete and unbiased causal structures without missing paths or artifacts from the injection process.

What would settle it

A direct comparison in which the same LLM agents are evaluated on identical RCA instances once with only outcome labels and once with the requirement to output the full verified causal path, checking whether the 14.5-point gap between service identification and path grounding disappears.

Figures

Figures reproduced from arXiv: 2606.27154 by Aoyang Fang, Boxi Yu, Jin'ao Shang, Junjielung Xu, Pinjia He, Qisheng Lu, Rui Wang, Songhan Zhang, Yifan Yang, Yuzhong Zhang.

**Figure 1.** Figure 1: An ungrounded diagnosis on the running NetworkDelay failure in Seat service. From the known intervention (top), PAVE reconstructs the verified causal path Seat→Travel→Order→Gateway (middle). The agent names the correct root cause but produces a graph that skips Travel (bottom): outcome-only evaluation scores this as a success, whereas processlevel evaluation surfaces the missing edge. into the agent’s pr… view at source ↗

**Figure 2.** Figure 2: The RCA setting. (a) A microservice system forms a two-layer dependency graph. A fault injected at Seat cascades along the RPC chain (Seat → Travel → Order) and may also propagate vertically through shared infrastructure (e.g., co-located pods). (b) The agent observes only telemetry (traces, metrics, logs) and must reason backward to the originating fault; the intervention itself is hidden. set Π∗ . Phase … view at source ↗

**Figure 3.** Figure 3: Coarse-to-Fine verification refines a cluttered observation graph into one verified causal chain. Running TrainTicket example: the true root cause is Seat, while Route is a benign service that appears anomalous due to background noise. Input. Every observed anomaly and every potential edge are kept, so the root cause cannot be picked out. Phase 1 (Structural Pruning). Edges marked × are removed because the… view at source ↗

**Figure 4.** Figure 4: Spurious causal reasoning case. (a) Ground truth shows fault propagation through inter [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a new RCA benchmark with step-wise causal path labels via PAVE, but the label construction details are too thin to fully trust the reported gaps.

read the letter

The main takeaway is that OpenRCA 2.0 supplies the first cross-system RCA dataset with explicit causal propagation paths instead of just root-cause labels. On 11 frontier LLMs it shows exact root-cause set recovery at 20.7 percent, with a drop from 76 percent service identification to 61.5 percent when the diagnosis must be grounded in a verified path.

What the work does is useful: it demonstrates that outcome-only scoring hides a real failure mode in multi-step causal reasoning. The forward-verification idea from known fault injections is a reasonable way to build those paths, and the distinction between ungrounded and grounded diagnoses is worth keeping in future agent evaluations.

The soft spot is the dataset itself. The abstract gives no numbers on inter-annotator agreement, no description of how many paths might be missing because they were never injected, and no cross-check against observational data or expert review. If the injection process creates its own artifacts or leaves real edges out, the 14.5-point gap between service hit and grounded path becomes harder to interpret. That concern is not fatal but it is load-bearing for the central claim.

This paper is for people who build or evaluate LLM agents on system reliability tasks and for benchmark designers who want stricter causal tests. Readers working on agentic reasoning will get value from the labeling protocol even if they end up re-validating the instances.

It deserves peer review. The idea of process supervision over outcome labels is worth community discussion, and the empirical numbers are concrete enough to debate once the construction details are filled in.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the PAVE protocol, which reconstructs causal propagation paths via forward verification from known fault-injection interventions, to create OpenRCA 2.0—a 500-instance cross-system RCA benchmark with step-wise causal annotations. It evaluates 11 frontier LLMs and reports that exact root-cause set recovery averages 20.7%, while agents identify at least one correct root-cause service in 76.0% of cases but ground that service in a verified causal path in only 61.5%, arguing that outcome-only labels hide this failure mode.

Significance. If the ground-truth paths are reliable, the work supplies a concrete, falsifiable demonstration that current LLM agents frequently produce ungrounded diagnoses in RCA settings and supplies the first publicly described step-wise causal benchmark for the task. The distinction between service identification and path grounding is a useful diagnostic lens for agent evaluation.

major comments (2)

[Abstract] Abstract: the central performance gap (76.0 % service hit vs. 61.5 % grounded path) is presented as evidence of a distinct failure mode, yet the manuscript supplies no quantitative check (e.g., comparison against observational causal discovery, expert review, or held-out interventions) that the PAVE-reconstructed paths are complete and free of injection-specific artifacts. This validation is load-bearing for interpreting the gap as a property of the models rather than of the reference.
[Dataset section] Dataset section: no information is provided on inter-annotator agreement for the causal-path labels, the selection criteria used to arrive at the final 500 instances, or the distribution of systems and fault types. These details are required to assess whether the reported averages are robust to selection effects or label noise.

minor comments (1)

[Abstract] The abstract states results to one decimal place but does not indicate whether the 500 instances are balanced across systems; adding this context would improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our results. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance gap (76.0 % service hit vs. 61.5 % grounded path) is presented as evidence of a distinct failure mode, yet the manuscript supplies no quantitative check (e.g., comparison against observational causal discovery, expert review, or held-out interventions) that the PAVE-reconstructed paths are complete and free of injection-specific artifacts. This validation is load-bearing for interpreting the gap as a property of the models rather than of the reference.

Authors: The PAVE protocol reconstructs paths via forward verification from known fault-injection interventions, providing an interventional (not observational) basis that directly tests each propagation step from cause to observed symptom. This design is intended to ensure completeness and minimize injection-specific artifacts. We will revise the abstract and methods to more explicitly articulate how the forward-verification mechanism itself serves as the quantitative check and to include any expert review or held-out validation steps performed. We view this as a clarification rather than a fundamental change to the protocol. revision: partial
Referee: [Dataset section] Dataset section: no information is provided on inter-annotator agreement for the causal-path labels, the selection criteria used to arrive at the final 500 instances, or the distribution of systems and fault types. These details are required to assess whether the reported averages are robust to selection effects or label noise.

Authors: The referee correctly notes that these details are absent from the current Dataset section. We will expand the section to report inter-annotator agreement statistics for the causal-path labels, describe the selection criteria applied to reach the final 500 instances, and provide the distribution of systems and fault types. These additions will allow readers to evaluate robustness to selection effects and label noise. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper constructs OpenRCA 2.0 by applying the PAVE protocol (forward verification from known fault-injection interventions) to produce step-wise causal labels, then measures LLM performance directly on the resulting 500 instances. No equations, fitted parameters, self-citations used as load-bearing premises, or derivations appear in the provided text. All reported figures (20.7 %, 76.0 %, 61.5 %) are raw empirical counts on the newly labeled data; the central claims do not reduce to prior quantities by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on the domain assumption that fault injection experiments yield complete causal paths and on the newly introduced PAVE protocol and OpenRCA 2.0 dataset as the evaluation substrate.

axioms (1)

domain assumption Fault injection experiments provide complete and accurate causal propagation paths without missing links or injection artifacts.
Invoked to justify the forward verification mechanism in PAVE.

invented entities (2)

PAVE protocol no independent evidence
purpose: Step-wise labeling protocol to reconstruct causal paths from fault injections
Newly defined method for creating the benchmark annotations.
OpenRCA 2.0 dataset no independent evidence
purpose: 500-instance cross-system RCA benchmark with causal annotations
Constructed via PAVE and used for the LLM evaluations.

pith-pipeline@v0.9.1-grok · 5794 in / 1371 out tokens · 59194 ms · 2026-06-26T04:26:11.989266+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 1 canonical work pages

[1]

Openrca: Can large language models locate the root cause of software failures? InThe Thirteenth International Conference on Learning Representations, 2025

Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, and Qi Zhang. Openrca: Can large language models locate the root cause of software failures? InThe Thirteenth International Conference on Learning Representations, 2025

2025
[2]

Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds. InMLSys, 2025. URL https:// openreview.net/forum?id=3EXBLwGxtq

2025
[3]

Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009v2, 05 2025

Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009v2, 05 2025. URL https://arxiv.org/abs/2506.02009v2

arXiv 2025
[4]

Thinkfl: Self-refining failure localization for microservice systems via reinforcement fine-tuning.ACM Transactions on Software Engineering and Methodology,

Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Siyu Yu, Jinyang Gao, Bolin Ding, Zhonghai Wu, and Ying Li. Thinkfl: Self-refining failure localization for microservice systems via reinforcement fine-tuning.ACM Transactions on Software Engineering and Methodology,
[5]

URLhttps://doi.org/10.1145/3789262

doi: 10.1145/3789262. URLhttps://doi.org/10.1145/3789262

work page doi:10.1145/3789262
[6]

Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study.IEEE Transactions on Software Engineering, 47(2):243–260, 2018

Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study.IEEE Transactions on Software Engineering, 47(2):243–260, 2018

2018
[7]

Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data

Luan Pham, Hongyu Zhang, Huong Ha, Flora Salim, and Xiuzhen Zhang. Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data. InCompanion Proceedings of the ACM on Web Conference 2025, pages 777–780, 2025

2025
[8]

Rethinking the evaluation of microservice rca with a fault propagation-aware benchmark.arXiv preprint arXiv:2510.04711v2, 10 2025

Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, and Pinjia He. Rethinking the evaluation of microservice rca with a fault propagation-aware benchmark.arXiv preprint arXiv:2510.04711v2, 10 2025. URL https://arxiv.org/abs/2510.04711v2

arXiv 2025
[9]

Cambridge university press, 2009

Judea Pearl.Causality. Cambridge university press, 2009

2009
[10]

Self-play only evolves when self-synthetic pipeline ensures learnable information gain.arXiv preprint arXiv:2603.02218v1, 02 2026

Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain.arXiv preprint arXiv:2603.02218v1, 02 2026. URL https://arxiv.org/abs/2603.02218v1

Pith/arXiv arXiv 2026
[11]

https://github.com/delimitrou/DeathStarBench/tree/master,

Deathstarbench. https://github.com/delimitrou/DeathStarBench/tree/master,
[12]

Accessed: 2026-02-10

2026
[13]

https://github.com/open-telemetry/opentelemetry-demo,

Opentelemetry demo. https://github.com/open-telemetry/opentelemetry-demo,
[14]

Accessed: 2026-05-05

2026
[15]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022
[16]

Self-balancing agentic AI: Test-time diffusion and context engineering re-imagined for deep research.https://github.com/thinkdepthai/Deep_Research, 2025

Paichun Lin. Self-balancing agentic AI: Test-time diffusion and context engineering re-imagined for deep research.https://github.com/thinkdepthai/Deep_Research, 2025

2025
[17]

Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments

Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. InProceedings of the Web Conference 2021, pages 3087–3098, 2021. 10

2021
[18]

Dynacausal: Dynamic causality-aware root cause analysis for distributed microservices.arXiv preprint arXiv:2510.22613v1, 10 2025

Songhan Zhang, Aoyang Fang, Yifan Yang, Ruiyi Cheng, Xiaoying Tang, and Pinjia He. Dynacausal: Dynamic causality-aware root cause analysis for distributed microservices.arXiv preprint arXiv:2510.22613v1, 10 2025. URLhttps://arxiv.org/abs/2510.22613v1

arXiv 2025
[19]

MIT press, 2000

Peter Spirtes, Clark N Glymour, and Richard Scheines.Causation, prediction, and search. MIT press, 2000

2000
[20]

Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems, 35:31158–31170, 2022

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems, 35:31158–31170, 2022

2022
[21]

Root cause analysis of anomalies in multivariate time series through granger causal discovery

Xiao Han, Saima Absar, Lu Zhang, and Shuhan Yuan. Root cause analysis of anomalies in multivariate time series through granger causal discovery. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[22]

Constructing large-scale real-world benchmark datasets for aiops.arXiv preprint arXiv:2208.03938v1, 08 2022

Zeyan Li, Nengwen Zhao, Shenglin Zhang, Yongqian Sun, Pengfei Chen, Xidao Wen, Minghua Ma, and Dan Pei. Constructing large-scale real-world benchmark datasets for aiops.arXiv preprint arXiv:2208.03938v1, 08 2022. URLhttps://arxiv.org/abs/2208.03938v1

arXiv 2022
[23]

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 553–565, 2023

2023
[24]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

2023
[25]

Solving math word problems with process- and outcome-based feedback.arXiv preprint arXiv:2211.14275v1, 11 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback.arXiv preprint arXiv:2211.14275v1, 11 2022. URL https://arxiv.org/abs/2211.14275v1

Pith/arXiv arXiv 2022
[26]

Versaprm: Multi-domain process reward model via synthetic reasoning data

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, et al. Versaprm: Multi-domain process reward model via synthetic reasoning data. InForty-second International Conference on Machine Learning
[27]

Dynamic and generalizable process reward modeling

Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuan-Jing Huang. Dynamic and generalizable process reward modeling. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4203–4233, 2025

2025
[28]

Applicable Levels

Yibo Yan, Jiamin Su, Jianxiang He, Fangteng Fu, Xu Zheng, Yuanhuiyi Lyu, Kun Wang, Shen Wang, Qingsong Wen, and Xuming Hu. A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11798–11827, 2025. 11 Appendix organization...

2025
[29]

query_parquet_files: DuckDB SQL on parquets in this case dir
[30]

list_tables_in_directory: list parquets
[31]

get_schema: column types of a parquet
[32]

## Hard limits •Tool-call budget: aim for∼50 calls; extend if the evidence genuinely warrants it

think_tool: REQUIRED after each query; summarize, plan next step. ## Hard limits •Tool-call budget: aim for∼50 calls; extend if the evidence genuinely warrants it. Hard cap is 100, at which point the runtime forces a stop. •Spend the budget efficiently: list_tables_in_directory once, get_schema on the files you actually plan to query, then spend the rest ...
[33]

list_tables_in_directory to confirm the parquet files
[34]

get_schema on the relevant ones (start with abnormal_traces)
[35]

Diff abnormal vs normal: error rates, latency, status codes, log levels
[36]

Trace the call chain (parent_span_id→span_id) to find the earliest service whose own work, not its dependency’s, went wrong
[37]

Agent Contract

Decide every root cause and every propagation edge. More than one root cause is possible; note each separately when evidence supports it. USER: RCA_ANAL YSIS_UP {incident_description} 24 F.4.2 Synthesis Phase SYSTEM: COMPRESS_FINDINGS_SP You are an RCA synthesizer. Today’s date is {date}. Your job: convert the investigation messages above into a single ST...

[1] [1]

Openrca: Can large language models locate the root cause of software failures? InThe Thirteenth International Conference on Learning Representations, 2025

Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, and Qi Zhang. Openrca: Can large language models locate the root cause of software failures? InThe Thirteenth International Conference on Learning Representations, 2025

2025

[2] [2]

Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds

Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, and Saravan Rajmohan. Aiopslab: A holistic framework to evaluate ai agents for enabling autonomous clouds. InMLSys, 2025. URL https:// openreview.net/forum?id=3EXBLwGxtq

2025

[3] [3]

Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009v2, 05 2025

Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. Stratus: A multi-agent system for autonomous reliability engineering of modern clouds.arXiv preprint arXiv:2506.02009v2, 05 2025. URL https://arxiv.org/abs/2506.02009v2

arXiv 2025

[4] [4]

Thinkfl: Self-refining failure localization for microservice systems via reinforcement fine-tuning.ACM Transactions on Software Engineering and Methodology,

Lingzhe Zhang, Yunpeng Zhai, Tong Jia, Chiming Duan, Siyu Yu, Jinyang Gao, Bolin Ding, Zhonghai Wu, and Ying Li. Thinkfl: Self-refining failure localization for microservice systems via reinforcement fine-tuning.ACM Transactions on Software Engineering and Methodology,

[5] [5]

URLhttps://doi.org/10.1145/3789262

doi: 10.1145/3789262. URLhttps://doi.org/10.1145/3789262

work page doi:10.1145/3789262

[6] [6]

Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study.IEEE Transactions on Software Engineering, 47(2):243–260, 2018

Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study.IEEE Transactions on Software Engineering, 47(2):243–260, 2018

2018

[7] [7]

Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data

Luan Pham, Hongyu Zhang, Huong Ha, Flora Salim, and Xiuzhen Zhang. Rcaeval: A bench- mark for root cause analysis of microservice systems with telemetry data. InCompanion Proceedings of the ACM on Web Conference 2025, pages 777–780, 2025

2025

[8] [8]

Rethinking the evaluation of microservice rca with a fault propagation-aware benchmark.arXiv preprint arXiv:2510.04711v2, 10 2025

Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, and Pinjia He. Rethinking the evaluation of microservice rca with a fault propagation-aware benchmark.arXiv preprint arXiv:2510.04711v2, 10 2025. URL https://arxiv.org/abs/2510.04711v2

arXiv 2025

[9] [9]

Cambridge university press, 2009

Judea Pearl.Causality. Cambridge university press, 2009

2009

[10] [10]

Self-play only evolves when self-synthetic pipeline ensures learnable information gain.arXiv preprint arXiv:2603.02218v1, 02 2026

Wei Liu, Siya Qi, Yali Du, and Yulan He. Self-play only evolves when self-synthetic pipeline ensures learnable information gain.arXiv preprint arXiv:2603.02218v1, 02 2026. URL https://arxiv.org/abs/2603.02218v1

Pith/arXiv arXiv 2026

[11] [11]

https://github.com/delimitrou/DeathStarBench/tree/master,

Deathstarbench. https://github.com/delimitrou/DeathStarBench/tree/master,

[12] [12]

Accessed: 2026-02-10

2026

[13] [13]

https://github.com/open-telemetry/opentelemetry-demo,

Opentelemetry demo. https://github.com/open-telemetry/opentelemetry-demo,

[14] [14]

Accessed: 2026-05-05

2026

[15] [15]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

2022

[16] [16]

Self-balancing agentic AI: Test-time diffusion and context engineering re-imagined for deep research.https://github.com/thinkdepthai/Deep_Research, 2025

Paichun Lin. Self-balancing agentic AI: Test-time diffusion and context engineering re-imagined for deep research.https://github.com/thinkdepthai/Deep_Research, 2025

2025

[17] [17]

Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments

Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. InProceedings of the Web Conference 2021, pages 3087–3098, 2021. 10

2021

[18] [18]

Dynacausal: Dynamic causality-aware root cause analysis for distributed microservices.arXiv preprint arXiv:2510.22613v1, 10 2025

Songhan Zhang, Aoyang Fang, Yifan Yang, Ruiyi Cheng, Xiaoying Tang, and Pinjia He. Dynacausal: Dynamic causality-aware root cause analysis for distributed microservices.arXiv preprint arXiv:2510.22613v1, 10 2025. URLhttps://arxiv.org/abs/2510.22613v1

arXiv 2025

[19] [19]

MIT press, 2000

Peter Spirtes, Clark N Glymour, and Richard Scheines.Causation, prediction, and search. MIT press, 2000

2000

[20] [20]

Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems, 35:31158–31170, 2022

Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, and Murat Kocaoglu. Root cause analysis of failures in microservices through causal discovery.Advances in Neural Information Processing Systems, 35:31158–31170, 2022

2022

[21] [21]

Root cause analysis of anomalies in multivariate time series through granger causal discovery

Xiao Han, Saima Absar, Lu Zhang, and Shuhan Yuan. Root cause analysis of anomalies in multivariate time series through granger causal discovery. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[22] [22]

Constructing large-scale real-world benchmark datasets for aiops.arXiv preprint arXiv:2208.03938v1, 08 2022

Zeyan Li, Nengwen Zhao, Shenglin Zhang, Yongqian Sun, Pengfei Chen, Xidao Wen, Minghua Ma, and Dan Pei. Constructing large-scale real-world benchmark datasets for aiops.arXiv preprint arXiv:2208.03938v1, 08 2022. URLhttps://arxiv.org/abs/2208.03938v1

arXiv 2022

[23] [23]

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 553–565, 2023

2023

[24] [24]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

2023

[25] [25]

Solving math word problems with process- and outcome-based feedback.arXiv preprint arXiv:2211.14275v1, 11 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback.arXiv preprint arXiv:2211.14275v1, 11 2022. URL https://arxiv.org/abs/2211.14275v1

Pith/arXiv arXiv 2022

[26] [26]

Versaprm: Multi-domain process reward model via synthetic reasoning data

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, et al. Versaprm: Multi-domain process reward model via synthetic reasoning data. InForty-second International Conference on Machine Learning

[27] [27]

Dynamic and generalizable process reward modeling

Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuan-Jing Huang. Dynamic and generalizable process reward modeling. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4203–4233, 2025

2025

[28] [28]

Applicable Levels

Yibo Yan, Jiamin Su, Jianxiang He, Fangteng Fu, Xu Zheng, Yuanhuiyi Lyu, Kun Wang, Shen Wang, Qingsong Wen, and Xuming Hu. A survey of mathematical reasoning in the era of multimodal large language model: Benchmark, method & challenges. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11798–11827, 2025. 11 Appendix organization...

2025

[29] [29]

query_parquet_files: DuckDB SQL on parquets in this case dir

[30] [30]

list_tables_in_directory: list parquets

[31] [31]

get_schema: column types of a parquet

[32] [32]

## Hard limits •Tool-call budget: aim for∼50 calls; extend if the evidence genuinely warrants it

think_tool: REQUIRED after each query; summarize, plan next step. ## Hard limits •Tool-call budget: aim for∼50 calls; extend if the evidence genuinely warrants it. Hard cap is 100, at which point the runtime forces a stop. •Spend the budget efficiently: list_tables_in_directory once, get_schema on the files you actually plan to query, then spend the rest ...

[33] [33]

list_tables_in_directory to confirm the parquet files

[34] [34]

get_schema on the relevant ones (start with abnormal_traces)

[35] [35]

Diff abnormal vs normal: error rates, latency, status codes, log levels

[36] [36]

Trace the call chain (parent_span_id→span_id) to find the earliest service whose own work, not its dependency’s, went wrong

[37] [37]

Agent Contract

Decide every root cause and every propagation edge. More than one root cause is possible; note each separately when evidence supports it. USER: RCA_ANAL YSIS_UP {incident_description} 24 F.4.2 Synthesis Phase SYSTEM: COMPRESS_FINDINGS_SP You are an RCA synthesizer. Today’s date is {date}. Your job: convert the investigation messages above into a single ST...