Pooled Leaderboards Hide System-Specific Winners: A Reporting-Protocol Audit of Offline Root-Cause Analysis Benchmarks

Lining Hu; Ting Liu; Yuzhuo Fu

arxiv: 2606.29159 · v1 · pith:H74STC72new · submitted 2026-06-28 · 💻 cs.AI

Pith reviewed 2026-06-30 07:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords: root cause analysis, benchmarks, leaderboards, pooled rankings, subsystem variation, offline evaluation, interaction effects

The pith

Pooled leaderboards in root-cause analysis benchmarks can select a method that loses on up to 5 of 11 individual subsystems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits the common practice of ranking RCA methods by one pooled top-1 accuracy number across multiple subsystems and treating that winner as the recommendation for any given subsystem. On three public benchmark families covering 11 subsystems and 778 matched cases, it keeps only the four methods with complete coverage and runs all pairwise comparisons. Every pair shows performance effects that flip sign across subsystems, every random-effects interval crosses zero, and interaction tests reject the idea that the methods are exchangeable in five of six pairs. Selecting the pooled winner and applying it to a held-out subsystem picks the inferior method on as many as five subsystems, with accuracy regret reaching 24.8 points.

Core claim

Pooled top-1 accuracy across subsystems does not identify a method that wins on any particular subsystem; all six pairwise comparisons exhibit subsystem-level effects of both signs, every random-effects 95% prediction interval crosses zero, interaction tests reject exchangeability in five of six pairs, and leave-one-system-out selection incurs regret up to 24.8 percentage points on held-out subsystems.

What carries the argument

Matched scoring units across subsystems, random-effects meta-analysis, case-level interaction tests for non-exchangeability, and leave-one-system-out validation to quantify selection regret.
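The first ingredient, the paired per-subsystem effect, can be sketched as follows. This is a minimal illustration, not the paper's released audit module: the function names, the row layout `(subsystem, a_correct, b_correct)`, and the toy data are all assumptions made for the example. It computes the pooled difference in top-1 accuracy (in percentage points, method A minus method B) next to the same difference within each subsystem, then checks whether the per-subsystem effects take both signs.

```python
from collections import defaultdict


def paired_effects(rows):
    """Pooled and per-subsystem accuracy difference (pp), oriented A minus B.

    rows: iterable of (subsystem, a_correct, b_correct), one per matched case,
    where a_correct and b_correct are 0/1 top-1 hits (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0, 0])  # subsystem -> [n, hits_a, hits_b]
    for subsystem, a_correct, b_correct in rows:
        c = counts[subsystem]
        c[0] += 1
        c[1] += a_correct
        c[2] += b_correct
    per_system = {s: 100.0 * (a - b) / n for s, (n, a, b) in counts.items()}
    total_n = sum(n for n, _, _ in counts.values())
    total_diff = sum(a - b for _, a, b in counts.values())
    return 100.0 * total_diff / total_n, per_system


def has_sign_flip(per_system):
    """True when at least one subsystem favors A and at least one favors B."""
    values = list(per_system.values())
    return any(v > 0 for v in values) and any(v < 0 for v in values)


# Toy data (invented): A wins clearly on sys1, B wins on sys2.
rows = (
    [("sys1", 1, 0)] * 6 + [("sys1", 0, 0)] * 4
    + [("sys2", 0, 1)] * 2 + [("sys2", 1, 1)] * 8
)
pooled, per_system = paired_effects(rows)
flip = has_sign_flip(per_system)
```

In the toy data the pooled effect favors A even though B is ahead on `sys2`, which is the pattern the audit looks for.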

If this is right

Engineers cannot safely treat a pooled leaderboard winner as the method to deploy on their own subsystem.
Per-subsystem stability checks are required before generalizing a ranking.
Reporting protocols should include both pooled scores and the per-subsystem variation metrics shown here.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pooled-versus-specific mismatch is likely to appear in other benchmark domains that aggregate across heterogeneous test environments.
Benchmarks could adopt the released 320-line audit module as a standard post-processing step to surface these effects automatically.
The regret number offers a practical way to set a minimum acceptable stability threshold when choosing among methods.

Load-bearing premise

The four methods kept for full coverage represent the broader space of RCA techniques and the 778 matched units allow unbiased pairwise comparisons without differential missingness across subsystems.
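One way to probe this premise is to tabulate each method's coverage per subsystem before filtering. The sketch below is illustrative only; the input layout (`scores` keyed by `(method, case_id)` and `cases` mapping `case_id` to subsystem) and the function names are assumptions, not the format used by the paper's module. It reports, for every method, the fraction of each subsystem's cases it scored, and lists the methods that would survive a complete-coverage filter.

```python
def coverage_by_subsystem(scores, cases):
    """Fraction of each subsystem's cases that each method has a score for.

    scores: {(method, case_id): 0 or 1}   (hypothetical layout)
    cases:  {case_id: subsystem}
    """
    totals = {}
    for subsystem in cases.values():
        totals[subsystem] = totals.get(subsystem, 0) + 1
    methods = sorted({method for method, _ in scores})
    coverage = {}
    for method in methods:
        seen = {s: 0 for s in totals}
        for case_id, subsystem in cases.items():
            if (method, case_id) in scores:
                seen[subsystem] += 1
        coverage[method] = {s: seen[s] / totals[s] for s in totals}
    return coverage


def complete_coverage_methods(coverage):
    """Methods scored on every case of every subsystem."""
    return [m for m, by_sys in coverage.items()
            if all(frac == 1.0 for frac in by_sys.values())]


# Toy data (invented): m2 has no scores on sysB.
cases = {"c1": "sysA", "c2": "sysA", "c3": "sysB"}
scores = {
    ("m1", "c1"): 1, ("m1", "c2"): 0, ("m1", "c3"): 1,
    ("m2", "c1"): 1, ("m2", "c2"): 1,
}
coverage = coverage_by_subsystem(scores, cases)
kept = complete_coverage_methods(coverage)
```

A coverage table in which excluded methods are missing on only some subsystems would be a signal that missingness may correlate with subsystem traits.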

What would settle it

A new RCA benchmark in which the method with the single highest pooled score also records the highest score on every individual subsystem would contradict the central claim.
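The settling condition can be stated as a small check. This is a sketch under assumptions: the `hits` layout and function name are hypothetical, all methods are assumed to be scored on the same matched cases (so hit counts can be compared directly), ties on a subsystem are counted as not beaten, and a pooled tie is broken by dictionary order.

```python
def pooled_winner_dominates(hits):
    """Return (pooled_winner, subsystems where some other method beats it).

    hits: {method: {subsystem: number of correct top-1 predictions}},
    with every method scored on the same matched cases (assumption).
    """
    pooled = {m: sum(by_sys.values()) for m, by_sys in hits.items()}
    winner = max(pooled, key=pooled.get)
    beaten_on = [
        s for s in hits[winner]
        if any(hits[m][s] > hits[winner][s] for m in hits)
    ]
    return winner, beaten_on


# Toy data (invented): A wins the pooled count but loses on s3.
hits = {
    "A": {"s1": 8, "s2": 7, "s3": 2},
    "B": {"s1": 4, "s2": 5, "s3": 6},
}
winner, beaten_on = pooled_winner_dominates(hits)
```

An empty `beaten_on` list for a benchmark would be the counterexample described above.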

Figures

Figures reproduced from arXiv: 2606.29159 by Lining Hu, Ting Liu, Yuzhuo Fu.

**Figure 1.** A pooled leaderboard can hide subsystem-level rankings. Left: pooled top-1 accuracy across all 11 audited subsystems (778 cases) for two BARO-centered comparisons. Right: the same comparisons decomposed by subsystem, showing the paired effect ∆s (in pp acc@1, oriented as BARO minus comparator). Both comparisons reverse sign across subsystems: BARO is the pooled loser against max-|Z| but scores higher on Ba…

**Figure 2.** Complete pairwise comparison set over the four methods that satisfy the full-coverage inclusion criterion. Rows are the six pairwise comparisons among BARO, max-|Z|, alert-count, and CD-1min; columns are the 11 audited subsystems. Each cell is the paired per-system effect ∆s in pp acc@1, oriented as row method minus second-named method (e.g., for BARO vs max-|Z|, ∆s = BARO − max-|Z|). Every method pair ha…

**Figure 3.** Leave-one-system-out selection regret on the two motivating BARO-centered comparisons. Bars show the per-subsystem regret incurred when the pooled recommendation from the other ten systems is applied to the held-out subsystem. Subsystems without a selection reversal have zero regret by definition. Bars with black outlines mark subsystems with selection reversals; light-gray placeholder bars mark subsystem…

**Figure 4.** CIRCA modal-output collapse across 11 audited subsystems. Rows are subsystems grouped by benchmark family; columns report CIRCA’s modal-prediction frequency (red colormap; high = collapsed to a single output, bad) and top-1 accuracy (green colormap; high = good), each shown before and after the RHT schema-repair adapter. On all four OpenRCA subsystems the adapter drops modal_freq from 100% to ≤ 25% and CIR…

Original abstract

Offline root-cause-analysis (RCA) benchmarks commonly rank methods by a single pooled top-1 accuracy across multiple subsystems, and engineers often read the pooled winner as a recommendation for their own subsystem. We audit that reading on three public RCA benchmark families (OpenRCA, RCAEval, and PetShop) covering 11 subsystems and 778 matched scoring units. To keep pairwise comparisons on identical cases, the main analysis retains four methods or comparators with complete coverage: BARO, a CD-1min adapter, max-|Z|, and per-service alert-count. All six pairwise comparisons show subsystem-level effects of both signs, every random-effects 95% prediction interval crosses zero, and case-level interaction tests reject exchangeability in 5 of 6 pairs. Leave-one-system-out selection picks the lower-scoring method on up to 5 of 11 held-out subsystems, with regret reaching 24.8 pp on RCAEval / Sock-Shop. We release a 320-line audit module; given a matched RCA benchmark score table, it recomputes the same per-subsystem stability checks alongside pooled scores.
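The leave-one-system-out selection described in the abstract can be sketched for a single method pair. This is an illustration under stated assumptions, not the released audit module: the function name, the `cases_by_system` layout, the tie rule (ties in the pooled count go to method A), and the toy data are all invented for the example. For each held-out subsystem, the method with more pooled hits on the remaining subsystems is selected, and regret is the accuracy gap (in pp) between the better method on the held-out subsystem and the selected one.

```python
def loso_regret(cases_by_system):
    """Leave-one-system-out selection regret for a method pair (A, B).

    cases_by_system: {subsystem: [(a_correct, b_correct), ...]} on matched
    cases, with 0/1 top-1 hits (hypothetical layout).
    Returns {subsystem: (selected_method, regret_pp)}.
    """
    totals = {
        s: (len(c), sum(a for a, _ in c), sum(b for _, b in c))
        for s, c in cases_by_system.items()
    }
    result = {}
    for held_out, (n, hits_a, hits_b) in totals.items():
        # Pool hits over every subsystem except the held-out one.
        pool_a = sum(a for s, (_, a, _) in totals.items() if s != held_out)
        pool_b = sum(b for s, (_, _, b) in totals.items() if s != held_out)
        selected = "A" if pool_a >= pool_b else "B"  # assumed tie rule
        selected_hits = hits_a if selected == "A" else hits_b
        regret = 100.0 * (max(hits_a, hits_b) - selected_hits) / n
        result[held_out] = (selected, regret)
    return result


# Toy data (invented): 10 matched cases per subsystem.
cases_by_system = {
    "s1": [(1, 1)] * 4 + [(1, 0)] * 4 + [(0, 0)] * 2,  # A 8/10, B 4/10
    "s2": [(1, 1)] * 5 + [(1, 0)] * 2 + [(0, 0)] * 3,  # A 7/10, B 5/10
    "s3": [(1, 1)] * 2 + [(0, 1)] * 4 + [(0, 0)] * 4,  # A 2/10, B 6/10
}
regret = loso_regret(cases_by_system)
```

Subsystems where the selected method is also the better one have zero regret by construction, matching the convention stated for Figure 3.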

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pooled RCA leaderboards can pick the wrong method for a given subsystem, with leave-one-out regret up to 24.8 points on the data shown.

The main takeaway is that across the three benchmark families, every pairwise method comparison flips direction on at least some subsystems, the random-effects prediction intervals all cross zero, and selecting the pooled winner picks the lower-scoring method on up to five of eleven held-out subsystems.

What is new is the concrete leave-one-system-out numbers and the per-subsystem stability checks on 778 matched units. Prior RCA papers report pooled scores; this one quantifies how often that ranking misleads on individual subsystems and releases the 320-line module so others can run the same checks.

The work is straightforward: they keep comparisons on identical cases by restricting to the four methods with full coverage, run standard interaction tests, and report the results plainly. That matched design and the code release are the parts that hold up.

The soft spot is exactly the one the stress-test flags. Limiting to BARO, the CD adapter, max-|Z|, and alert-count means the sign flips and regret figures are conditional on that filtered set. If the excluded methods show less heterogeneity, the claim that pooled leaderboards systematically hide winners would be narrower than the abstract suggests. Eleven subsystems is also a thin base for broad advice about benchmark reporting.

This is for people who maintain or consume RCA benchmarks and for reliability teams that treat leaderboard ranks as deployment recommendations. It is narrow but the evidence on the retained methods is direct, so it deserves a serious referee who can check the data-handling details and see whether the four-method restriction is justified or needs expansion.

Referee Report

1 major / 0 minor

Summary. The manuscript audits the common practice of ranking offline root-cause analysis (RCA) methods via a single pooled top-1 accuracy across multiple subsystems. On three public benchmark families (OpenRCA, RCAEval, PetShop) spanning 11 subsystems and 778 matched scoring units, the authors retain four methods with complete coverage (BARO, CD-1min adapter, max-|Z|, per-service alert-count) and show that all six pairwise comparisons exhibit subsystem-level effects of both signs, every random-effects 95% prediction interval crosses zero, case-level interaction tests reject exchangeability in 5 of 6 pairs, and leave-one-system-out selection picks the lower-scoring method on up to 5 of 11 held-out subsystems (regret up to 24.8 pp on RCAEval/Sock-Shop). A 320-line audit module is released to recompute the checks from any matched score table.

Significance. If the results hold, the work demonstrates a concrete reporting-protocol flaw in RCA benchmarks and supplies a reusable tool for per-subsystem stability checks. The matched-pair design and release of reproducible code are strengths that allow direct verification and extension to new benchmarks.

major comments (1)

[Abstract / Methods] The restriction to the four methods with complete coverage (BARO, CD-1min adapter, max-|Z|, per-service alert-count) for the main pairwise analysis is load-bearing for the claim that pooled leaderboards hide system-specific winners. If methods excluded for incomplete coverage exhibit systematically different subsystem-level heterogeneity or if missingness correlates with subsystem traits, the observed sign flips, prediction intervals crossing zero, and exchangeability rejections may not generalize beyond this filtered set. A sensitivity analysis or explicit discussion of selection effects is needed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on selection effects from restricting to methods with complete coverage. We address the point directly below and will revise the manuscript to incorporate an explicit discussion of this limitation.

Referee: [Abstract / Methods] The restriction to the four methods with complete coverage (BARO, CD-1min adapter, max-|Z|, per-service alert-count) for the main pairwise analysis is load-bearing for the claim that pooled leaderboards hide system-specific winners. If methods excluded for incomplete coverage exhibit systematically different subsystem-level heterogeneity or if missingness correlates with subsystem traits, the observed sign flips, prediction intervals crossing zero, and exchangeability rejections may not generalize beyond this filtered set. A sensitivity analysis or explicit discussion of selection effects is needed.

Authors: The restriction to the four methods with complete coverage is required to ensure all pairwise comparisons use identical scoring units, as stated in the Methods section; without it, missingness would confound the matched-pair design and the random-effects models. We agree that this choice is load-bearing for the reported sign flips, prediction intervals, and exchangeability tests, and that the findings may not automatically extend to methods with incomplete coverage. A full sensitivity analysis is not possible from the public benchmark releases because the excluded methods lack scores on the same cases. In the revised manuscript we will add an explicit subsection in the Discussion that (i) lists the coverage patterns of the excluded methods, (ii) notes that missingness could in principle correlate with subsystem traits, and (iii) states that the released audit module is intended precisely to allow such checks once more complete score tables become available. This addition will qualify the scope of the current results without altering the core claim for the comparable methods.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical audit on public data with standard statistics

The paper performs a direct empirical audit of public RCA benchmark score tables using standard statistical procedures (random-effects models, 95% prediction intervals, case-level interaction tests, and leave-one-system-out selection). No derivation chain, fitted parameters, or predictions are defined in terms of the target quantities; all reported effects and regrets are computed from the 778 matched scoring units. No self-citations are load-bearing for the central claims, and the released audit module simply recomputes the same checks. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard statistical assumptions for random-effects models and matched-pair comparisons; no free parameters, ad-hoc axioms, or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Random-effects model assumptions hold for subsystem-level performance differences.
Invoked when constructing 95% prediction intervals that cross zero.
domain assumption: The 778 matched scoring units permit unbiased pairwise comparisons.
Required for the claim that all six pairwise comparisons are valid.
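To make the first axiom concrete, here is a minimal sketch of a random-effects 95% prediction interval over per-subsystem effects. It assumes the DerSimonian-Laird estimator for the between-subsystem variance and a t-based interval with k − 2 degrees of freedom; the page does not say which estimator the paper uses, so treat that choice, the function name, and the toy numbers as assumptions. The critical value `t_crit` is passed in rather than computed: for 11 subsystems (9 degrees of freedom) the two-sided 95% value is about 2.262, and for the 3-subsystem toy case (1 degree of freedom) it is about 12.706.

```python
import math


def dl_prediction_interval(effects, variances, t_crit):
    """DerSimonian-Laird random-effects mean, tau^2, and prediction interval.

    effects:   per-subsystem paired effects (e.g. pp acc@1, A minus B)
    variances: within-subsystem sampling variance of each effect
    t_crit:    two-sided critical value of t with k - 2 degrees of freedom
    """
    k = len(effects)
    if k < 3:
        raise ValueError("need at least 3 subsystems for k - 2 degrees of freedom")
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q and the method-of-moments estimate of tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Re-weight with the between-subsystem variance included.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    mu = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    se2 = 1.0 / sw_star
    half = t_crit * math.sqrt(tau2 + se2)
    return mu, tau2, (mu - half, mu + half)


# Toy data (invented): three subsystems with effects of both signs.
mu, tau2, (lo, hi) = dl_prediction_interval(
    effects=[10.0, -10.0, 0.0], variances=[1.0, 1.0, 1.0], t_crit=12.706
)
crosses_zero = lo < 0.0 < hi
```

An interval that crosses zero means the model does not rule out either method being ahead on a new, unseen subsystem.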

pith-pipeline@v0.9.1-grok · 5735 in / 1426 out tokens · 28884 ms · 2026-06-30T07:50:01.259881+00:00 · methodology

