pith. machine review for the scientific record.

arxiv: 2512.22113 · v3 · submitted 2025-12-26 · 💻 cs.DC · cs.AI · cs.SE

Recognition: 3 theorem links · Lean Theorem

PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:14 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.SE
keywords root cause analysis · microservices · program dependence graphs · LLM agents · cloud incidents · observability · incident diagnosis · agentic workflow

The pith

PRAXIS directs LLMs through service and code dependence graphs to diagnose cloud incidents more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRAXIS as an orchestrator for agentic workflows that diagnose the root causes of cloud incidents stemming from code or configuration problems. It directs an LLM through structured traversals of a service dependency graph, which captures microservice interactions, and a hammock-block program dependence graph, which captures code-level detail. On 30 real incidents, this approach is shown to outperform standard ReAct baselines, improving accuracy by up to 6.3x and cutting token use by 5.3x. Such gains matter because unresolved incidents are expensive, costing an average of over $2M per hour.

Core claim

PRAXIS integrates program analysis with observability for root-cause analysis: it manages an LLM-driven workflow that traverses service dependency graphs (capturing microservice-level dependencies) and hammock-block program dependence graphs (capturing code-level dependencies within each service), yielding up to 6.3x higher RCA accuracy and 5.3x lower token consumption than ReAct baselines on real-world incidents.

What carries the argument

The PRAXIS orchestrator, which uses LLM-structured traversal of service dependency graphs (SDGs) and hammock-block program dependence graphs (PDGs) to guide diagnosis.
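A minimal sketch of what that traversal amounts to at the service level, assuming a toy SDG and an LLM-judgment callable; the names here are illustrative, not the paper's API. The contrast with ReAct is that the candidate set at each step comes from the graph rather than from free-form tool calls.

```python
# Minimal sketch of SDG-guided root-cause search (hypothetical names, not the
# paper's API). The graph, not free-form tool use, bounds what the LLM inspects.
from collections import deque

def traverse_sdg(sdg, alerted_service, llm_judge, max_steps=20):
    """sdg: {service: [dependees it calls]}.
    llm_judge: callable(service) -> 'root_cause' | 'victim' | 'unrelated'.
    Returns the first service judged to be the root cause, or None."""
    queue, visited = deque([alerted_service]), set()
    while queue and max_steps > 0:
        focal = queue.popleft()
        if focal in visited:
            continue
        visited.add(focal)
        max_steps -= 1
        verdict = llm_judge(focal)
        if verdict == "root_cause":
            return focal
        if verdict == "victim":              # symptom propagated from a dependee:
            queue.extend(sdg.get(focal, []))  # investigate what this service calls
    return None

# Toy run mirroring Figure 1: the alert fires on 'recommendation', but the
# fault lives in the external database it depends on.
sdg = {"frontend": ["recommendation"], "recommendation": ["external-db"]}
judge = lambda s: "root_cause" if s == "external-db" else "victim"
assert traverse_sdg(sdg, "recommendation", judge) == "external-db"
```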

If this is right

  • Root-cause analysis becomes more precise by combining high-level service views with low-level code dependencies.
  • LLM agents consume fewer tokens, making repeated diagnostics more feasible in production.
  • Diagnosis can handle both code and configuration issues in complex microservice setups.
  • The method supplies a ready benchmark of 30 real incidents for comparing future RCA techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would require reliable automatic construction of these graphs for new services.
  • Similar graph-guided structures could improve LLM performance in other complex reasoning tasks like security auditing.
  • Companies might see faster incident response times leading to lower overall operational costs.

Load-bearing premise

Accurate service dependency graphs and hammock-block program dependence graphs can be built automatically for any production microservice system, and the LLM can follow them without introducing errors.
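The service-level half of this premise is the easier one to probe: SDG edges are commonly derivable from distributed traces. A minimal sketch under that assumption, using OpenTelemetry/Jaeger-style span fields (span_id, parent_id, service) rather than anything specified in the paper:

```python
# Minimal sketch of deriving a service dependency graph (SDG) from trace
# spans. The span fields (span_id, parent_id, service) follow common
# OpenTelemetry/Jaeger conventions; this is an assumption, not the paper's code.
from collections import defaultdict

def build_sdg(spans):
    """spans: iterable of dicts with 'span_id', 'parent_id', 'service'.
    Returns {caller_service: {callee_service, ...}}."""
    by_id = {s["span_id"]: s for s in spans}
    sdg = defaultdict(set)
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent and parent["service"] != span["service"]:
            sdg[parent["service"]].add(span["service"])  # caller -> callee edge
    return sdg

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "recommendation"},
    {"span_id": "c", "parent_id": "b", "service": "product-catalog"},
]
assert build_sdg(spans) == {"frontend": {"recommendation"},
                            "recommendation": {"product-catalog"}}
```

The PDG half is the harder claim: hammock-block construction requires real static analysis, and missed dynamic calls are exactly the failure mode the referee flags below.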

What would settle it

A collection of new incidents in which graph-guided traversal produced wrong root causes, or consumed more tokens than direct ReAct reasoning, would disprove the claimed improvements.

Figures

Figures reproduced from arXiv: 2512.22113 by Rahul Krishna, Ravishankar K. Iyer, Saurabh Jha, Shengkun Cui.

Figure 1: Incident: a degraded external database returned empty responses, triggering a silent retry loop in the Recommendation service that manifested solely as a high-latency alert associated with the Recommendation service, without explicit error logs or error traces. Cross-SDG-PDG traversal: (1) the LLM selects the Recommendation service for investigation based on the observed alert; (2) investigation of the Recommendat…
Figure 3: PRAXIS Phase 2: Initial microservice candidate(s) selection.
Figure 4: PRAXIS Phase 3: RCA decision-making. This process is repeated for the next focal entity that is (a) a dependee of the current focal entity and/or (b) suggested by the LLM based on the focal entity's RCA decision.
Figure 6: PRAXIS Phase 4: Final RCA summary.
Figure 5: Example LLM-driven PDG traversal.
Figure 7: RCA reasoning of PRAXIS (Obs. Ctx.) and PRAXIS. Nodes are microservices; solid arrows are dependencies; the green dotted arrow is a dependency with missing traces that had to be derived from program context.
Original abstract

Unresolved production cloud incidents cost an average of over $2M per hour. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 6.3x while reducing token consumption by 5.3x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PRAXIS, an orchestrator for an LLM-driven agentic workflow that performs root-cause analysis of cloud incidents by structured traversal over automatically constructed service dependency graphs (SDGs) at the microservice level and hammock-block program dependence graphs (PDGs) at the code level. It reports empirical results on a benchmark of 30 real-world incidents, claiming up to 6.3x higher RCA accuracy and 5.3x lower token consumption relative to ReAct baselines.

Significance. If the automatically generated graphs prove accurate at scale and the traversal avoids injecting errors, the integration of static program analysis with observability could meaningfully advance automated RCA for production microservices, addressing a high-cost problem. The structured-graph guidance of LLM agents is a concrete technical contribution worth exploring further.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.
  2. [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.
minor comments (1)
  1. [Abstract] The abstract and introduction refer to a 'comprehensive real-world incidents' benchmark that 'is being compiled'; clarify its current public status and any licensing or access details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and reproducibility, and we address each point below with commitments to revise the paper where needed.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.

    Authors: We agree that the Evaluation section lacks sufficient detail for independent verification. In the revised manuscript, we will explicitly define the accuracy metric as the percentage of incidents for which the system's identified root cause matches the ground-truth root cause. We will describe the incident selection criteria, which focus on real-world cloud incidents with publicly available or accessible telemetry data and source code. Ground-truth root causes were established via expert annotation cross-referenced with official incident reports. We will also add statistical significance testing (e.g., paired t-tests or McNemar's test) for the reported improvements. These changes will be incorporated into the next version. revision: yes

  2. Referee: [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.

    Authors: We acknowledge that quantitative validation of the SDGs and PDGs is missing from the current manuscript. In the revision, we will add precision/recall metrics for graph construction on the 30 incidents (computed against manually verified subsets where feasible). We will also include a discussion of failure modes, such as missed dynamic calls (mitigated by the LLM-driven traversal), configuration wiring errors, and third-party library handling. We maintain that the accuracy gains derive primarily from the structured traversal approach rather than graph perfection alone, as the same underlying data is available to the ReAct baselines; however, we will make this distinction clearer. revision: partial
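As a concrete instance of the significance testing proposed in response 1, an exact McNemar computation over paired per-incident pass/fail outcomes looks like this; the discordant counts below are invented for illustration, not taken from the paper:

```python
# Exact McNemar test on paired per-incident outcomes (two-sided binomial form).
# The discordant counts below are invented for illustration only.
from math import comb

def mcnemar_exact(b, c):
    """b = incidents only system A solved; c = incidents only system B solved.
    Returns the exact two-sided p-value under the null of equal accuracy."""
    n, k = b + c, min(b, c)
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# e.g., of 30 incidents: PRAXIS alone solves 11, the baseline alone solves 2.
print(round(mcnemar_exact(11, 2), 4))  # ~0.0225: unlikely under equal accuracy
```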

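And for the graph validation committed to in response 2, edge-level precision/recall against a manually verified edge set is straightforward to compute; the edge sets below are illustrative only:

```python
# Edge-level precision/recall of an automatically extracted SDG against a
# manually verified edge set. The edge sets below are illustrative only.

def edge_precision_recall(extracted, verified):
    """extracted, verified: sets of (caller, callee) service edges."""
    tp = len(extracted & verified)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(verified) if verified else 0.0
    return precision, recall

extracted = {("frontend", "recommendation"), ("recommendation", "cache")}
verified = {("frontend", "recommendation"), ("recommendation", "product-catalog")}
print(edge_precision_recall(extracted, verified))  # (0.5, 0.5)
```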
Circularity Check

0 steps flagged

No circularity: empirical results on external benchmark

full rationale

The paper reports PRAXIS accuracy and token reductions as direct empirical comparisons against ReAct baselines on a fixed set of 30 real-world incidents compiled into an RCA benchmark. No equations, fitted parameters, or derivations are present that reduce by construction to the same inputs; the central claims rest on observed performance differences rather than self-defined quantities or self-citation chains. The construction of SDGs and PDGs is presented as an engineering step whose accuracy is assumed for the evaluation, but this assumption is not turned into a circular prediction or uniqueness theorem within the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical engineering system rather than a formal derivation; the abstract introduces no explicit free parameters, mathematical axioms, or new postulated entities.

axioms (1)
  • domain assumption: LLM agents can reliably perform structured traversal over service and program dependence graphs for diagnosis.
    The workflow depends on the LLM's ability to follow the graphs without hallucinating incorrect causal links.

pith-pipeline@v0.9.0 · 5449 in / 1289 out tokens · 48802 ms · 2026-05-16T19:14:09.187476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
