pith. machine review for the scientific record.

arxiv: 2512.22113 · v3 · submitted 2025-12-26 · 💻 cs.DC · cs.AI · cs.SE

Recognition: 3 theorem links · Lean Theorem

PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:14 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.SE
keywords root cause analysis · microservices · program dependence graphs · LLM agents · cloud incidents · observability · incident diagnosis · agentic workflow

The pith

PRAXIS directs LLMs through service and code dependence graphs to diagnose cloud incidents more accurately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PRAXIS as an orchestrator for agentic workflows that diagnose the root causes of cloud incidents stemming from code or configuration problems. It directs an LLM through structured traversals of a service dependency graph, which captures microservice interactions, and a hammock-block program dependence graph, which captures code-level detail. On 30 real incidents, this approach is shown to outperform standard ReAct baselines, improving accuracy by up to 6.3x and cutting token use by 5.3x. Such gains matter because unresolved incidents are expensive, costing an average of over $2M per hour.

Core claim

PRAXIS integrates program analysis with observability for root-cause analysis: it manages an LLM-driven workflow that traverses service dependency graphs (capturing microservice-level dependencies) and hammock-block program dependence graphs (capturing code-level dependencies within each service), yielding up to 6.3x higher RCA accuracy and 5.3x lower token consumption than ReAct baselines on real-world incidents.

What carries the argument

The PRAXIS orchestrator, which uses LLM-structured traversal of service dependency graphs (SDGs) and hammock-block program dependence graphs (PDGs) to guide diagnosis.
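A minimal sketch of what that traversal amounts to at the service level, assuming a toy SDG and an LLM-judgment callable; the names here are illustrative, not the paper's API. The contrast with ReAct is that the candidate set at each step comes from the graph rather than from free-form tool calls.

```python
# Minimal sketch of SDG-guided root-cause search (hypothetical names, not the
# paper's API). The graph, not free-form tool use, bounds what the LLM inspects.
from collections import deque

def traverse_sdg(sdg, alerted_service, llm_judge, max_steps=20):
    """sdg: {service: [dependees it calls]}.
    llm_judge: callable(service) -> 'root_cause' | 'victim' | 'unrelated'.
    Returns the first service judged to be the root cause, or None."""
    queue, visited = deque([alerted_service]), set()
    while queue and max_steps > 0:
        focal = queue.popleft()
        if focal in visited:
            continue
        visited.add(focal)
        max_steps -= 1
        verdict = llm_judge(focal)
        if verdict == "root_cause":
            return focal
        if verdict == "victim":              # symptom propagated from a dependee:
            queue.extend(sdg.get(focal, []))  # investigate what this service calls
    return None

# Toy run mirroring Figure 1: the alert fires on 'recommendation', but the
# fault lives in the external database it depends on.
sdg = {"frontend": ["recommendation"], "recommendation": ["external-db"]}
judge = lambda s: "root_cause" if s == "external-db" else "victim"
assert traverse_sdg(sdg, "recommendation", judge) == "external-db"
```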

If this is right

  • Root-cause analysis becomes more precise by combining high-level service views with low-level code dependencies.
  • LLM agents consume fewer tokens, making repeated diagnostics more feasible in production.
  • Diagnosis can handle both code and configuration issues in complex microservice setups.
  • The method supplies a ready benchmark of 30 real incidents for comparing future RCA techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption would require reliable automatic construction of these graphs for new services.
  • Similar graph-guided structures could improve LLM performance in other complex reasoning tasks like security auditing.
  • Companies might see faster incident response times leading to lower overall operational costs.

Load-bearing premise

Accurate service dependency graphs and hammock-block program dependence graphs can be built automatically for any production microservice system, and the LLM can follow them without introducing errors.
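The service-level half of this premise is the easier one to probe: SDG edges are commonly derivable from distributed traces. A minimal sketch under that assumption, using OpenTelemetry/Jaeger-style span fields (span_id, parent_id, service) rather than anything specified in the paper:

```python
# Minimal sketch of deriving a service dependency graph (SDG) from trace
# spans. The span fields (span_id, parent_id, service) follow common
# OpenTelemetry/Jaeger conventions; this is an assumption, not the paper's code.
from collections import defaultdict

def build_sdg(spans):
    """spans: iterable of dicts with 'span_id', 'parent_id', 'service'.
    Returns {caller_service: {callee_service, ...}}."""
    by_id = {s["span_id"]: s for s in spans}
    sdg = defaultdict(set)
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent and parent["service"] != span["service"]:
            sdg[parent["service"]].add(span["service"])  # caller -> callee edge
    return sdg

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "recommendation"},
    {"span_id": "c", "parent_id": "b", "service": "product-catalog"},
]
assert build_sdg(spans) == {"frontend": {"recommendation"},
                            "recommendation": {"product-catalog"}}
```

The PDG half is the harder claim: hammock-block construction requires real static analysis, and missed dynamic calls are exactly the failure mode the referee flags below.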

What would settle it

A collection of new incidents in which graph-guided traversal produced wrong root causes, or consumed more tokens than direct ReAct reasoning, would disprove the claimed improvements.

Figures

Figures reproduced from arXiv: 2512.22113 by Rahul Krishna, Ravishankar K. Iyer, Saurabh Jha, Shengkun Cui.

Figure 1: Incident: a degraded external database returned empty responses, triggering a silent retry loop in the Recommendation service that manifested solely as a high-latency alert associated with the Recommendation service, without explicit error logs or error traces. Cross-SDG-PDG traversal: (1) the LLM selects the Recommendation service for investigation based on the observed alert; (2) investigation of the Recommendat…
Figure 3: PRAXIS Phase 2: Initial microservice candidate(s) selection.
Figure 4: PRAXIS Phase 3: RCA decision-making. This process is repeated for the next focal entity that is (a) a dependee of the current focal entity and/or (b) suggested by the LLM based on the focal entity's RCA decision.
Figure 6: PRAXIS Phase 4: Final RCA summary.
Figure 5: Example LLM-driven PDG traversal.
Figure 7: RCA reasoning of PRAXIS (Obs. Ctx.) and PRAXIS. Nodes are microservices; solid arrows are dependencies; the green dotted arrow is a dependency with missing traces that had to be derived from program context.
Original abstract

Unresolved production cloud incidents cost an average of over $2M per hour. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 6.3x while reducing token consumption by 5.3x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PRAXIS, an orchestrator for an LLM-driven agentic workflow that performs root-cause analysis of cloud incidents by structured traversal over automatically constructed service dependency graphs (SDGs) at the microservice level and hammock-block program dependence graphs (PDGs) at the code level. It reports empirical results on a benchmark of 30 real-world incidents, claiming up to 6.3x higher RCA accuracy and 5.3x lower token consumption relative to ReAct baselines.

Significance. If the automatically generated graphs prove accurate at scale and the traversal avoids injecting errors, the integration of static program analysis with observability could meaningfully advance automated RCA for production microservices, addressing a high-cost problem. The structured-graph guidance of LLM agents is a concrete technical contribution worth exploring further.

major comments (2)
  1. [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.
  2. [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.
minor comments (1)
  1. [Abstract] The abstract and introduction refer to a 'comprehensive real-world incidents' benchmark that 'is being compiled'; clarify its current public status and any licensing or access details.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and reproducibility, and we address each point below with commitments to revise the paper where needed.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.

    Authors: We agree that the Evaluation section lacks sufficient detail for independent verification. In the revised manuscript, we will explicitly define the accuracy metric as the percentage of incidents for which the system's identified root cause matches the ground-truth root cause. We will describe the incident selection criteria, which focus on real-world cloud incidents with publicly available or accessible telemetry data and source code. Ground-truth root causes were established via expert annotation cross-referenced with official incident reports. We will also add statistical significance testing (e.g., paired t-tests or McNemar's test) for the reported improvements. These changes will be incorporated into the next version. revision: yes

  2. Referee: [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.

    Authors: We acknowledge that quantitative validation of the SDGs and PDGs is missing from the current manuscript. In the revision, we will add precision/recall metrics for graph construction on the 30 incidents (computed against manually verified subsets where feasible). We will also include a discussion of failure modes, such as missed dynamic calls (mitigated by the LLM-driven traversal), configuration wiring errors, and third-party library handling. We maintain that the accuracy gains derive primarily from the structured traversal approach rather than graph perfection alone, as the same underlying data is available to the ReAct baselines; however, we will make this distinction clearer. revision: partial
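As a concrete instance of the significance testing proposed in response 1, an exact McNemar computation over paired per-incident pass/fail outcomes looks like this; the discordant counts below are invented for illustration, not taken from the paper:

```python
# Exact McNemar test on paired per-incident outcomes (two-sided binomial form).
# The discordant counts below are invented for illustration only.
from math import comb

def mcnemar_exact(b, c):
    """b = incidents only system A solved; c = incidents only system B solved.
    Returns the exact two-sided p-value under the null of equal accuracy."""
    n, k = b + c, min(b, c)
    p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# e.g., of 30 incidents: PRAXIS alone solves 11, the baseline alone solves 2.
print(round(mcnemar_exact(11, 2), 4))  # ~0.0225: unlikely under equal accuracy
```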

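And for the graph validation committed to in response 2, edge-level precision/recall against a manually verified edge set is straightforward to compute; the edge sets below are illustrative only:

```python
# Edge-level precision/recall of an automatically extracted SDG against a
# manually verified edge set. The edge sets below are illustrative only.

def edge_precision_recall(extracted, verified):
    """extracted, verified: sets of (caller, callee) service edges."""
    tp = len(extracted & verified)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(verified) if verified else 0.0
    return precision, recall

extracted = {("frontend", "recommendation"), ("recommendation", "cache")}
verified = {("frontend", "recommendation"), ("recommendation", "product-catalog")}
print(edge_precision_recall(extracted, verified))  # (0.5, 0.5)
```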
Circularity Check

0 steps flagged

No circularity: empirical results on external benchmark

full rationale

The paper reports PRAXIS accuracy and token reductions as direct empirical comparisons against ReAct baselines on a fixed set of 30 real-world incidents compiled into an RCA benchmark. No equations, fitted parameters, or derivations are present that reduce by construction to the same inputs; the central claims rest on observed performance differences rather than self-defined quantities or self-citation chains. The construction of SDGs and PDGs is presented as an engineering step whose accuracy is assumed for the evaluation, but this assumption is not turned into a circular prediction or uniqueness theorem within the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical engineering system rather than a formal derivation; the abstract introduces no explicit free parameters, mathematical axioms, or new postulated entities.

axioms (1)
  • domain assumption: LLM agents can reliably perform structured traversal over service and program dependence graphs for diagnosis.
    The workflow depends on the LLM's ability to follow the graphs without hallucinating incorrect causal links.

pith-pipeline@v0.9.0 · 5449 in / 1289 out tokens · 48802 ms · 2026-05-16T19:14:09.187476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
