PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-16 19:14 UTC · model grok-4.3
The pith
PRAXIS directs LLMs through service and code dependence graphs to diagnose cloud incidents more accurately.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRAXIS integrates program analysis with observability for root-cause analysis: an orchestrator manages an LLM-driven workflow that traverses service dependency graphs, which capture microservice-level dependencies, and hammock-block program dependence graphs, which capture code-level dependencies within each service. On real-world incidents this yields up to 6.3x higher RCA accuracy and 5.3x lower token consumption than ReAct baselines.
What carries the argument
The PRAXIS orchestrator, which uses LLM-structured traversal of service dependency graphs (SDGs) and hammock-block program dependence graphs (PDGs) to guide diagnosis.
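As a rough illustration of what graph-guided diagnosis of this kind looks like (a minimal sketch, not the paper's implementation; the `diagnose` function, the toy SDG, and the `suspicious` callback are illustrative assumptions):

```python
from collections import deque

def diagnose(sdg, alerting_service, suspicious):
    """Breadth-first walk of a service dependency graph (SDG) from the
    alerting service toward its upstream dependencies, returning the first
    service flagged as suspicious. The `suspicious` check stands in for
    the LLM-driven judgment described in the paper."""
    seen = {alerting_service}
    queue = deque([alerting_service])
    while queue:
        svc = queue.popleft()
        if suspicious(svc):
            # In PRAXIS, diagnosis would then descend into this
            # service's program dependence graph (PDG).
            return svc
        for dep in sdg.get(svc, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return None

# Toy SDG: frontend calls cart, cart calls redis.
sdg = {"frontend": ["cart"], "cart": ["redis"], "redis": []}
root = diagnose(sdg, "frontend", lambda s: s == "redis")  # → "redis"
```

The point of the structure is that the agent only ever inspects services reachable along dependency edges, rather than reasoning over all telemetry at once, which is where the claimed token savings would come from.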
If this is right
- Root-cause analysis becomes more precise by combining high-level service views with low-level code dependencies.
- LLM agents consume fewer tokens, making repeated diagnostics more feasible in production.
- Diagnosis can handle both code and configuration issues in complex microservice setups.
- The method supplies a ready benchmark of 30 real incidents for comparing future RCA techniques.
Where Pith is reading between the lines
- Adoption would require reliable automatic construction of these graphs for new services.
- Similar graph-guided structures could improve LLM performance in other complex reasoning tasks like security auditing.
- Companies might see faster incident response times leading to lower overall operational costs.
Load-bearing premise
Accurate service dependency graphs and hammock-block program dependence graphs can be automatically built for any production microservices, and the LLM can follow them without adding errors.
What would settle it
A collection of new incidents on which graph-guided traversal produced wrong root causes, or consumed more tokens than direct ReAct reasoning, would disprove the claimed improvements.
read the original abstract
Unresolved production cloud incidents cost an average of over $2M per hour. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 6.3x while reducing token consumption by 5.3x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PRAXIS, an orchestrator for an LLM-driven agentic workflow that performs root-cause analysis of cloud incidents by structured traversal over automatically constructed service dependency graphs (SDGs) at the microservice level and hammock-block program dependence graphs (PDGs) at the code level. It reports empirical results on a benchmark of 30 real-world incidents, claiming up to 6.3x higher RCA accuracy and 5.3x lower token consumption relative to ReAct baselines.
Significance. If the automatically generated graphs prove accurate at scale and the traversal avoids injecting errors, the integration of static program analysis with observability could meaningfully advance automated RCA for production microservices, addressing a high-cost problem. The structured-graph guidance of LLM agents is a concrete technical contribution worth exploring further.
major comments (2)
- [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.
- [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.
minor comments (1)
- [Abstract] The abstract and introduction refer to a 'comprehensive real-world incidents' benchmark that 'is being compiled'; clarify its current public status and any licensing or access details.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and reproducibility, and we address each point below with commitments to revise the paper where needed.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section: the central claim of up to 6.3x accuracy improvement is reported without defining the accuracy metric, specifying incident selection criteria for the 30-incident set, reporting statistical significance, or describing how ground-truth root causes were established. This prevents independent verification of the quantitative results.
Authors: We agree that the Evaluation section lacks sufficient detail for independent verification. In the revised manuscript, we will explicitly define the accuracy metric as the percentage of incidents for which the system's identified root cause matches the ground-truth root cause. We will describe the incident selection criteria, which focus on real-world cloud incidents with publicly available or accessible telemetry data and source code. Ground-truth root causes were established via expert annotation cross-referenced with official incident reports. We will also add statistical significance testing (e.g., paired t-tests or McNemar's test) for the reported improvements. These changes will be incorporated into the next version. revision: yes
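The metric and significance test the authors commit to can be sketched concretely (illustrative only; the toy per-incident outcomes and this exact-McNemar implementation are assumptions, not the authors' evaluation code):

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided McNemar p-value from discordant counts:
    b = incidents only system A diagnosed correctly,
    c = incidents only system B diagnosed correctly."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy per-incident outcomes (1 = correct root cause identified).
praxis_correct = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
react_correct  = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

accuracy = sum(praxis_correct) / len(praxis_correct)  # fraction matching ground truth
b = sum(p and not r for p, r in zip(praxis_correct, react_correct))
c = sum(r and not p for p, r in zip(praxis_correct, react_correct))
p_value = mcnemar_exact_p(b, c)
```

McNemar's test is the natural choice here because both systems are evaluated on the same paired incidents, so only the discordant pairs carry evidence about which system is better.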
-
Referee: [Graph Construction] Graph construction and traversal (likely §3–4): no precision/recall or other quantitative validation is provided for the automatically constructed SDGs and hammock-block PDGs on the 30 incidents. Without this, it is impossible to determine whether the reported gains arise from the PRAXIS method or from unusually accurate graphs; failure modes for missed dynamic calls, configuration wiring, or third-party libraries are also undiscussed.
Authors: We acknowledge that quantitative validation of the SDGs and PDGs is missing from the current manuscript. In the revision, we will add precision/recall metrics for graph construction on the 30 incidents (computed against manually verified subsets where feasible). We will also include a discussion of failure modes, such as missed dynamic calls (mitigated by the LLM-driven traversal), configuration wiring errors, and third-party library handling. We maintain that the accuracy gains derive primarily from the structured traversal approach rather than graph perfection alone, as the same underlying data is available to the ReAct baselines; however, we will make this distinction clearer. revision: partial
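Edge-level precision/recall of the kind promised above could be computed as follows (a hedged sketch; the edge sets and the `edge_precision_recall` helper are illustrative, not taken from the paper):

```python
def edge_precision_recall(predicted_edges, gold_edges):
    """Precision/recall of automatically extracted dependence edges
    against a manually verified gold set of edges."""
    predicted, gold = set(predicted_edges), set(gold_edges)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Toy example: the constructed SDG misses one dynamic-call edge
# ("cart" -> "currency") and contains one spurious edge ("frontend" -> "ads").
pred = {("frontend", "cart"), ("cart", "redis"), ("frontend", "ads")}
gold = {("frontend", "cart"), ("cart", "redis"), ("cart", "currency")}
prec, rec = edge_precision_recall(pred, gold)  # → (2/3, 2/3)
```

A missed dynamic-call edge lowers recall and can cut the traversal off from the true root cause, which is exactly the failure mode the referee asks the authors to quantify.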
Circularity Check
No circularity: empirical results on external benchmark
full rationale
The paper reports PRAXIS accuracy and token reductions as direct empirical comparisons against ReAct baselines on a fixed set of 30 real-world incidents compiled into an RCA benchmark. No equations, fitted parameters, or derivations are present that reduce by construction to the same inputs; the central claims rest on observed performance differences rather than self-defined quantities or self-citation chains. The construction of SDGs and PDGs is presented as an engineering step whose accuracy is assumed for the evaluation, but this assumption is not turned into a circular prediction or uniqueness theorem within the paper itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM agents can reliably perform structured traversal over service and program dependence graphs for diagnosis.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
PRAXIS employs an LLM-driven structured traversal over two types of graph: (1) a service dependency graph (SDG) ... (2) a hammock-block program dependence graph (PDG)
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
hammock blocks ... control, data, and call dependencies ... hierarchical nesting structure
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Unclear: the relation between the paper passage and the cited Recognition theorem.
improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
New Relic, Inc., “New Relic study reveals businesses face an annual median cost of $76 million from high-impact IT outages,” Sep. 2025, press release. [Online]. Available: https://newrelic.com/press-release/20250917
-
[2]
2025 observability forecast report,
New Relic, Inc., “2025 observability forecast report,” 2025, report. [Online]. Available: https://newrelic.com/sites/default/files/2025-09/new-relic-2025-observability-forecast-report.pdf
-
[3]
How to fight production incidents? An empirical study on a large-scale cloud service,
S. Ghosh, M. Shetty, C. Bansal, and S. Nath, “How to fight production incidents? An empirical study on a large-scale cloud service,” in Proceedings of the 13th Symposium on Cloud Computing, ser. SoCC ’22. New York, NY , USA: Association for Computing Machinery, 2022, pp. 126–141. [Online]. Available: https://doi.org/10.1145/3542929.3563482
-
[4]
RESOLVED: Current account payments may fail: Major outage 27/10/2017,
Monzo-engineer, “RESOLVED: Current account payments may fail: Major outage 27/10/2017,” Oct. 2017, Monzo Community Forum. [Online]. Available: https://community.monzo.com/t/resolved-current-account-payments-may-fail-major-outage-27-10-2017/26296/95
-
[6]
You broke Reddit: The Pi-Day outage,
Reddit-engineer, “You broke Reddit: The Pi-Day outage,” Mar. 2023, Reddit Engineering Blog. Accessed: 2025-10-09. [Online]. Available: https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
-
[7]
(2025) Summary of the Amazon DynamoDB service disruption in the Northern Virginia (US-EAST-1) region
Amazon Web Services. (2025) Summary of the Amazon DynamoDB service disruption in the Northern Virginia (US-EAST-1) region. AWS. Post-incident summary of the Oct 19–20, 2025 US-EAST-1 disruption. [Online]. Available: https://aws.amazon.com/message/101925/
-
[8]
Cloudflare outage on November 18, 2025,
M. Prince, “Cloudflare outage on November 18, 2025,” https://blog.cloudflare.com/18-november-2025-outage/, Nov. 2025, accessed: 2025-11-20.
-
[9]
The program structure tree: Computing control regions in linear time,
R. Johnson, D. Pearson, and K. Pingali, “The program structure tree: Computing control regions in linear time,” in Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation, ser. PLDI ’94. New York, NY, USA: Association for Computing Machinery, 1994, pp. 171–185. [Online]. Available: https://doi.org/10.1145/178243.178258
-
[10]
Using hammock graphs to structure programs,
F. Zhang and E. H. D’Hollander, “Using hammock graphs to structure programs,” IEEE Trans. Softw. Eng., vol. 30, no. 4, pp. 231–245, Apr. 2004. [Online]. Available: https://doi.org/10.1109/TSE.2004.1274043
-
[12]
Cloud bug study (cbs) database,
H. S. Gunawi et al., “Cloud bug study (CBS) database,” http://ucare.cs.uchicago.edu/projects/cbs/, UCARE Research Group, University of Chicago, 2014, accessed: 2025-06-01.
-
[13]
H. Jacobs, “Kubernetes failure stories,” https://codeberg.org/hjacobs/kubernetes-failure-stories, 2023, accessed: 2025-10-07.
-
[14]
What bugs live in the cloud? A study of 3000+ issues in cloud systems,
H. S. Gunawi, M. Hao, T. Leesatapornwongsa, T. Patana-anake, T. Do, J. Adityatama, K. J. Eliazar, A. Laksono, J. F. Lukman, V . Martin, and A. D. Satria, “What bugs live in the cloud? A study of 3000+ issues in cloud systems,” inProceedings of the ACM Symposium on Cloud Computing, ser. SOCC ’14. New York, NY , USA: Association for Computing Machinery, 201...
-
[15]
What bugs cause production cloud incidents?
H. Liu, S. Lu, M. Musuvathi, and S. Nath, “What bugs cause production cloud incidents?” inProceedings of the Workshop on Hot Topics in Operating Systems, ser. HotOS ’19. New York, NY , USA: Association for Computing Machinery, 2019, pp. 155–162. [Online]. Available: https://doi.org/10.1145/3317550.3321438
-
[16]
ITBench: Evaluating AI agents across diverse real-world IT automation tasks,
S. Jha, R. R. Arora, Y . Watanabe, T. Yanagawa, Y . Chen, J. Clark, B. Bhavya, M. Verma, H. Kumar, H. Kitahara, N. Zheutlin, S. Takano, D. Pathak, F. George, X. Wu, B. O. Turkkan, G. Vanloo, M. Nidd, T. Dai, O. Chatterjee, P. Gupta, S. Samanta, P. Aggarwal, R. Lee, J. wook Ahn, D. Kar, A. Paradkar, Y . Deng, P. Moogi, P. Mohapatra, N. Abe, C. Narayanaswam...
-
[17]
RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,
Z. Wang, Z. Liu, Y . Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, “RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,” inProceedings of the 33rd ACM International Conference on Information and Knowledge Management, ser. CIKM ’24. New York, NY , USA: Association for Computing Machinery, 2024, ...
-
[18]
Stratus: A multi-agent system for autonomous reliability engineering of modern clouds,
Y . Chen, J. Pan, J. Clark, Y . Su, N. Zheutlin, B. Bhavya, R. Arora, Y . Deng, S. Jha, and T. Xu, “Stratus: A multi-agent system for autonomous reliability engineering of modern clouds,” inProceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025, accepted; preprint available at arXiv:2506.02009. [Online]. Available...
-
[19]
Exploring LLM-based agents for root cause analysis,
D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Exploring LLM-based agents for root cause analysis,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY , USA: Association for Computing Machinery, 2024, pp. 208–219. [Online]. Available:...
-
[20]
OpenRCA: Can large language models locate the root cause of software failures?
J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang, “OpenRCA: Can large language models locate the root cause of software failures?” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=M4qNIzQYpd
-
[21]
COCA: Generative root cause analysis for distributed systems with code knowledge,
Y . Li, Y . Wu, J. Liu, Z. Jiang, Z. Chen, G. Yu, and M. R. Lyu, “COCA: Generative root cause analysis for distributed systems with code knowledge,” inProceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE ’25), 2025. [Online]. Available: https://dl.acm.org/doi/10.1109/ICSE55347.2025.00234
-
[22]
AIOpsLab: A holistic framework to evaluate AI agents for enabling autonomous clouds,
Y . Chen, M. Shetty, G. Somashekar, M. Ma, Y . Simmhan, J. Mace, C. Bansal, R. Wang, and S. Rajmohan, “AIOpsLab: A holistic framework to evaluate AI agents for enabling autonomous clouds,” inProceedings of MLSys ’25, 2025. [Online]. Available: https://openreview.net/forum?id=3EXBLwGxtq
- [23]
-
[24]
The program dependence graph and its use in optimization,
J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program dependence graph and its use in optimization,”ACM Trans. Program. Lang. Syst., vol. 9, no. 3, pp. 319–349, Jul. 1987. [Online]. Available: https://doi.org/10.1145/24039.24041
-
[25]
Prometheus: Monitoring system & time series database,
Prometheus Authors, “Prometheus: Monitoring system & time series database,” 2025, open-source systems monitoring and alerting toolkit. [Online]. Available: https://prometheus.io/
-
[26]
Jaeger: Open source, distributed tracing platform,
Jaeger Project, “Jaeger: Open source, distributed tracing platform,” 2025, originally open-sourced by Uber; CNCF project. [Online]. Available: https://www.jaegertracing.io/
-
[27]
ClickHouse: Fast open-source OLAP DBMS,
ClickHouse, Inc., “ClickHouse: Fast open-source OLAP DBMS,” 2025, column-oriented database for real-time analytics. [Online]. Available: https://clickhouse.com/
-
[28]
MicroRCA: Root cause localization of performance issues in microservices,
L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “MicroRCA: Root cause localization of performance issues in microservices,” in NOMS 2020: 2020 IEEE/IFIP Network Operations and Management Symposium, 2020, pp. 1–9. [Online]. Available: https://doi.org/10.1109/NOMS47738.2020.9110353
-
[29]
Sage: Practical and scalable ML-driven performance debugging in microservices,
Y . Gan, M. Liang, S. Dev, D. Lo, and C. Delimitrou, “Sage: Practical and scalable ML-driven performance debugging in microservices,” inProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’21. New York, NY , USA: Association for Computing Machinery, 2021, pp. 135–151. [...
-
[30]
Datadog cloud monitoring platform,
Datadog, “Datadog cloud monitoring platform,” https://www.datadoghq.com/, 2025, accessed: 2025-06-01.
-
[31]
tree-sitter/tree-sitter: v0.25.10,
M. Brunsfeld, A. Qureshi, A. Hlynskyi, ObserverOfTime, W. Lillis, J. Vera, dundargoc, P. Turnbull, T. Clem, D. Creager, A. Helwer, R. Rix, D. Kavolis, C. Clason, M. Davis, R. Bruins, A. Delpeuch, Ika, A. Ya, T.-A. Nguy ˜en, bfredl, S. Brunk, M. Massicotte, N. Hasabnis, J. McCoy, M. Dong, S. Moelius, S. Kalt, and Kolja, “tree-sitter/tree-sitter: v0.25.10,”...
-
[32]
Codellm-Devkit: A framework for contextualizing code LLMs with program analysis insights,
R. Krishna, R. Pan, S. Sinha, S. Tamilselvam, R. Pavuluri, and M. Vukovic, “Codellm-Devkit: A framework for contextualizing code LLMs with program analysis insights,” inProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, ser. FSE Companion ’25. New York, NY , USA: Association for Computing Machinery, 2025, pp. ...
-
[33]
OpenTelemetry demo: Astronomy shop microservices,
OpenTelemetry authors, “OpenTelemetry demo: Astronomy shop microservices,” https://github.com/open-telemetry/opentelemetry-demo, 2025, microservice-based distributed system illustrating OpenTelemetry in a near real-world environment. Accessed: 2025-11-28
-
[34]
Recommending root-cause and mitigation steps for cloud incidents using large language models,
T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” inProceedings of the 45th International Conference on Software Engineering, ser. ICSE ’23. IEEE Press, 2023, pp. 1737–1749. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00149
-
[35]
Swe-agent: agent-computer interfaces enable automated software engineering,
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “Swe-agent: agent-computer interfaces enable automated software engineering,” inProceedings of the 38th International Confer- ence on Neural Information Processing Systems, ser. NIPS ’24. Red Hook, NY , USA: Curran Associates Inc., 2024
-
[36]
Google Cloud, “Lifecycle of an incident,” https://docs.cloud.google.com/service-health/docs/incident-lifecycle, n.d., accessed: 2025-12-01.
-
[37]
OpenTelemetry specification: Overview,
OpenTelemetry Authors, “OpenTelemetry specification: Overview,” https://opentelemetry.io/docs/specs/otel/overview/, 2025. Accessed: 2025-11-28.
-
[38]
An empirical study of production incidents in generative AI cloud services,
H. Yan, Y . Chen, M. Ma, M. Wen, S. Lu, S. Zhang, T. Xu, R. Wang, C. Bansal, S. Rajmohan, C. Zhang, and D. Zhang, “An empirical study of production incidents in generative AI cloud services,”CoRR, vol. abs/2504.08865, 2025, accepted to ISSRE 2025; preprint on arXiv. [Online]. Available: https://arxiv.org/abs/2504.08865
-
[39]
Mutiny! how does kubernetes fail, and what can we do about it?
M. Barletta, M. Cinque, C. Di Martino, Z. T. Kalbarczyk, and R. K. Iyer, “Mutiny! how does kubernetes fail, and what can we do about it?” in 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2024, pp. 1–14
-
[40]
25% or 6 to 4: The 11/6/23 authentication outage,
Discord Engineering, “25% or 6 to 4: The 11/6/23 authentication outage,” Nov. 2023, accessed: 2025-12-04. [Online]. Available: https://discord.com/blog/authentication-outage
-
[41]
Details of the Cloudflare outage on July 2, 2019,
J. Graham-Cumming, “Details of the Cloudflare outage on July 2, 2019,” Jul. 2019, accessed: 2025-12-04. [Online]. Available: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
-
[42]
External technical root cause analysis — channel file 291,
CrowdStrike, “External technical root cause analysis — channel file 291,” Aug. 2024, accessed: 2025-12-04. [Online]. Available: https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf
-
[43]
Incident report: Spotify outage on april 16, 2025,
Spotify Engineering, “Incident report: Spotify outage on April 16, 2025,” May 2025, accessed: 2025-12-04. [Online]. Available: https://engineering.atspotify.com/2025/05/incident-report-spotify-outage-april-16
-
[44]
About the Quay.io outage: Post mortem,
B. Dettelback, “About the Quay.io outage: Post mortem,” Aug. 2020, accessed: 2025-12-04. [Online]. Available: https://www.redhat.com/en/blog/about-the-quay.io-outage-post-mortem
-
[45]
React: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X
-
[47]
Evaluating Large Language Models Trained on Code
[Online]. Available: https://arxiv.org/abs/2107.03374
-
[48]
LangGraph: Build resilient, stateful multi-agent workflows for LLM applications,
LangChain AI, “LangGraph: Build resilient, stateful multi-agent workflows for LLM applications,” 2025, open-source framework for long-running, controllable LLM agents; integrates with LangChain. [Online]. Available: https://github.com/langchain-ai/langgraph
-
[49]
TIOBE Software BV. TIOBE index for October 2025. Monthly updated indicator of the popularity of programming languages. [Online]. Available: https://www.tiobe.com/tiobe-index/
-
[50]
SWE-bench: Can language models resolve real-world GitHub issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66
-
[51]
A taxonomy of failures in tool-augmented LLMs,
C. Winston and R. Just, “A taxonomy of failures in tool-augmented LLMs,” in2025 IEEE/ACM International Conference on Automation of Software Test (AST), 2025, pp. 125–135. [Online]. Available: https://doi.org/10.1109/AST66626.2025.00019
-
[52]
Q. Xiong, Y . Huang, Z. Jiang, Z. Chang, Y . Zheng, T. Li, and M. Li, “Butterfly effects in toolchains: A comprehensive analysis of failed parameter filling in LLM tool-agent systems,” inFindings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for C...
-
[53]
A. Gola, “Multi needle in a haystack,” https://blog.langchain.com/multi-needle-in-a-haystack/, Mar. 2024, accessed: 2025-11-23.
-
[54]
Context rot: How increasing input tokens impacts llm performance,
K. Hong, A. Troynikov, and J. Huber, “Context rot: How increasing input tokens impacts llm performance,” Chroma, Tech. Rep., July 2025. [Online]. Available: https://research.trychroma.com/context-rot
-
[55]
Automatic root cause analysis via large language models for cloud incidents,
Y . Chen, H. Xie, M. Ma, Y . Kang, X. Gao, L. Shi, Y . Cao, X. Gao, H. Fan, M. Wen, J. Zeng, S. Ghosh, X. Zhang, C. Zhang, Q. Lin, S. Rajmohan, D. Zhang, and T. Xu, “Automatic root cause analysis via large language models for cloud incidents,” inProceedings of the Nineteenth European Conference on Computer Systems, ser. EuroSys ’24. New York, NY , USA: As...
-
[56]
DeepLog: Anomaly detection and diagnosis from system logs through deep learning,
M. Du, F. Li, G. Zheng, and V . Srikumar, “DeepLog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’17. New York, NY , USA: Association for Computing Machinery, 2017, pp. 1285–1298. [Online]. Available: https://doi.org/10.1145/31339...
-
[57]
Time-series anomaly detection service at Microsoft,
H. Ren, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, J. Tong, and Q. Zhang, “Time-series anomaly detection service at Microsoft,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 3009–. [Online]. Available: https://doi.org/10.1145/3292500.3330680
-
[59]
Practical root cause localization for microservice systems via trace analysis,
Z. Li, J. Chen, R. Jiao, N. Zhao, Z. Wang, S. Zhang, Y . Wu, L. Jiang, L. Yan, Z. Wang, Z. Chen, W. Zhang, X. Nie, K. Sui, and D. Pei, “Practical root cause localization for microservice systems via trace analysis,” in2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), 2021, pp. 1–10. [Online]. Available: https://doi.org/10.1109/IWQO...
-
[60]
Live forensics for HPC systems: A case study on distributed storage systems,
S. Jha, S. Cui, S. S. Banerjee, T. Xu, J. Enos, M. Showerman, Z. T. Kalbarczyk, and R. K. Iyer, “Live forensics for HPC systems: A case study on distributed storage systems,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’20. IEEE Press, 2020. [Online]. Available: https://doi.org/10.1109/sc41405.2020.00069
-
[61]
CloudRCA: A root cause analysis framework for cloud computing platforms,
Y . Zhang, Z. Guan, H. Qian, L. Xu, H. Liu, Q. Wen, L. Sun, J. Jiang, L. Fan, and M. Ke, “CloudRCA: A root cause analysis framework for cloud computing platforms,” inProceedings of the 30th ACM International Conference on Information & Knowledge Management, ser. CIKM ’21. New York, NY , USA: Association for Computing Machinery, 2021, pp. 4373–4382. [Onlin...
-
[62]
Xpert: Empowering incident management with query recommendations via large language models,
Y . Jiang, C. Zhang, S. He, Z. Yang, M. Ma, S. Qin, Y . Kang, Y . Dang, S. Rajmohan, Q. Lin, and D. Zhang, “Xpert: Empowering incident management with query recommendations via large language models,” inProceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. New York, NY , USA: Association for Computing Machinery,...
-
[63]
X-lifecycle learning for cloud incident management using llms,
D. Goel, F. Husain, A. Singh, S. Ghosh, A. Parayil, C. Bansal, X. Zhang, and S. Rajmohan, “X-lifecycle learning for cloud incident management using llms,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY , USA: Association for Computing Machinery, 2024, pp. 417–428. [O...
-
[64]
Assess and summarize: Improve outage understanding with large language models,
P. Jin, S. Zhang, M. Ma, H. Li, Y . Kang, L. Li, Y . Liu, B. Qiao, C. Zhang, P. Zhao, S. He, F. Sarro, Y . Dang, S. Rajmohan, Q. Lin, and D. Zhang, “Assess and summarize: Improve outage understanding with large language models,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engine...
-
[65]
FLASH: A workflow automation agent for diagnosing recurring incidents,
X. Zhang, T. Mittal, C. Bansal, R. Wang, M. Ma, Z. Ren, H. Huang, and S. Rajmohan, “FLASH: A workflow automation agent for diagnosing recurring incidents,” Microsoft Research Technical Report (preprint), 2024. [Online]. Available: https://www.microsoft.com/en-us/research/wp-content/uploads/2024/10/FLASH_Paper.pdf
-
[67]
Y . Han, Q. Du, Y . Huang, J. Wu, F. Tian, and C. He, “The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’24. New York, NY , USA: Association for Computing Machinery, 2024, pp. 931–94...
-
[68]
Flow-of-action: SOP enhanced LLM-based multi-agent system for root cause analysis,
C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li, G. Xie, and D. Pei, “Flow-of-action: SOP enhanced LLM-based multi-agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, ser. WWW ’25. New York, NY , USA: Association for Computing Machinery, 2025, pp. 422–431. [Online]. Available:...
-
[69]
Graphs meet AI agents: Taxonomy, progress, and future opportunities,
Y . Bei, W. Zhang, S. Wang, W. Chen, S. Zhou, H. Chen, Y . Li, J. Bu, S. Pan, Y . Yu, I. King, F. Karray, and P. S. Yu, “Graphs meet AI agents: Taxonomy, progress, and future opportunities,” 2025. [Online]. Available: https://arxiv.org/abs/2506.18019
-
[70]
Graph chain-of- thought: Augmenting large language models by reasoning on graphs,
B. Jin, C. Xie, J. Zhang, K. K. Roy, Y . Zhang, Z. Li, R. Li, X. Tang, S. Wang, Y . Meng, and J. Han, “Graph chain-of- thought: Augmenting large language models by reasoning on graphs,” inFindings of ACL, 2024, pp. 163–184. [Online]. Available: https://aclanthology.org/2024.findings-acl.11.pdf
-
[71]
Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,
J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y . Gong, L. Ni, H.-Y . Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=nnVO1PvbTv
-
[72]
Plan-on- graph: Self-correcting adaptive planning of large language model on knowledge graphs,
L. Chen, P. Tong, Z. Jin, Y . Sun, J. Ye, and H. Xiong, “Plan-on- graph: Self-correcting adaptive planning of large language model on knowledge graphs,” inProceedings of the 38th International Conference on Neural Information Processing Systems, ser. NIPS ’24. Red Hook, NY , USA: Curran Associates Inc., 2025. [Online]. Available: https://doi.org/10.52202/...
-
[73]
Paths-over-graph: Knowledge graph empowered large language model reasoning,
X. Tan, X. Wang, Q. Liu, X. Xu, X. Yuan, and W. Zhang, “Paths-over-graph: Knowledge graph empowered large language model reasoning,” inProceedings of the ACM on Web Conference 2025, ser. WWW ’25. New York, NY , USA: Association for Computing Machinery, 2025, pp. 3505–3522. [Online]. Available: https://doi.org/10.1145/3696410.3714892
-
[74]
LocAgent: Graph-guided LLM agents for code localization,
Z. Chen, R. Tang, G. Deng, F. Wu, J. Wu, Z. Jiang, V. Prasanna, A. Cohan, and X. Wang, “LocAgent: Graph-guided LLM agents for code localization,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, 2025. [Online]. Available: https://aclanthology.org/2025.acl-long.426/
-
[76]
ErrorPrism: Reconstructing error propagation paths in cloud service systems,
J. Pu, Y . Li, Z. Chen, J. Liu, Z. Jiang, J. Chen, R. Shi, Z. Zheng, and T. Zhang, “ErrorPrism: Reconstructing error propagation paths in cloud service systems,”arXiv preprint arXiv:2509.26463, 2025. [Online]. Available: https://arxiv.org/abs/2509.26463
-
[77]
Root cause analysis for microservice system based on causal inference: How far are we?
L. Pham, H. Ha, and H. Zhang, “Root cause analysis for microservice system based on causal inference: How far are we?” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’24. New York, NY , USA: Association for Computing Machinery, 2024, pp. 706–715. [Online]. Available: https://doi.org/10.1145/3691620.3695065
discussion (0)