arxiv: 2510.10074 · v2 · submitted 2025-10-11 · 💻 cs.AI

StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

Pith reviewed 2026-05-18 07:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords troubleshooting guides · incident management · agentic framework · LLM automation · execution DAG · parallel execution · query preparation · IT systems reliability

The pith

StepFly automates troubleshooting guide execution for IT incidents by converting them into structured DAGs that support parallel steps and achieve around 94 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StepFly, an end-to-end agentic framework designed to replace slow and error-prone manual runs of troubleshooting guides in large-scale IT systems. It structures the process into three stages: a quality-improvement tool for engineers, offline conversion of text guides into executable DAGs plus supporting plugins, and an online executor that follows the DAG while allowing independent steps to run in parallel. A sympathetic reader would care because reliable automation here could shorten the time from incident detection to resolution without requiring constant human oversight. Real-world tests show the system outperforms simpler LLM approaches while using less time and fewer tokens overall.

Core claim

StepFly presents a three-stage agentic workflow for troubleshooting guide automation: the first stage supplies a TSG Mentor tool to help site reliability engineers improve guide quality; the second stage uses LLMs offline to extract structured execution DAGs from unstructured guides and to build dedicated Query Preparation Plugins; the third stage runs online via a DAG-guided scheduler-executor equipped with a memory system that enforces correct ordering and enables parallel execution of independent steps, yielding approximately 94 percent success on GPT-4.1 together with lower time and token costs than baselines.

What carries the argument

A DAG-guided scheduler-executor with memory system that enforces workflow correctness and permits parallel runs of independent steps extracted from the original guide.
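The idea of a DAG-guided scheduler can be illustrated with a minimal sketch. It is not StepFly's implementation: the function name `run_dag`, the step names, and the thread-pool design are assumptions made for illustration, and the conditional edges and memory system described in the paper are left out. Each step starts as soon as all of its prerequisites have finished, so steps with no mutual dependency run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_dag(steps, deps, max_workers=4):
    """Run each step once all of its prerequisites have finished.

    steps: dict mapping step name -> zero-argument callable (hypothetical)
    deps:  dict mapping step name -> set of prerequisite step names
    Returns (results, order), where order is the completion order.
    """
    remaining = {name: set(deps.get(name, ())) for name in steps}
    results, order, running = {}, [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every step whose prerequisites are all satisfied.
            ready = [name for name, pre in remaining.items() if not pre]
            for name in ready:
                del remaining[name]
                running[pool.submit(steps[name])] = name
            if not running:
                # Nothing is ready and nothing is running: the graph is stuck.
                raise ValueError("cycle or unknown dependency in DAG")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                order.append(name)
                # Mark this step as satisfied for everything still waiting.
                for pre in remaining.values():
                    pre.discard(name)
    return results, order


# Hypothetical four-step guide: two independent queries sit between
# an initial check and a final summary.
steps = {
    "check_alert": lambda: "alert ok",
    "query_logs": lambda: "logs",
    "query_metrics": lambda: "metrics",
    "summarize": lambda: "summary",
}
deps = {
    "query_logs": {"check_alert"},
    "query_metrics": {"check_alert"},
    "summarize": {"query_logs", "query_metrics"},
}
results, order = run_dag(steps, deps)
```

In this example `query_logs` and `query_metrics` may finish in either order, but `check_alert` always completes first and `summarize` last, which is the ordering guarantee the DAG is meant to enforce.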

If this is right

  • Troubleshooting guides become executable with far less manual intervention once converted to DAG form.
  • Independent steps identified in the DAG can run simultaneously, cutting execution time by 32.9 to 70.4 percent for parallelizable guides.
  • The framework handles complex control flow and data-heavy queries through its dedicated plugins.
  • Built-in quality tools let engineers fix common problems in existing guides before automation begins.
  • Reduced token usage and faster runs make repeated executions of the same guides more practical at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same preprocessing and scheduling pattern could apply to other domains that rely on written procedural checklists, such as network configuration or software deployment scripts.
  • Over time the memory system might accumulate patterns from past runs to suggest better parallel groupings without human input.
  • As base models improve, the success rate on guides with especially intricate logic could rise further without changes to the overall architecture.
  • Production use would likely shift SRE effort away from step-by-step execution toward reviewing exceptions and updating the underlying guides.

Load-bearing premise

Large language models can accurately turn unstructured real-world troubleshooting guides into correct execution DAGs and working Query Preparation Plugins during the offline stage.

What would settle it

A case in which the extracted DAG produces the wrong step order or causes a data query to fail on a known incident would show that the offline preprocessing step is unreliable.

Figures

Figures reproduced from arXiv: 2510.10074 by Chaoyun Zhang, Dongmei Zhang, Jiayi Mao, Liqun Li, Qingwei Lin, Samia Khalid, Saravan Rajmohan, Shilin He, Si Qin, Sitaram Lanka, Yanjie Gao, Zegang Peng.

Figure 1
Workflow of a real TSG for diagnosing availability incidents of a large online service. Typically, each step within a TSG includes: (1) a concise title summarizing the action; (2) a detailed description, which may include instructions, commands, queries, etc.; (3) a specification of expected outcomes; and (4) flow control logic that dictates the subsequent step based on the observed outcomes. A TSG termin…
Figure 2
Statistics on TSG characteristics: (a) Token count distribution, (b) Step count distribution, (c) Tool…
Figure 3
TSG Issue Distribution. Each major issue category is further decomposed into subcategories. For example, the Data Flow Issues (DF) category includes sub-issues such as “Unknown Input Source”, “Wrong Input Source”, and “Missing Parameters”; the Control Flow Issues (CF) group includes sub-issues such as “Unable To Infer Next Step” and “Wrong Next Step”. The detailed issue taxonomy is omitted due to space c…
Figure 4
The Proposed Approach. Motivated by Finding 4, we provide detailed guidance on writing high-quality TSGs, based on our empirical study in Section 3.3. We introduce a tool called TSG Mentor to help SREs identify quality issues in their TSGs. The specifics are discussed in Section 4.1. We preprocess TSGs by creating an execution DAG, which serves as a structural blueprint of the workflow, with steps as nodes …
Figure 5
The execution DAG of the example TSG shown in Fig. 1. The conditional edges are associated with the…
Figure 6
The process of running a log query. A critical observation is that these queries are structured as templates, utilizing either explicit placeholders (e.g., “where DeployRing == ‘{ring}’”) or implicit ones that could be inferred from context (e.g., “replace the StartTime to the incident’s start time when using the query”). Leveraging this characteristic, we propose using an LLM to extract these templates …
Figure 7
Cross-plugin data flow via memory. The memory architecture follows a key-value paradigm where string-based keys provide unique data identification and values accommodate arbitrary data types, ranging from primitive types (strings, numbers, lists, dictionaries) to complex structured data (DataFrames, ndarrays). Our implementation leverages MongoDB [22] as the underlying storage backend, because it support…
Figure 8
The execution DAG of the TSG in Fig. 1 after parallelization. The steps that can run concurrently are…
Figure 9
Comparison of time (left) and token consumption (right)
Figure 10
Time consumption comparison between sequential execution and parallelized execution with different…
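
The captions of Figures 6 and 7 describe two mechanisms that fit together: queries written as templates with placeholders such as `{ring}`, and a key-value memory that passes data between plugins. The sketch below shows one way the two could interact. It is an illustration under stated assumptions, not the paper's code: the `Memory` class is an in-memory dictionary standing in for the MongoDB backend the caption mentions, and `placeholders`, `prepare_query`, and the `Logs` table name are hypothetical.

```python
import re


class Memory:
    """In-memory stand-in for a key-value store with string keys.

    Assumption: a plain dict replaces the MongoDB backend named in Figure 7.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        if key not in self._data:
            raise KeyError(f"no value stored under {key!r}")
        return self._data[key]


# Matches explicit placeholders of the form {name}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def placeholders(template):
    """Return placeholder names in order of first appearance."""
    seen = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def prepare_query(template, memory):
    """Fill every {name} placeholder with the value stored under that key."""
    return PLACEHOLDER.sub(lambda m: str(memory.get(m.group(1))), template)


memory = Memory()
memory.put("ring", "Prod")  # e.g. written by an earlier step
template = "Logs | where DeployRing == '{ring}'"
query = prepare_query(template, memory)
print(query)  # Logs | where DeployRing == 'Prod'
```

Values are inserted verbatim here; a production version would need to validate or escape them before sending the query to a log backend. Implicit placeholders, such as the StartTime instruction quoted in the Figure 6 caption, would need a separate extraction step that this sketch does not attempt.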
Original abstract

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces StepFly, a three-stage agentic framework for automating troubleshooting guide (TSG) execution in large-scale IT incident management. It reports an empirical study on 92 real-world TSGs to identify challenges, introduces a TSG Mentor tool for quality improvement, performs offline LLM-based extraction of structured execution DAGs and Query Preparation Plugins (QPPs), and executes online via a DAG-guided scheduler-executor with memory support for correct workflow and parallel execution of independent steps. The central empirical claim is a ~94% success rate on GPT-4.1, outperforming baselines with lower time and token consumption, plus execution time reductions of 32.9% to 70.4% for parallelizable TSGs.

Significance. If the performance claims hold under more rigorous validation, this work would represent a meaningful engineering contribution to LLM-based automation of structured operational workflows, specifically addressing TSG quality issues, control-flow interpretation, data-intensive queries, and parallelism exploitation. The public release of code and sample data at the provided GitHub link is a clear strength that supports reproducibility and follow-on research in AI for systems operations.

major comments (3)
  1. [Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.
  2. [Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.
  3. [Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.
minor comments (2)
  1. [Abstract] Abstract: the model is referred to as 'GPT-4.1'; clarify the precise identifier (e.g., GPT-4o, GPT-4 Turbo) and version used in all experiments.
  2. [TSG Mentor] The TSG Mentor description would be strengthened by one or two concrete before/after examples of quality improvements it enables for SREs.
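
Major comment 2 asks for precision and recall against human-annotated DAGs. One simple way such a metric could be computed is at the edge level, treating each DAG as a set of (source, target) pairs. The sketch below is an assumption about how that evaluation might look; neither the function name nor the edge-set representation comes from the paper, and it ignores edge conditions and node content.

```python
def edge_precision_recall(extracted_edges, reference_edges):
    """Compare an extracted DAG with a reference DAG at the edge level.

    Both arguments are iterables of (source_step, target_step) pairs.
    Returns (precision, recall); an empty edge set yields 0.0 for its metric.
    """
    extracted, reference = set(extracted_edges), set(reference_edges)
    true_positive = len(extracted & reference)
    precision = true_positive / len(extracted) if extracted else 0.0
    recall = true_positive / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotation: the extractor mis-routes one edge (s2 -> s3
# instead of s1 -> s3), so three of four edges agree.
reference = {("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4")}
extracted = {("s1", "s2"), ("s2", "s3"), ("s2", "s4"), ("s3", "s4")}
precision, recall = edge_precision_recall(extracted, reference)
print(precision, recall)  # 0.75 0.75
```

Edge-level scores are coarse: a single wrong edge can change which steps run in parallel, so a fuller evaluation would likely also check conditions on edges and the resulting execution order.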

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the potential engineering contribution of StepFly along with the value of our public code release. We address each major comment below with clarifications and commitments to revisions that strengthen verifiability while remaining faithful to the experiments performed.

  1. Referee: [Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.

    Authors: We agree that additional details will improve verifiability. In the revised manuscript we will expand the Empirical Evaluation section to explicitly define the baselines (naive LLM prompting, sequential non-DAG execution, and non-parallel scheduler), describe the exact composition and sourcing of the 92 TSGs plus associated incident logs, provide a categorized error analysis of the failure cases, and report statistical significance tests (e.g., McNemar’s test for success-rate differences and paired t-tests for time/token metrics). revision: yes

  2. Referee: [Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.

    Authors: We acknowledge that independent quantitative metrics for the extraction stage would strengthen claims of end-to-end robustness. Our original study did not produce human-annotated ground-truth DAGs or QPPs; validation occurred indirectly via online success. In revision we will add a qualitative assessment based on manual review of a 20-TSG sample, report observed accuracy on control-flow constructs and QPP generation, include representative extraction examples, and break down online failures attributable to preprocessing. A full precision/recall study against new human annotations was not performed and would require additional effort beyond the current work. revision: partial

  3. Referee: [Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.

    Authors: We will revise the Results section to clarify these points: parallelism is identified by detecting independent steps with no data or control dependencies via the DAG’s topological order and dependency graph; scheduler overhead was measured separately as the time for dependency checks and task queuing (observed to be <2 s per TSG on average); and the reported range applies to the subset of TSGs containing at least one parallelizable branch (we will state the exact count and also report mean reduction with standard deviation). revision: yes
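
The third response describes identifying parallelism from the DAG's dependency structure. A minimal sketch of that idea groups steps into levels, where every step in a level depends only on earlier levels, and then estimates the time saved if each level runs concurrently. The function names, step names, and durations are hypothetical, and the estimate is idealized: it ignores scheduler overhead and assumes each level waits for its slowest step.

```python
def parallel_levels(deps):
    """Group steps into levels with no dependencies inside a level.

    deps maps every step to the set of steps it depends on; each step
    must appear as a key, even if its dependency set is empty.
    """
    remaining = {step: set(pre) for step, pre in deps.items()}
    levels = []
    while remaining:
        level = sorted(step for step, pre in remaining.items() if not pre)
        if not level:
            raise ValueError("dependency cycle or unknown step detected")
        levels.append(level)
        for step in level:
            del remaining[step]
        for pre in remaining.values():
            pre.difference_update(level)
    return levels


def time_reduction(levels, duration):
    """Fraction of sequential time saved when each level runs concurrently."""
    sequential = sum(duration[s] for level in levels for s in level)
    parallel = sum(max(duration[s] for s in level) for level in levels)
    return 1 - parallel / sequential


# Hypothetical guide with one parallelizable branch (s2 and s3).
deps = {"s1": set(), "s2": {"s1"}, "s3": {"s1"}, "s4": {"s2", "s3"}}
duration = {"s1": 10, "s2": 30, "s3": 20, "s4": 10}  # seconds, made up
levels = parallel_levels(deps)
reduction = time_reduction(levels, duration)
print(levels)  # [['s1'], ['s2', 's3'], ['s4']]
```

With these made-up durations the sequential run takes 70 seconds and the level-parallel run 50 seconds, a reduction of about 29 percent. The figures reported in the paper come from measured executions, not from an estimate of this kind.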

standing simulated objections not resolved
  • Full quantitative precision/recall evaluation of DAG and QPP extraction against human-annotated ground truth was not conducted in the original study.

Circularity Check

0 steps flagged

No circularity: empirical system evaluation against external baselines

The paper describes an engineering artifact (StepFly) with a three-stage workflow for TSG automation and reports an empirical success rate of ~94% on real-world incidents and TSGs, measured against separate baselines. No equations, first-principles derivations, or predictions appear in the manuscript. The central performance claims rest on direct execution measurements rather than any quantity defined in terms of itself or fitted to the target metric. No load-bearing self-citations or ansatzes are invoked to justify the results; the evaluation is self-contained against external benchmarks and public code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that current LLMs can produce reliable structured DAGs and plugins from noisy TSG text, plus the engineering choice to treat extracted graphs as faithful representations of control flow. No explicit free parameters are named; invented components are the new tools and plugins introduced by the framework.

axioms (1)
  • domain assumption: LLMs can accurately interpret complex control flow and data-intensive queries in real-world TSGs to produce correct DAGs and QPPs
    Invoked in the offline preprocessing stage described in the abstract.
invented entities (2)
  • TSG Mentor · no independent evidence
    purpose: Assist SREs in improving TSG quality before automation
    New tool introduced in the first stage of the workflow.
  • Query Preparation Plugins (QPPs) · no independent evidence
    purpose: Handle data-intensive queries during execution
    Dedicated plugins created during offline preprocessing.

pith-pipeline@v0.9.0 · 5852 in / 1499 out tokens · 43619 ms · 2026-05-18T07:42:39.014994+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.

  2. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 5.0

    SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.

  3. ActionNex: A Virtual Outage Manager for Cloud Computing

    cs.AI 2026-04 unverdicted novelty 4.0

    ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 2 Pith papers

  1. [1]

    Langgraph: State machines for llm applications

    LangChain AI. Langgraph: State machines for llm applications. https://github.com/langchain-ai/langgraph, 2024. Accessed: 2025-05-30

  2. [2]

    Nissist: An incident mitigation copilot based on troubleshooting guides

    Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides, May 2024. arXiv:2402.17531

  3. [3]

    Empower your digital tasks with autogpt

    AutoGPT. Empower your digital tasks with autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2023. Accessed: 2025-05-30

  4. [4]

    Type systems

    Luca Cardelli. Type systems. ACM Computing Surveys (CSUR), 28(1):263–264, 1996

  5. [5]

    Automatic root cause analysis via large language models for cloud incidents

    Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xue-Chao Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo GHOSH, Xuchao Zhang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Automatic root cause analysis via large language models for cloud incidents. In EuroSys’24, April 2024

  6. [6]

    Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, Yingnong Dang, Feng Gao, Pu Zhao, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Michael R. Lyu. Towards intelligent incident management: why we need it and how we make it. ESEC/FSE 2020, New York, NY, USA, 2020. Association for Computing Machinery

  7. [7]

    Mapreduce: simplified data processing on large clusters

    Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008

  8. [8]

    Thinking in JAVA

    Bruce Eckel. Thinking in JAVA. Prentice Hall Professional, 2003

  9. [9]

    Metagpt: Meta programming for a multi-agent collaborative framework, 2024

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024

  10. [10]

    Dryad: distributed data-parallel programs from sequential building blocks

    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European conference on computer systems 2007, pages 59–72, 2007

  11. [11]

    How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems

    Jiajun Jiang, Weihai Lu, Junjie Chen, Qingwei Lin, Pu Zhao, Yu Kang, Hongyu Zhang, Yingfei Xiong, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Confer...

  12. [12]

    Xpert: Empowering incident management with query recommendations via large language models

    Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Xpert: Empowering incident management with query recommendations via large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, New York, NY, USA, 202...

  13. [13]

    Assess and summarize: Improve outage understanding with large language models

    Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Assess and summarize: Improve outage understanding with large language models. In Proceedings of the 31st ACM Joint European Software Engineering C...

  14. [14]

    Babilong: Testing the limits of llms with long context reasoning-in-a-haystack

    Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 10651...

  15. [15]

    LLexus: an AI agent system for incident management

    Pedro Las-Casas, Alok Gautum Kumbhare, Rodrigo Fonseca, and Sharad Agarwal. LLexus: an AI agent system for incident management. ACM SIGOPS Operating Systems Review, 58(1):23–36, August 2024

  16. [16]

    Long-context llms struggle with long in-context learning, 2024

    Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. Long-context llms struggle with long in-context learning, 2024. , Vol. 1, No. 1, Article . Publication date: October 2025. 20 Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, et al

  17. [17]

    A survey of text-to-sql in the era of llms: Where are we, and where are we going?, 2025

    Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. A survey of text-to-sql in the era of llms: Where are we, and where are we going?, 2025

  18. [18]

    Nl2sql-bugs: A benchmark for detecting semantic errors in nl2sql translation

    Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo. Nl2sql-bugs: A benchmark for detecting semantic errors in nl2sql translation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5662–5673, 2025

  19. [19]

    Bridging context gaps: Leveraging coreference resolution for long contextual understanding, 2025

    Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yanxin Shen, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, and Xuhong Zhang. Bridging context gaps: Leveraging coreference resolution for long contextual understanding, 2025

  20. [20]

    The rust language

    Nicholas D Matsakis and Felix S Klock. The rust language. In Proceedings of the 2014 ACM SIGAda annual conference on High integrity language technology, pages 103–104, 2014

  21. [21]

    Kusto query language

    Microsoft. Kusto query language. https://learn.microsoft.com/en-us/kusto/query, 2025. Accessed: 2025-05-30

  22. [22]

    The world’s leading modern database

    MongoDB. The world’s leading modern database. https://www.mongodb.com/, 2025. Accessed: 2025-05-30

  23. [23]

    Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis

    Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, and Dan Pei. Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis. In Companion Proceedings of the ACM on Web Conference 2025, WWW ’25, page 422–431, New York, NY, USA, 2025. Association for Comp...

  24. [24]

    Types and programming languages

    Benjamin C Pierce. Types and programming languages. MIT press, 2002

  25. [25]

    Taskweaver: A code-first agent framework, 2024

    Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Taskweaver: A code-first agent framework, 2024

  26. [26]

    Exploring llm-based agents for root cause analysis

    Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, page 208–219, New York, NY, USA, 2024. Association for Computing Machinery

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

  28. [28]

    Autotsg: learning and synthesis for incident troubleshooting

    Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta. Autotsg: learning and synthesis for incident troubleshooting. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, page 1477–1488, New York, NY, USA, 2022. Association...

  29. [29]

    Significant Gravitas. AutoGPT

  30. [30]

    An overview of c++

    Bjarne Stroustrup. An overview of c++. In Proceedings of the 1986 SIGPLAN workshop on Object-oriented programming, pages 7–18, 1986

  31. [31]

    Nl2kql: From natural language to kusto query

    Xinye Tang, Amir H. Abdi, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, and Ye Xing. Nl2kql: From natural language to kusto query. arXiv preprint arXiv:2404.02933, 2025

  32. [32]

    Groot: An event-graph-based approach for root cause analysis in industrial settings

    Hanzhang Wang, Zhengkai Wu, Huai Jiang, Yichao Huang, Jiamu Wang, Selcuk Kopru, and Tao Xie. Groot: An event-graph-based approach for root cause analysis in industrial settings. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021

  33. [33]

    How long will it take to mitigate this incident for online service systems? In ISSRE’21, 2021

    Weijing Wang, Junjie Chen, Lin Yang, Hongyu Zhang, Pu Zhao, Bo Qiao, Yu Kang, Qingwei Lin, Saravanakumar Rajmohan, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How long will it take to mitigate this incident for online service systems? In ISSRE’21, 2021

  34. [34]

    Executable code actions elicit better llm agents

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In ICML, 2024

  35. [35]

    Chain-of-thought reasoning without prompting, 2024

    Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting, 2024

  36. [36]

    Large language models can provide accurate and interpretable incident triage

    Zexin Wang, Jianhui Li, Minghua Ma, Ze Li, Yu Kang, Chaoyun Zhang, Chetan Bansal, Murali Chintalapati, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang, Changhua Pei, and Gaogang Xie. Large language models can provide accurate and interpretable incident triage. In 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pages 523–534, 2024

  37. [37]

    Chain-of-thought prompting elicits reasoning in large language models, 2023

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

  38. [38]

    Seda: An architecture for well-conditioned, scalable internet services

    Matt Welsh, David Culler, and Eric Brewer. Seda: An architecture for well-conditioned, scalable internet services. ACM SIGOPS operating systems review, 35(5):230–243, 2001

  39. [39]

    Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023

  40. [40]

    Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight, 2024

    Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, and Jonathan Mace. Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight, 2024

  41. [41]

    Aios compiler: Llm as interpreter for natural language programming and flow programming of ai agents, 2024

    Shuyuan Xu, Zelong Li, Kai Mei, and Yongfeng Zhang. CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents, May 2024. arXiv:2405.06907 version: 1

  42. [42]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  43. [43]

    AllHands: Ask Me Anything on Large-Scale Verbatim Feedback via Large Language Models

    Chaoyun Zhang, Zicheng Ma, Yuhao Wu, Shilin He, Si Qin, Minghua Ma, Xiaoting Qin, Yu Kang, Yuyi Liang, Xiaoyu Gou, Yajie Xue, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. AllHands: Ask Me Anything on Large-Scale Verbatim Feedback via Large Language Models . In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 43–57, ...

  44. [44]

    Flash: A workflow automation agent for diagnosing recurring incidents, 2024

    Xuchao Zhang, Tanish Mittal, Chetan Bansal, Rujia Wang, Minghua Ma, Zhixin Ren, Hao Huang, and Saravan Rajmohan. Flash: A workflow automation agent for diagnosing recurring incidents, 2024