StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

Chaoyun Zhang; Dongmei Zhang; Jiayi Mao; Liqun Li; Qingwei Lin; Samia Khalid; Saravan Rajmohan; Shilin He; Si Qin; Sitaram Lanka

arxiv: 2510.10074 · v2 · submitted 2025-10-11 · 💻 cs.AI

Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang

Pith reviewed 2026-05-18 07:42 UTC · model grok-4.3

classification 💻 cs.AI

keywords troubleshooting guides · incident management · agentic framework · LLM automation · execution DAG · parallel execution · query preparation · IT systems reliability

The pith

StepFly automates troubleshooting guide execution for IT incidents by converting guides into structured DAGs that support parallel steps, and it reports a success rate of about 94 percent with GPT-4.1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StepFly, an end-to-end agentic framework designed to replace slow and error-prone manual runs of troubleshooting guides in large-scale IT systems. It structures the process into three stages: a quality-improvement tool for engineers, offline conversion of text guides into executable DAGs plus supporting plugins, and an online executor that follows the DAG while allowing independent steps to run in parallel. A sympathetic reader would care because reliable automation here could shorten the time from incident detection to resolution without requiring constant human oversight. Real-world tests show the system outperforms simpler LLM approaches while using less time and fewer tokens overall.

Core claim

StepFly presents a three-stage agentic workflow for troubleshooting guide automation: the first stage supplies a TSG Mentor tool to help site reliability engineers improve guide quality; the second stage uses LLMs offline to extract structured execution DAGs from unstructured guides and to build dedicated Query Preparation Plugins; the third stage runs online via a DAG-guided scheduler-executor equipped with a memory system that enforces correct ordering and enables parallel execution of independent steps, yielding approximately 94 percent success on GPT-4.1 together with lower time and token costs than baselines.

What carries the argument

A DAG-guided scheduler-executor with memory system that enforces workflow correctness and permits parallel runs of independent steps extracted from the original guide.
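To make the scheduling idea concrete, here is a minimal sketch of a dependency-driven executor. It is not StepFly's implementation: the step names, the `run_dag` helper, and the thread-pool design are all assumptions made for illustration, and conditional edges and the memory system are left out. It uses only the Python standard library (`graphlib` requires Python 3.9 or later).

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, actions, max_workers=4):
    """Run each step as soon as all of its prerequisites have finished.

    dependencies: dict mapping step name -> set of prerequisite step names
    actions:      dict mapping step name -> zero-argument callable
    Returns the step names in the order in which they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    completed = []
    in_flight = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose prerequisites are all done.
            for step in sorter.get_ready():
                in_flight[pool.submit(actions[step])] = step
            # Block until at least one running step finishes.
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any exception from the step
                step = in_flight.pop(future)
                completed.append(step)
                sorter.done(step)  # may unlock downstream steps
    return completed


# Hypothetical four-step guide: two independent queries sit between a
# triage step and a summary step.
dependencies = {
    "check_alert": set(),
    "query_errors": {"check_alert"},
    "query_latency": {"check_alert"},
    "summarize": {"query_errors", "query_latency"},
}
actions = {step: (lambda: None) for step in dependencies}
completion_order = run_dag(dependencies, actions)
```

In this sketch `query_errors` and `query_latency` are submitted together once `check_alert` finishes, and `summarize` cannot start until both are done, which is the ordering guarantee the DAG is meant to enforce.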

If this is right

Troubleshooting guides become executable with far less manual intervention once converted to DAG form.
Independent steps identified in the DAG can run simultaneously, cutting execution time by 32.9 to 70.4 percent for parallelizable TSGs.
The framework handles complex control flow and data-heavy queries through its dedicated plugins.
Built-in quality tools let engineers fix common problems in existing guides before automation begins.
Reduced token usage and faster runs make repeated executions of the same guides more practical at scale.
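The size of the time saving depends on the shape of the DAG. A rough way to reason about it, assuming unlimited workers and no scheduler overhead, is to compare the sum of all step durations (sequential execution) with the longest dependency chain (parallel execution). The step names and durations below are invented for illustration and are not measurements from the paper.

```python
from graphlib import TopologicalSorter


def critical_path_seconds(dependencies, durations):
    """Length of the longest dependency chain, in seconds."""
    finish = {}
    # static_order() yields every step after all of its prerequisites.
    for step in TopologicalSorter(dependencies).static_order():
        start = max((finish[p] for p in dependencies.get(step, ())), default=0)
        finish[step] = start + durations[step]
    return max(finish.values())


# Hypothetical guide and made-up durations (seconds).
dependencies = {
    "check_alert": set(),
    "query_errors": {"check_alert"},
    "query_latency": {"check_alert"},
    "summarize": {"query_errors", "query_latency"},
}
durations = {"check_alert": 10, "query_errors": 40, "query_latency": 30, "summarize": 20}

sequential = sum(durations.values())                       # 100
parallel = critical_path_seconds(dependencies, durations)  # 10 + 40 + 20 = 70
reduction_percent = round(100 * (1 - parallel / sequential), 1)  # 30.0
```

A guide that is a single chain has a critical path equal to the sequential time, so it gains nothing; the saving grows as more of the total work sits on branches that do not depend on each other.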

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same preprocessing and scheduling pattern could apply to other domains that rely on written procedural checklists, such as network configuration or software deployment scripts.
Over time the memory system might accumulate patterns from past runs to suggest better parallel groupings without human input.
As base models improve, the success rate on guides with especially intricate logic could rise further without changes to the overall architecture.
Production use would likely shift SRE effort away from step-by-step execution toward reviewing exceptions and updating the underlying guides.

Load-bearing premise

Large language models can accurately turn unstructured real-world troubleshooting guides into correct execution DAGs and working Query Preparation Plugins during the offline stage.

What would settle it

A case in which the extracted DAG produces the wrong step order or causes a data query to fail on a known incident would show that the offline preprocessing step is unreliable.
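Some extraction failures can be caught mechanically before any incident is run. The sketch below is a hypothetical validator, not part of StepFly: it flags edges that point at steps the guide does not define and reports cycles. It cannot tell whether a structurally valid order matches the guide's intent, which still needs a human or an execution test.

```python
from graphlib import CycleError, TopologicalSorter


def validate_step_graph(steps, edges):
    """Return a list of problems found; an empty list means none were detected.

    steps: iterable of step names defined in the guide
    edges: iterable of (source, target) pairs meaning "target runs after source"
    """
    problems = []
    known = set(steps)
    for src, dst in edges:
        for name in (src, dst):
            if name not in known:
                problems.append(f"edge {src} -> {dst} references unknown step {name!r}")
    # Build the predecessor map from the edges that are well-formed.
    predecessors = {step: set() for step in known}
    for src, dst in edges:
        if src in known and dst in known:
            predecessors[dst].add(src)
    try:
        TopologicalSorter(predecessors).prepare()
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        problems.append(f"cycle detected: {exc.args[1]}")
    return problems


ok = validate_step_graph(["s1", "s2", "s3"], [("s1", "s2"), ("s2", "s3")])
dangling = validate_step_graph(["s1", "s2"], [("s1", "s9")])
cyclic = validate_step_graph(["s1", "s2"], [("s1", "s2"), ("s2", "s1")])
```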

Figures

Figures reproduced from arXiv: 2510.10074 by Chaoyun Zhang, Dongmei Zhang, Jiayi Mao, Liqun Li, Qingwei Lin, Samia Khalid, Saravan Rajmohan, Shilin He, Si Qin, Sitaram Lanka, Yanjie Gao, Zegang Peng.

**Figure 1.** Workflow of a real TSG for diagnosing availability incidents of a large online service. Typically, each step within a TSG includes: (1) a concise title summarizing the action; (2) a detailed description, which may include instructions, commands, queries, etc.; (3) a specification of expected outcomes; and (4) flow control logic that dictates the subsequent step based on the observed outcomes. A TSG termin…

**Figure 2.** Statistics on TSG characteristics: (a) Token count distribution, (b) Step count distribution, (c) Tool …

**Figure 3.** TSG Issue Distribution. Each major issue category is further decomposed into subcategories. For example, the Data Flow Issues (DF) category includes sub-issues such as “Unknown Input Source”, “Wrong Input Source”, and “Missing Parameters”; the Control Flow Issues (CF) group includes sub-issues such as “Unable To Infer Next Step” and “Wrong Next Step”. The detailed issue taxonomy is omitted due to space c…

**Figure 4.** The Proposed Approach. Motivated by Finding 4, we provide detailed guidance on writing high-quality TSGs, based on our empirical study in Section 3.3. We introduce a tool called TSG Mentor to help SREs identify quality issues in their TSGs. The specifics are discussed in Section 4.1. We preprocess TSGs by creating an execution DAG, which serves as a structural blueprint of the workflow, with steps as nodes …

**Figure 5.** The execution DAG of the example TSG shown in Fig. 1. The conditional edges are associated with the …

**Figure 6.** The process of running a log query. A critical observation is that these queries are structured as templates, utilizing either explicit placeholders (e.g., “where DeployRing == ‘{ring}’ ”) or implicit ones that could be inferred from context (e.g., “replace the StartTime to the incident’s start time when using the query”). Leveraging this characteristic, we propose using an LLM to extract these templates …

**Figure 7.** Cross-plugin data flow via memory. The memory architecture follows a key-value paradigm where string-based keys provide unique data identification and values accommodate arbitrary data types, ranging from primitive types (strings, numbers, lists, dictionaries) to complex structured data (DataFrames, ndarrays). Our implementation leverages MongoDB [22] as the underlying storage backend, because it support…

**Figure 8.** The execution DAG of the TSG in Fig. 1 after parallelization. The steps that can run concurrently are …

**Figure 9.** Comparison of time (left) and token consumption (right) …

**Figure 10.** Time consumption comparison between sequential execution and parallelized execution with different …
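The captions for Figures 6 and 7 describe two mechanisms that work together: queries stored as templates with placeholders, and a key-value memory that passes data between plugins. The following is a simplified sketch of that interaction under stated assumptions. The `Memory` class is an in-process dictionary standing in for the MongoDB backend the paper mentions, and the table name `Requests`, the column `Timestamp`, and the memory keys are invented; only the `DeployRing == '{ring}'` fragment comes from the caption.

```python
import string


class Memory:
    """Toy key-value store; a stand-in for a real storage backend."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]  # raises KeyError if an input is missing


def template_fields(template):
    """Names of the explicit {placeholders} that appear in a template."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def prepare_query(template, memory):
    """Fill every placeholder in the template from memory."""
    values = {name: memory.get(name) for name in template_fields(template)}
    return template.format(**values)


# Hypothetical template and inputs.
template = (
    "Requests | where DeployRing == '{ring}' "
    "| where Timestamp >= datetime({start_time})"
)
memory = Memory()
memory.put("ring", "Prod")
memory.put("start_time", "2025-01-01T00:00:00Z")

query = prepare_query(template, memory)
memory.put("step2.query", query)  # a later plugin can read this key
```

Failing loudly on a missing key mirrors the "Unknown Input Source" and "Missing Parameters" issue types named in Figure 3. Plain string substitution does no escaping, so a production version would need to validate or parameterize values before sending a query to a log store.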

Original abstract

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

StepFly gives a workable three-stage pipeline that uses LLMs to extract DAGs and query plugins from real TSGs, then runs them with a memory system and parallel execution; the ~94% success rate on GPT-4.1 and the time savings on parallelizable TSGs are the main concrete results.

StepFly is a practical engineering system for turning messy troubleshooting guides into automated workflows. It starts with a TSG Mentor tool to help fix guide quality, moves to an offline LLM step that builds execution DAGs and Query Preparation Plugins, and finishes with an online scheduler that uses memory to keep state and run independent steps in parallel. The empirical study on 92 real-world TSGs and the public code are the parts that give it weight beyond a prompt-only baseline.

The reported 94% success rate on GPT-4.1 and the 32.9-70.4% time cuts for parallelizable cases line up with the kind of gains operations teams would actually notice. The DAG-guided executor with memory looks like a sensible way to handle control flow and data queries that plain LLM calls often botch.

The main soft spot is the offline extraction stage. The paper does not report separate accuracy numbers for how well the LLMs recover correct DAGs or working QPPs from unstructured text, nor does it break down failures on control-flow constructs or compare against human-annotated versions. Without that, the end-to-end success rate could be driven by easier TSGs or post-hoc adjustments rather than robust conversion on arbitrary guides. Baseline details are also thin in the abstract.

This is aimed at applied researchers and engineers working on agentic tools for incident response and SRE practice. Readers who need concrete pipelines and open implementations for workflow automation will get usable ideas from it. It has enough grounding in real data and released code to deserve a serious referee, though reviewers will likely press for extraction fidelity metrics and fuller experimental reporting.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces StepFly, a three-stage agentic framework for automating troubleshooting guide (TSG) execution in large-scale IT incident management. It reports an empirical study on 92 real-world TSGs to identify challenges, introduces a TSG Mentor tool for quality improvement, performs offline LLM-based extraction of structured execution DAGs and Query Preparation Plugins (QPPs), and executes online via a DAG-guided scheduler-executor with memory support for correct workflow and parallel execution of independent steps. The central empirical claim is a ~94% success rate on GPT-4.1, outperforming baselines with lower time and token consumption, plus execution time reductions of 32.9% to 70.4% for parallelizable TSGs.

Significance. If the performance claims hold under more rigorous validation, this work would represent a meaningful engineering contribution to LLM-based automation of structured operational workflows, specifically addressing TSG quality issues, control-flow interpretation, data-intensive queries, and parallelism exploitation. The public release of code and sample data at the provided GitHub link is a clear strength that supports reproducibility and follow-on research in AI for systems operations.

major comments (3)

[Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.
[Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.
[Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.
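The extraction-fidelity metric requested in the second comment could take the form of edge-level precision and recall between an LLM-extracted DAG and a human-annotated one. The sketch below shows one such computation. The edge sets are invented, and the paper is not known to report this metric.

```python
def edge_precision_recall(extracted_edges, reference_edges):
    """Compare two DAGs as sets of (source, target) edges.

    precision: fraction of extracted edges that the annotator also drew
    recall:    fraction of annotated edges that the extractor recovered
    """
    extracted, reference = set(extracted_edges), set(reference_edges)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical annotated DAG: step 2 branches to steps 3 and 4, which rejoin at 5.
reference = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
# Hypothetical extraction that flattened the branch into a chain.
extracted = [(1, 2), (2, 3), (3, 4), (4, 5)]

precision, recall = edge_precision_recall(extracted, reference)  # (0.75, 0.6)
```

Here three of the four extracted edges are correct, but two of the five annotated edges were missed, which is the pattern one would expect when a model linearizes a branching guide. Edge matching assumes both DAGs use the same step identifiers; aligning steps across the two graphs is a separate problem.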

minor comments (2)

[Abstract] Abstract: the model is referred to as 'GPT-4.1'; state the exact model snapshot or deployment version used in all experiments.
[TSG Mentor] The TSG Mentor description would be strengthened by one or two concrete before/after examples of quality improvements it enables for SREs.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the potential engineering contribution of StepFly along with the value of our public code release. We address each major comment below with clarifications and commitments to revisions that strengthen verifiability while remaining faithful to the experiments performed.

Referee: [Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.

Authors: We agree that additional details will improve verifiability. In the revised manuscript we will expand the Empirical Evaluation section to explicitly define the baselines (naive LLM prompting, sequential non-DAG execution, and non-parallel scheduler), describe the exact composition and sourcing of the 92 TSGs plus associated incident logs, provide a categorized error analysis of the failure cases, and report statistical significance tests (e.g., McNemar’s test for success-rate differences and paired t-tests for time/token metrics). revision: yes

Referee: [Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.

Authors: We acknowledge that independent quantitative metrics for the extraction stage would strengthen claims of end-to-end robustness. Our original study did not produce human-annotated ground-truth DAGs or QPPs; validation occurred indirectly via online success. In revision we will add a qualitative assessment based on manual review of a 20-TSG sample, report observed accuracy on control-flow constructs and QPP generation, include representative extraction examples, and break down online failures attributable to preprocessing. A full precision/recall study against new human annotations was not performed and would require additional effort beyond the current work. revision: partial

Referee: [Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.

Authors: We will revise the Results section to clarify these points: parallelism is identified by detecting independent steps with no data or control dependencies via the DAG’s topological order and dependency graph; scheduler overhead was measured separately as the time for dependency checks and task queuing (observed to be <2 s per TSG on average); and the reported range applies to the subset of TSGs containing at least one parallelizable branch (we will state the exact count and also report mean reduction with standard deviation). revision: yes
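One simple way to realize the dependency-based identification described in the last response is to peel the DAG into levels, where every step in a level has all of its prerequisites in earlier levels. The sketch below assumes data and control dependencies have already been merged into one `dependencies` map; the step names are hypothetical, and this is not claimed to be the paper's algorithm.

```python
from graphlib import TopologicalSorter


def parallel_groups(dependencies):
    """Split a DAG into successive groups of steps that can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        groups.append(ready)
        sorter.done(*ready)  # mark the whole level finished
    return groups


dependencies = {
    "check_alert": set(),
    "query_errors": {"check_alert"},
    "query_latency": {"check_alert"},
    "summarize": {"query_errors", "query_latency"},
}
groups = parallel_groups(dependencies)
# A TSG counts as parallelizable here if any level holds more than one step.
is_parallelizable = any(len(group) > 1 for group in groups)
```

Level grouping is conservative: it waits for a whole level to finish before starting the next, whereas a scheduler that releases each step as soon as its own prerequisites complete can overlap more work.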

standing simulated objections not resolved

Full quantitative precision/recall evaluation of DAG and QPP extraction against human-annotated ground truth was not conducted in the original study.

Circularity Check

0 steps flagged

No circularity: empirical system evaluation against external baselines

The paper describes an engineering artifact (StepFly) with a three-stage workflow for TSG automation and reports an empirical success rate of ~94% on real-world incidents and TSGs, measured against separate baselines. No equations, first-principles derivations, or predictions appear in the manuscript. The central performance claims rest on direct execution measurements rather than any quantity defined in terms of itself or fitted to the target metric. No load-bearing self-citations or ansatzes are invoked to justify the results; the evaluation is self-contained against external benchmarks and public code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that current LLMs can produce reliable structured DAGs and plugins from noisy TSG text, plus the engineering choice to treat extracted graphs as faithful representations of control flow. No explicit free parameters are named; invented components are the new tools and plugins introduced by the framework.

axioms (1)

domain assumption: LLMs can accurately interpret complex control flow and data-intensive queries in real-world TSGs to produce correct DAGs and QPPs.
Invoked in the offline preprocessing stage described in the abstract.

invented entities (2)

TSG Mentor (no independent evidence)
purpose: Assist SREs in improving TSG quality before automation
New tool introduced in the first stage of the workflow.

Query Preparation Plugins (QPPs) (no independent evidence)
purpose: Handle data-intensive queries during execution
Dedicated plugins created during offline preprocessing.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs)

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
cs.AI · 2026-05 · unverdicted · novelty 7.0
SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
cs.AI · 2026-05 · unverdicted · novelty 5.0
SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.

ActionNex: A Virtual Outage Manager for Cloud Computing
cs.AI · 2026-04 · unverdicted · novelty 4.0
ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 2 Pith papers

[1]

Langgraph: State machines for llm applications

LangChain AI. Langgraph: State machines for llm applications. https://github.com/langchain-ai/langgraph, 2024. Accessed: 2025-05-30

work page 2024
[2]

Nissist: An incident mitigation copi- lot based on troubleshooting guides

Kaikai An, Fangkai Yang, Junting Lu, Liqun Li, Zhixing Ren, Hao Huang, Lu Wang, Pu Zhao, Yu Kang, Hua Ding, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides, May 2024. arXiv:2402.17531

work page arXiv 2024
[3]

Empower your digital tasks with autogpt

AutoGPT. Empower your digital tasks with autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2023. Accessed: 2025-05-30

work page 2023
[4]

Type systems

Luca Cardelli. Type systems. ACM Computing Surveys (CSUR), 28(1):263–264, 1996

work page 1996
[5]

Automatic root cause analysis via large language models for cloud incidents

Yinfang Chen, Huaibing Xie, Minghua Ma, Yu Kang, Xin Gao, Liu Shi, Yunjie Cao, Xue-Chao Gao, Hao Fan, Ming Wen, Jun Zeng, Supriyo GHOSH, Xuchao Zhang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Automatic root cause analysis via large language models for cloud incidents. In EuroSys’24, April 2024

work page 2024
[6]

Zhuangbin Chen, Yu Kang, Liqun Li, Xu Zhang, Hongyu Zhang, Hui Xu, Yangfan Zhou, Li Yang, Jeffrey Sun, Zhangwei Xu, Yingnong Dang, Feng Gao, Pu Zhao, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Michael R. Lyu. Towards intelligent incident management: why we need it and how we make it. ESEC/FSE 2020, New York, NY, USA, 2020. Association for Computing Machinery

work page 2020
[7]

Mapreduce: simplified data processing on large clusters

Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008

work page 2008
[8]

Thinking in JAVA

Bruce Eckel. Thinking in JAVA. Prentice Hall Professional, 2003

work page 2003
[9]

Metagpt: Meta programming for a multi-agent collaborative framework, 2024

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024

work page 2024
[10]

Dryad: distributed data-parallel programs from sequential building blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European conference on computer systems 2007, pages 59–72, 2007

work page 2007
[11]

How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems

Jiajun Jiang, Weihai Lu, Junjie Chen, Qingwei Lin, Pu Zhao, Yu Kang, Hongyu Zhang, Yingfei Xiong, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Confer...

work page 2020
[12]

Xpert: Empowering incident management with query recommendations via large language models

Yuxuan Jiang, Chaoyun Zhang, Shilin He, Zhihao Yang, Minghua Ma, Si Qin, Yu Kang, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Xpert: Empowering incident management with query recommendations via large language models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ICSE ’24, New York, NY, USA, 202...

work page 2024
[13]

Assess and summarize: Improve outage understanding with large language models

Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, Shilin He, Federica Sarro, Yingnong Dang, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Assess and summarize: Improve outage understanding with large language models. In Proceedings of the 31st ACM Joint European Software Engineering C...

work page 2023
[14]

Babilong: Testing the limits of llms with long context reasoning-in-a-haystack

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 10651...

work page 2024
[15]

LLexus: an AI agent system for incident management

Pedro Las-Casas, Alok Gautum Kumbhare, Rodrigo Fonseca, and Sharad Agarwal. LLexus: an AI agent system for incident management. ACM SIGOPS Operating Systems Review, 58(1):23–36, August 2024

work page 2024
[16]

Long-context llms struggle with long in-context learning, 2024

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. Long-context llms struggle with long in-context learning, 2024. , Vol. 1, No. 1, Article . Publication date: October 2025. 20 Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, et al

work page 2024
[17]

A survey of text-to-sql in the era of llms: Where are we, and where are we going?, 2025

Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. A survey of text-to-sql in the era of llms: Where are we, and where are we going?, 2025

work page 2025
[18]

Nl2sql-bugs: A benchmark for detecting semantic errors in nl2sql translation

Xinyu Liu, Shuyu Shen, Boyan Li, Nan Tang, and Yuyu Luo. Nl2sql-bugs: A benchmark for detecting semantic errors in nl2sql translation. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5662–5673, 2025

work page 2025
[19]

Bridging context gaps: Leveraging coreference resolution for long contextual understanding, 2025

Yanming Liu, Xinyue Peng, Jiannan Cao, Shi Bo, Yanxin Shen, Tianyu Du, Sheng Cheng, Xun Wang, Jianwei Yin, and Xuhong Zhang. Bridging context gaps: Leveraging coreference resolution for long contextual understanding, 2025

work page 2025
[20]

The rust language

Nicholas D Matsakis and Felix S Klock. The rust language. InProceedings of the 2014 ACM SIGAda annual conference on High integrity language technology, pages 103–104, 2014

work page 2014
[21]

Kusto query language

Microsoft. Kusto query language. https://learn.microsoft.com/en-us/kusto/query, 2025. Accessed: 2025-05-30

work page 2025
[22]

The world’s leading morden database

MongoDB. The world’s leading morden database. https://www.mongodb.com/, 2025. Accessed: 2025-05-30

work page 2025
[23]

Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis

Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, and Dan Pei. Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis. In Companion Proceedings of the ACM on Web Conference 2025, WWW ’25, page 422–431, New York, NY, USA, 2025. Association for Comp...

work page 2025
[24]

Types and programming languages

Benjamin C Pierce. Types and programming languages. MIT press, 2002

work page 2002
[25]

Taskweaver: A code-first agent framework, 2024

Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Taskweaver: A code-first agent framework, 2024

work page 2024
[26]

Exploring llm-based agents for root cause analysis

Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca, and Saravan Rajmo- han. Exploring llm-based agents for root cause analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, FSE 2024, page 208–219, New York, NY, USA, 2024. Associa- tion for Computing Machinery

work page 2024
[27]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023
[28]

Autotsg: learning and synthesis for incident troubleshooting

Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, and Anurag Gupta. Autotsg: learning and synthesis for incident troubleshooting. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, page 1477–1488, New York, NY, USA, 2022. Association...

work page 2022
[29]

Significant Gravitas. AutoGPT

work page
[30]

An overview of c++

Bjarne Stroustrup. An overview of c++. In Proceedings of the 1986 SIGPLAN workshop on Object-oriented programming, pages 7–18, 1986

work page 1986
[31]

Abdi, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, and Ye Xing

Xinye Tang, Amir H. Abdi, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, and Ye Xing. Nl2kql: From natural language to kusto query. arXiv preprint arXiv:2404.02933, 2025

work page arXiv 2025
[32]

Groot: An event-graph-based approach for root cause analysis in industrial settings

Hanzhang Wang, Zhengkai Wu, Huai Jiang, Yichao Huang, Jiamu Wang, Selcuk Kopru, and Tao Xie. Groot: An event-graph-based approach for root cause analysis in industrial settings. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021

[33]

How long will it take to mitigate this incident for online service systems?

Weijing Wang, Junjie Chen, Lin Yang, Hongyu Zhang, Pu Zhao, Bo Qiao, Yu Kang, Qingwei Lin, Saravanakumar Rajmohan, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. How long will it take to mitigate this incident for online service systems? In ISSRE’21, 2021

[34]

Executable code actions elicit better llm agents

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents. In ICML, 2024

[35]

Chain-of-thought reasoning without prompting

Xuezhi Wang and Denny Zhou. Chain-of-thought reasoning without prompting, 2024

[36]

Large language models can provide accurate and interpretable incident triage

Zexin Wang, Jianhui Li, Minghua Ma, Ze Li, Yu Kang, Chaoyun Zhang, Chetan Bansal, Murali Chintalapati, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang, Changhua Pei, and Gaogang Xie. Large language models can provide accurate and interpretable incident triage. In 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pages 523–534, 2024

[37]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

[38]

Seda: An architecture for well-conditioned, scalable internet services

Matt Welsh, David Culler, and Eric Brewer. Seda: An architecture for well-conditioned, scalable internet services. ACM SIGOPS operating systems review, 35(5):230–243, 2001

[39]

Autogen: Enabling next-gen llm applications via multi-agent conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023

[40]

Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight

Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, and Jonathan Mace. Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight, 2024

[41]

Aios compiler: Llm as interpreter for natural language programming and flow programming of ai agents

Shuyuan Xu, Zelong Li, Kai Mei, and Yongfeng Zhang. CoRE: LLM as Interpreter for Natural Language Programming, Pseudo-Code Programming, and Flow Programming of AI Agents, May 2024. arXiv:2405.06907 version: 1

[42]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

[43]

AllHands: Ask Me Anything on Large-Scale Verbatim Feedback via Large Language Models

Chaoyun Zhang, Zicheng Ma, Yuhao Wu, Shilin He, Si Qin, Minghua Ma, Xiaoting Qin, Yu Kang, Yuyi Liang, Xiaoyu Gou, Yajie Xue, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. AllHands: Ask Me Anything on Large-Scale Verbatim Feedback via Large Language Models. In 2025 IEEE 41st International Conference on Data Engineering (ICDE), pages 43–57, ...

[44]

Flash: A workflow automation agent for diagnosing recurring incidents

Xuchao Zhang, Tanish Mittal, Chetan Bansal, Rujia Wang, Minghua Ma, Zhixin Ren, Hao Huang, and Saravan Rajmohan. Flash: A workflow automation agent for diagnosing recurring incidents, 2024

