StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
Pith reviewed 2026-05-18 07:42 UTC · model grok-4.3
The pith
StepFly automates troubleshooting guide execution for IT incidents by converting guides into structured DAGs that support parallel steps, reaching a reported success rate of about 94 percent on GPT-4.1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StepFly presents a three-stage agentic workflow for troubleshooting guide automation. The first stage supplies a TSG Mentor tool that helps site reliability engineers improve guide quality. The second stage uses LLMs offline to extract structured execution DAGs from unstructured guides and to build dedicated Query Preparation Plugins. The third stage runs online through a DAG-guided scheduler-executor with a memory system that enforces correct ordering and runs independent steps in parallel. The paper reports approximately 94 percent success on GPT-4.1, with lower time and token costs than the baselines.
What carries the argument
A DAG-guided scheduler-executor with a memory system that enforces workflow correctness and permits parallel runs of independent steps extracted from the original guide.
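To make the scheduler-executor idea concrete, here is a minimal sketch of DAG-guided execution with a shared memory of step results. It is not StepFly's implementation: the function `run_dag`, the step names, and the dictionary used as "memory" are all hypothetical, and the sketch relies only on Python's standard `graphlib` and `concurrent.futures` modules.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(deps, actions, max_workers=4):
    """Run steps in dependency order, executing independent steps concurrently.

    deps:    step name -> set of prerequisite step names (hypothetical format)
    actions: step name -> callable taking a snapshot of the memory dict
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    memory = {}       # results of finished steps, keyed by step name
    order = []        # completion order, kept for inspection
    running = {}      # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every step whose prerequisites are done is submitted right away.
            for step in sorter.get_ready():
                running[pool.submit(actions[step], dict(memory))] = step
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                step = running.pop(future)
                memory[step] = future.result()
                order.append(step)
                sorter.done(step)  # unlocks the steps that depend on this one
    return memory, order


# Illustrative guide: two independent checks between a fetch and a summary.
deps = {
    "check_cpu": {"get_incident"},
    "check_disk": {"get_incident"},
    "summarize": {"check_cpu", "check_disk"},
}
actions = {
    "get_incident": lambda mem: {"host": "web-01"},
    "check_cpu": lambda mem: f"cpu ok on {mem['get_incident']['host']}",
    "check_disk": lambda mem: f"disk ok on {mem['get_incident']['host']}",
    "summarize": lambda mem: [mem["check_cpu"], mem["check_disk"]],
}
memory, order = run_dag(deps, actions)
```

In this sketch the two checks are submitted together once `get_incident` finishes, while `summarize` waits for both. Ordering is enforced by the dependency sets rather than by the order in which steps are listed.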
If this is right
- Troubleshooting guides become executable with far less manual intervention once converted to DAG form.
- Independent steps identified in the DAG can run simultaneously, cutting total execution time by 32.9 to 70.4 percent on parallelizable guides.
- The framework handles complex control flow and data-heavy queries through its dedicated plugins.
- Built-in quality tools let engineers fix common problems in existing guides before automation begins.
- Reduced token usage and faster runs make repeated executions of the same guides more practical at scale.
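The execution-time bullet above can be illustrated with a small calculation. Assuming unlimited workers and no scheduler overhead (both simplifications), the best-case parallel time is the length of the longest dependency chain, while sequential time is the sum of all step durations. The step names and durations below are invented for illustration and are not measurements from the paper.

```python
from graphlib import TopologicalSorter


def critical_path_seconds(deps, duration):
    """Length of the longest dependency chain, in seconds."""
    finish = {}
    for step in TopologicalSorter(deps).static_order():
        # A step starts once its slowest prerequisite has finished.
        start = max((finish[p] for p in deps.get(step, ())), default=0.0)
        finish[step] = start + duration[step]
    return max(finish.values())


def time_reduction(deps, duration):
    """Fraction of sequential time saved in the idealized parallel run."""
    sequential = sum(duration.values())
    parallel = critical_path_seconds(deps, duration)
    return 1.0 - parallel / sequential


# Hypothetical guide: steps b and c are independent of each other.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
duration = {"a": 10.0, "b": 30.0, "c": 30.0, "d": 10.0}
reduction = time_reduction(deps, duration)  # 1 - 50/80 = 0.375
```

Here the sequential run takes 80 seconds and the critical path takes 50 seconds, so the idealized saving is 37.5 percent. Real savings would be lower once scheduling overhead and worker limits are included.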
Where Pith is reading between the lines
- The same preprocessing and scheduling pattern could apply to other domains that rely on written procedural checklists, such as network configuration or software deployment scripts.
- Over time the memory system might accumulate patterns from past runs to suggest better parallel groupings without human input.
- As base models improve, the success rate on guides with especially intricate logic could rise further without changes to the overall architecture.
- Production use would likely shift SRE effort away from step-by-step execution toward reviewing exceptions and updating the underlying guides.
Load-bearing premise
Large language models can accurately turn unstructured real-world troubleshooting guides into correct execution DAGs and working Query Preparation Plugins during the offline stage.
What would settle it
A case in which the extracted DAG produces the wrong step order or causes a data query to fail on a known incident would show that the offline preprocessing step is unreliable.
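A check of this kind could be sketched as follows, assuming a human-confirmed dependency map and a recorded execution trace are available. The function names and the example steps are hypothetical; the sketch only tests that the graph is acyclic and that no step ran before one of its prerequisites.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(deps):
    """True if the dependency map describes a valid DAG."""
    try:
        tuple(TopologicalSorter(deps).static_order())
        return True
    except CycleError:
        return False


def order_violations(deps, trace):
    """Return (prerequisite, step) pairs where the step ran too early."""
    position = {step: i for i, step in enumerate(trace)}
    violations = []
    for step, prereqs in deps.items():
        if step not in position:
            continue  # step never ran, so it cannot have run too early
        for prereq in prereqs:
            if prereq not in position or position[prereq] > position[step]:
                violations.append((prereq, step))
    return sorted(violations)


# Hypothetical expected dependencies for a known incident.
expected = {"diagnose": {"collect_logs"}, "restart": {"diagnose"}}
good_trace = ["collect_logs", "diagnose", "restart"]
bad_trace = ["diagnose", "collect_logs", "restart"]

good_result = order_violations(expected, good_trace)
bad_result = order_violations(expected, bad_trace)
```

A non-empty result for a trace produced from an extracted DAG would be direct evidence of the ordering failure described above.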
Original abstract
Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces StepFly, a three-stage agentic framework for automating troubleshooting guide (TSG) execution in large-scale IT incident management. It reports an empirical study on 92 real-world TSGs to identify challenges, introduces a TSG Mentor tool for quality improvement, performs offline LLM-based extraction of structured execution DAGs and Query Preparation Plugins (QPPs), and executes online via a DAG-guided scheduler-executor with memory support for correct workflow and parallel execution of independent steps. The central empirical claim is a ~94% success rate on GPT-4.1, outperforming baselines with lower time and token consumption, plus execution time reductions of 32.9% to 70.4% for parallelizable TSGs.
Significance. If the performance claims hold under more rigorous validation, this work would represent a meaningful engineering contribution to LLM-based automation of structured operational workflows, specifically addressing TSG quality issues, control-flow interpretation, data-intensive queries, and parallelism exploitation. The public release of code and sample data at the provided GitHub link is a clear strength that supports reproducibility and follow-on research in AI for systems operations.
major comments (3)
- [Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.
- [Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.
- [Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.
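One way to run the significance testing requested in the first major comment is an exact McNemar test on paired per-incident outcomes. The sketch below is an illustration, not the authors' analysis: the function name is hypothetical and the outcome lists are made up.

```python
from math import comb


def mcnemar_exact(system_ok, baseline_ok):
    """Exact two-sided McNemar test on paired success/failure outcomes.

    Returns (b, c, p_value), where b counts incidents only the system solved
    and c counts incidents only the baseline solved.
    """
    pairs = list(zip(system_ok, baseline_ok))
    b = sum(1 for s, t in pairs if s and not t)
    c = sum(1 for s, t in pairs if t and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Binomial tail with p = 0.5 over the discordant pairs, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up outcomes for 10 incidents: the baseline fails 6 the system solves.
system_ok = [True] * 10
baseline_ok = [True] * 4 + [False] * 6
b, c, p_value = mcnemar_exact(system_ok, baseline_ok)  # p = 2 / 64
```

Only the discordant pairs carry information in this test, so reporting `b` and `c` alongside the p-value lets readers see how many incidents actually separate the two systems.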
minor comments (2)
- [Abstract] Abstract: the model is referred to as 'GPT-4.1'; clarify the precise identifier (e.g., GPT-4o, GPT-4 Turbo) and version used in all experiments.
- [TSG Mentor] The TSG Mentor description would be strengthened by one or two concrete before/after examples of quality improvements it enables for SREs.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential engineering contribution of StepFly along with the value of our public code release. We address each major comment below with clarifications and commitments to revisions that strengthen verifiability while remaining faithful to the experiments performed.
- Referee: [Empirical Evaluation] Empirical Evaluation section: the ~94% success rate claim on GPT-4.1 and outperformance over baselines lacks visible details on baseline definitions, exact dataset composition beyond the 92-TSG study, error analysis, or statistical significance testing. This directly affects verifiability of the central performance result.
Authors: We agree that additional details will improve verifiability. In the revised manuscript we will expand the Empirical Evaluation section to explicitly define the baselines (naive LLM prompting, sequential non-DAG execution, and non-parallel scheduler), describe the exact composition and sourcing of the 92 TSGs plus associated incident logs, provide a categorized error analysis of the failure cases, and report statistical significance tests (e.g., McNemar’s test for success-rate differences and paired t-tests for time/token metrics).
Revision: yes
- Referee: [Offline Preprocessing] Offline Preprocessing stage (abstract and methods description): no quantified accuracy metrics (e.g., precision/recall against human-annotated DAGs, failure rates on control-flow constructs, or QPP query correctness) are reported for the LLM extraction of execution DAGs and Query Preparation Plugins from the 92 unstructured TSGs. Since online success presupposes reliable extraction, this is load-bearing for interpreting end-to-end robustness.
Authors: We acknowledge that independent quantitative metrics for the extraction stage would strengthen claims of end-to-end robustness. Our original study did not produce human-annotated ground-truth DAGs or QPPs; validation occurred indirectly via online success. In revision we will add a qualitative assessment based on manual review of a 20-TSG sample, report observed accuracy on control-flow constructs and QPP generation, include representative extraction examples, and break down online failures attributable to preprocessing. A full precision/recall study against new human annotations was not performed and would require additional effort beyond the current work.
Revision: partial
- Referee: [Results] Results on parallel execution: the reported 32.9% to 70.4% execution time reduction for parallelizable TSGs requires explicit description of how parallelism was identified in the DAGs, how the scheduler overhead was measured, and the subset of TSGs to which the range applies.
Authors: We will revise the Results section to clarify these points: parallelism is identified by detecting independent steps with no data or control dependencies via the DAG’s topological order and dependency graph; scheduler overhead was measured separately as the time for dependency checks and task queuing (observed to be <2 s per TSG on average); and the reported range applies to the subset of TSGs containing at least one parallelizable branch (we will state the exact count and also report mean reduction with standard deviation).
Revision: yes
- Full quantitative precision/recall evaluation of DAG and QPP extraction against human-annotated ground truth was not conducted in the original study.
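If human-annotated DAGs were available, the missing evaluation could start from edge-level precision and recall. The sketch below assumes each DAG is stored as a map from a step to its prerequisite steps and that step names are already aligned between the two graphs; both are assumptions, and the example graphs are invented.

```python
def edge_set(deps):
    """Flatten a step -> prerequisites map into (prerequisite, step) edges."""
    return {(prereq, step) for step, prereqs in deps.items() for prereq in prereqs}


def edge_precision_recall(extracted, annotated):
    """Edge-level precision and recall of an extracted DAG against annotations.

    By convention here, an empty edge set scores 1.0 rather than dividing by zero.
    """
    predicted, gold = edge_set(extracted), edge_set(annotated)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(gold) if gold else 1.0
    return precision, recall


# Invented example: the extraction serialized two steps that were independent.
annotated = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
extracted = {"b": {"a"}, "c": {"b"}, "d": {"c"}}
precision, recall = edge_precision_recall(extracted, annotated)  # 2/3 and 2/4
```

Edge-level scores would not capture errors inside a step, such as a wrong query or branch condition, so they would complement rather than replace the manual review the authors describe.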
Circularity Check
No circularity: empirical system evaluation against external baselines
The paper describes an engineering artifact (StepFly) with a three-stage workflow for TSG automation and reports an empirical success rate of ~94% on real-world incidents and TSGs, measured against separate baselines. No equations, first-principles derivations, or predictions appear in the manuscript. The central performance claims rest on direct execution measurements rather than any quantity defined in terms of itself or fitted to the target metric. No load-bearing self-citations or ansatzes are invoked to justify the results; the evaluation is self-contained against external benchmarks and public code.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can accurately interpret complex control flow and data-intensive queries in real-world TSGs to produce correct DAGs and QPPs
invented entities (2)
- TSG Mentor: no independent evidence
- Query Preparation Plugins (QPPs): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem:
offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs)
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono · unclear
Relation between the paper passage and the cited Recognition theorem:
DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.
- SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.
- ActionNex: A Virtual Outage Manager for Cloud Computing
ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.