Testing Agentic Workflows with Structural Coverage Criteria

Mojtaba Bagherzadeh; Nafiseh Kahani

arxiv: 2605.26521 · v1 · pith:T72IRYKTnew · submitted 2026-05-26 · 💻 cs.SE


Pith reviewed 2026-06-29 16:14 UTC · model grok-4.3

classification 💻 cs.SE

keywords: multi-agent workflows, structural coverage, test adequacy, coordination graph, delegation paths, tool access rules, workflow testing, agentic systems


The pith

A typed coordination graph turns multi-agent workflow specifications into structural coverage obligations that tests must exercise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to represent each workflow as a typed coordination graph whose nodes and edges capture agents, allowed tool uses, restricted tool uses, and delegation paths. Coverage obligations are then derived directly from the reachable parts of this graph. Tests are generated to meet those obligations so that success indicates the declared structure has been exercised rather than only that a task completed. A reader would care because end-to-end success scores alone leave open whether tool-access rules, restrictions, or delegation paths were ever used or have regressed. The method therefore supplies an additional, measurable adequacy layer for test suites.

Core claim

The approach extracts a typed coordination graph from a workflow specification, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and produces executable tests whose witnesses demonstrate that those obligations have been met at runtime. Evaluation on ten benchmarks shows that generated scenarios can witness substantial fractions of the allowed-tool and delegation obligations and can elicit restricted-call violations that separate workflows whose restrictions hold from those with concrete misrouting.

What carries the argument

The typed coordination graph, which encodes the declared agents, tool-access rules, restrictions, and delegation paths as nodes and edges and supplies the coverage obligations that generated tests must satisfy.
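A minimal sketch of how such a graph and its obligations might be represented. The class name, field names, agent and tool names, and the choice to define reachability as breadth-first search from an entry agent along delegation edges are all assumptions made for illustration; the paper's own data model is not shown on this page.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CoordinationGraph:
    """Hypothetical typed coordination graph; all names are illustrative."""
    entry: str                                    # agent that receives the user request
    allowed: set = field(default_factory=set)     # (agent, tool) pairs the agent may call
    restricted: set = field(default_factory=set)  # (agent, tool) pairs the agent must not call
    delegation: set = field(default_factory=set)  # (source_agent, target_agent) pairs

    def reachable_agents(self):
        """Agents reachable from the entry agent along delegation edges (BFS)."""
        seen, queue = {self.entry}, deque([self.entry])
        while queue:
            current = queue.popleft()
            for src, dst in self.delegation:
                if src == current and dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
        return seen

    def obligations(self):
        """One obligation per reachable agent and per edge leaving a reachable agent."""
        agents = self.reachable_agents()
        obs = {("Agent", a) for a in agents}
        obs |= {("Allowed", a, t) for a, t in self.allowed if a in agents}
        obs |= {("Restricted", a, t) for a, t in self.restricted if a in agents}
        obs |= {("Delegate", s, d) for s, d in self.delegation if s in agents}
        return obs


# Toy workflow: "orphan" is declared but never delegated to, so it yields no obligations.
graph = CoordinationGraph(
    entry="triage",
    allowed={("billing", "lookup_invoice"), ("orphan", "unused_tool")},
    restricted={("triage", "lookup_invoice")},
    delegation={("triage", "billing"), ("billing", "triage")},
)
```

Restricting obligations to reachable agents keeps the adequacy target achievable: an edge hanging off an agent that no delegation path can reach could never be witnessed by any test.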

If this is right

Test suites acquire an independent structural adequacy measure that can be checked without reference to task success.
Restricted-tool obligations can surface concrete misrouting failures that end-to-end evaluation may miss.
Gaps in delegation coverage become detectable before deployment.
Workflows can be compared by how many of their declared structural obligations their test suites actually meet.
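The restricted-tool consequence can be illustrated with a small check over runtime traces. This is a sketch under stated assumptions: a trace is modelled as a list of hypothetical `(agent, tool)` call records, and the workflow, agent, and tool names are invented for the example.

```python
def restricted_violations(restricted, trace):
    """Return the restricted (agent, tool) pairs that were actually called."""
    return sorted({(agent, tool) for agent, tool in trace if (agent, tool) in restricted})


def classify(workflows):
    """Split workflows into those whose restrictions held and those with misrouting."""
    held, misrouted = [], {}
    for name, (restricted, trace) in workflows.items():
        hits = restricted_violations(restricted, trace)
        if hits:
            misrouted[name] = hits
        else:
            held.append(name)
    return held, misrouted


# Hypothetical data: each workflow maps to (restricted edges, observed calls).
workflows = {
    "wf_ok": ({("triage", "refund")}, [("billing", "refund"), ("triage", "search")]),
    "wf_misrouted": ({("triage", "refund")}, [("triage", "refund"), ("triage", "refund")]),
}
held, misrouted = classify(workflows)
```

A workflow whose task still completes can land in `misrouted`, which is the kind of failure an end-to-end success score would not report.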

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same graph could be used to prioritize which parts of a workflow need additional semantic tests once structural coverage is reached.
Coverage gaps identified early could guide refinement of the workflow specification itself rather than only its tests.
The approach might apply to any workflow system that exposes explicit agent, tool, and delegation declarations.

Load-bearing premise

The coordination graph extracted from the specification accurately and completely represents the runtime coordination behavior of the workflow.

What would settle it

A workflow whose generated test suite achieves full graph coverage yet runtime traces show tool calls or delegations that violate the graph's declared edges or leave declared edges unexercised.
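One way such a check could be run, sketched under the assumption that runtime traces can be reduced to lists of `(agent, tool)` calls and `(source, target)` handoffs. The function and key names are hypothetical and are not taken from the paper or from any SDK.

```python
def audit_trace(declared_allowed, declared_delegation, tool_calls, handoffs):
    """Compare declared graph edges with the edges observed at runtime."""
    observed_tools = set(tool_calls)
    observed_handoffs = set(handoffs)
    return {
        # Behavior the graph does not declare as allowed: the graph may be incomplete.
        "undeclared_tool_calls": observed_tools - declared_allowed,
        "undeclared_delegations": observed_handoffs - declared_delegation,
        # Declared edges never seen at runtime: coverage was not actually reached.
        "unexercised_allowed": declared_allowed - observed_tools,
        "unexercised_delegations": declared_delegation - observed_handoffs,
    }


# Hypothetical declared edges and observed trace for one workflow.
allowed = {("billing", "lookup_invoice"), ("billing", "issue_refund")}
delegation = {("triage", "billing")}
tool_calls = [("billing", "lookup_invoice"), ("triage", "issue_refund")]
handoffs = [("triage", "billing")]

report = audit_trace(allowed, delegation, tool_calls, handoffs)
graph_matches_runtime = not any(report.values())
```

If a suite reports full coverage while any of the four sets is non-empty, the premise that the extracted graph mirrors runtime behavior would be in doubt for that workflow.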

Figures

Figures reproduced from arXiv: 2605.26521 by Mojtaba Bagherzadeh, Nafiseh Kahani.

**Figure 1.** The corresponding coordination graph. Rounded nodes denote agents, rectangular nodes denote tools, solid edges denote allowed tool access, and dotted edges denote delegation. For readability, the figure shows only the two allowed-tool edges and the four delegation edges; the four restricted-tool edges are listed in the caption. Listing 1 shows the normalized specification used by the prototype. The l…

**Figure 3.** A real RealizeDelegate run on the oai_message_filter workflow, for the obligation Delegate(assistant_2 → spanish_assistant). Although oai_customer_service remains the paper's main running example, we use oai_message_filter here because it provides a clearer illustration of multi-attempt runtime-grounded refinement. Early attempts ask for Spanish … [figure text: Objective. Delegate(assistant_2 → spanish_assistant) on…]

Original abstract

Multi-agent systems increasingly expose explicit workflow structure: agents, tools, tool-access rules, restrictions, and delegation paths. Existing evaluations rely largely on end-to-end task success, benchmark scores, final-response quality, or prompt-level checks, which provide limited evidence that this declared coordination structure has actually been exercised. This makes it difficult to assess test-suite adequacy or detect structural regressions in tool access, restrictions, and inter-agent delegation. We address this gap with a structural testing approach for multi-agent workflow specifications. The approach represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and uses coverage-driven generation with DSPy-based scenario realization to produce executable tests. The graph fixes what must be covered; DSPy realizes those obligations as natural-language scenarios whose witnesses are checked at runtime. We implement the approach for OpenAI Agents SDK-style workflows and evaluate it on ten SDK-derived benchmarks comprising 49 reachable agents, 47 tools, and 403 structural obligations. Generated scenarios witness 54/75 allowed-tool obligations and 36/48 delegation obligations within a bounded refinement budget. The adversarial restricted-tool criterion elicits 23/248 restricted-call violations, separating workflows whose restrictions hold under probing from workflows with concrete misrouting failures. These results show that structural coverage provides a useful adequacy layer for multi-agent workflow testing: it does not replace semantic or end-to-end evaluation, but reveals whether declared agents, tool-access rules, restrictions, and delegation paths have been exercised.
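The reported counts convert to rates with simple arithmetic. The sketch below uses only numbers stated in the abstract; treating the three denominators as disjoint subsets of the 403 obligations is an assumption, since the abstract does not give the full breakdown.

```python
# Counts taken directly from the abstract: (numerator, denominator).
reported = {
    "allowed-tool obligations witnessed": (54, 75),
    "delegation obligations witnessed": (36, 48),
    "restricted-call violations elicited": (23, 248),
}

rates = {label: hit / total for label, (hit, total) in reported.items()}
for label, (hit, total) in reported.items():
    print(f"{label}: {hit}/{total} = {rates[label]:.1%}")
# allowed-tool obligations witnessed: 54/75 = 72.0%
# delegation obligations witnessed: 36/48 = 75.0%
# restricted-call violations elicited: 23/248 = 9.3%

# Assumption: the three categories are disjoint parts of the 403 obligations.
total_obligations = 403
not_broken_down = total_obligations - sum(total for _, total in reported.values())
print(not_broken_down)  # 32 obligations whose category the abstract does not itemize
```

The first two rates measure coverage achieved, while the third measures how often adversarial probing succeeded in triggering a forbidden call, so a low value there is the desirable outcome.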

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps agent workflows to typed coordination graphs, derives structural coverage obligations, and uses DSPy to generate tests that hit those obligations, yielding concrete numbers on ten benchmarks.


The paper's main contribution is turning explicit multi-agent workflow specs into typed coordination graphs, then pulling out coverage obligations for agents, allowed tool edges, restricted tool edges, and delegation paths. DSPy handles turning those obligations into natural-language scenarios that get executed and checked. On the ten SDK-derived benchmarks they report hitting 54 of 75 allowed-tool obligations, 36 of 48 delegation obligations, and surfacing 23 restricted-call violations.
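The generate, execute, and check cycle described here can be sketched as a bounded refinement loop. Everything in this block is hypothetical: `realize`, the stub generator, and the stub executor stand in for the DSPy-based realizer and the SDK run, whose actual interfaces are not shown on this page.

```python
def realize(obligation, generate, execute, budget=3):
    """Try up to `budget` scenarios; return (scenario, attempt) once the obligation is witnessed."""
    feedback = []  # (scenario, observed events) from failed attempts
    for attempt in range(1, budget + 1):
        scenario = generate(obligation, feedback)
        events = execute(scenario)
        if obligation in events:  # runtime witness found
            return scenario, attempt
        feedback.append((scenario, events))
    return None, budget  # obligation left unwitnessed within the budget


def stub_generate(obligation, feedback):
    """Stand-in for an LLM-backed generator: gets more explicit after two failures."""
    _, src, dst = obligation
    if len(feedback) < 2:
        return f"General request #{len(feedback) + 1}"
    return f"Request that {src} should hand off to {dst}"


def stub_execute(scenario):
    """Stand-in for running the workflow and collecting trace events."""
    if "hand off to spanish_assistant" in scenario:
        return [("Delegate", "assistant_2", "spanish_assistant")]
    return []


target = ("Delegate", "assistant_2", "spanish_assistant")
scenario, attempts = realize(target, stub_generate, stub_execute, budget=3)
missed, _ = realize(target, stub_generate, stub_execute, budget=2)
```

With a budget of three the stubbed obligation is witnessed on the third attempt, and with a budget of two it is not. That dependence on the budget is why the reported fractions are qualified as holding "within a bounded refinement budget".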

This is a direct application of classical structural testing ideas to the coordination layer that most current agent evaluations skip. The numbers are useful because they give a measurable check on whether the declared structure actually gets exercised, separate from end-to-end task success.

The soft spots are in the evaluation transparency. The abstract states the coverage counts but gives no detail on how obligations were counted from the graphs or how runtime witnesses were validated. All ten benchmarks come from the same OpenAI Agents SDK style, so it's unclear how much the results depend on that setup or whether the benchmarks were chosen independently. The central assumption, that satisfying the static graph obligations corresponds to real runtime coordination, remains untested beyond the reported runs.

This is for people working on testing multi-agent LLM systems who want something more structured than prompt checks or final-answer quality. A reader focused on software engineering for agents would find the method and the reported fractions worth examining.

I would send it for peer review. The core mapping and the use of DSPy for realization are concrete enough to merit referee feedback, even with the gaps in validation detail.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a structural testing approach for multi-agent workflows. Workflows are modeled as typed coordination graphs from which coverage obligations are derived for reachable agents, allowed/restricted tool edges, and delegation edges. DSPy-based generation produces executable scenarios to satisfy these obligations, and the method is evaluated on ten OpenAI Agents SDK-derived benchmarks (49 agents, 47 tools, 403 obligations), reporting 54/75 allowed-tool coverage, 36/48 delegation coverage, and 23/248 restricted violations detected.

Significance. If the graph-to-runtime mapping holds, the work supplies a useful complementary adequacy criterion for agentic systems that focuses on exercise of declared coordination structure rather than only end-to-end task success. The separation of obligation derivation from scenario realization and the concrete numerical results on restriction probing are positive features.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the reported figures (54/75 allowed-tool, 36/48 delegation, 23/248 restricted) are given without any description of how obligations were enumerated from the graphs, how runtime witnesses were validated, or the procedure used to select the ten benchmarks independently of the proposed method. These details are load-bearing for interpreting the empirical support for the adequacy claim.
[Approach / Evaluation] Approach and Evaluation sections: the central claim that satisfying graph-derived obligations exercises the actual coordination structure rests on the unvalidated assumption that the extracted typed coordination graph faithfully and completely represents runtime behavior; no dynamic trace comparison or counter-example analysis is supplied to support or bound this mapping.

minor comments (1)

Define 'reachable agents' and the precise counting of structural obligations more explicitly (e.g., in a dedicated subsection or table) to support independent reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional clarity would strengthen the presentation of the empirical results and the scope of the central claim. We address each major comment below, indicating planned revisions.


Referee: [Abstract / Evaluation] Abstract and Evaluation section: the reported figures (54/75 allowed-tool, 36/48 delegation, 23/248 restricted) are given without any description of how obligations were enumerated from the graphs, how runtime witnesses were validated, or the procedure used to select the ten benchmarks independently of the proposed method. These details are load-bearing for interpreting the empirical support for the adequacy claim.

Authors: We agree that the abstract and Evaluation section would benefit from explicit descriptions of these procedures. In the revised manuscript we will add a dedicated subsection in Evaluation that (1) details obligation enumeration by graph traversal over reachable agents, allowed/restricted tool edges, and delegation edges; (2) describes the runtime witness validation process, which inspects SDK execution logs for the presence of the required edges and agents; and (3) states that the ten benchmarks were drawn from publicly available OpenAI Agents SDK examples chosen for diversity in agent count and tool usage, without reference to the coverage criteria. These additions will make the reported coverage numbers interpretable.

Revision: yes

Referee: [Approach / Evaluation] Approach and Evaluation sections: the central claim that satisfying graph-derived obligations exercises the actual coordination structure rests on the unvalidated assumption that the extracted typed coordination graph faithfully and completely represents runtime behavior; no dynamic trace comparison or counter-example analysis is supplied to support or bound this mapping.

Authors: The coordination graph is extracted directly from the workflow specification (agent declarations, tool-access rules, and delegation paths). The adequacy claim concerns whether the declared structure is exercised, not whether the graph captures every possible runtime behavior. We acknowledge that the manuscript provides no dynamic trace comparison or counter-example analysis to bound the fidelity of the extraction. In revision we will add an explicit limitations paragraph in the Approach section clarifying the scope of the claim and noting that full runtime validation of the graph-to-execution mapping remains future work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity identified


The paper derives coverage obligations directly from the typed coordination graph representation of the workflow specification, with no equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations. The central claims rest on the graph extraction and DSPy-based realization steps, which are defined independently without renaming known results or importing uniqueness theorems from prior author work. The evaluation numbers (e.g., 54/75) are presented as empirical outcomes under the stated assumptions rather than forced by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes the workflow specification is a faithful static model of runtime behavior.

pith-pipeline@v0.9.1-grok · 5803 in / 1078 out tokens · 31402 ms · 2026-06-29T16:14:15.641228+00:00 · methodology


