arxiv: 2605.26521 · v1 · pith:T72IRYKTnew · submitted 2026-05-26 · 💻 cs.SE

Testing Agentic Workflows with Structural Coverage Criteria

Pith reviewed 2026-06-29 16:14 UTC · model grok-4.3

classification 💻 cs.SE
keywords multi-agent workflows · structural coverage · test adequacy · coordination graph · delegation paths · tool access rules · workflow testing · agentic systems

The pith

A typed coordination graph turns multi-agent workflow specifications into structural coverage obligations that tests must exercise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to represent each workflow as a typed coordination graph whose nodes and edges capture agents, allowed tool uses, restricted tool uses, and delegation paths. Coverage obligations are then derived directly from the reachable parts of this graph. Tests are generated to meet those obligations so that success indicates the declared structure has been exercised rather than only that a task completed. A reader would care because end-to-end success scores alone leave open whether tool-access rules, restrictions, or delegation paths were ever used or have regressed. The method therefore supplies an additional, measurable adequacy layer for test suites.

Core claim

The approach extracts a typed coordination graph from a workflow specification, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and produces executable tests whose witnesses demonstrate that those obligations have been met at runtime. Evaluation on ten benchmarks shows that generated scenarios can witness substantial fractions of the allowed-tool and delegation obligations and can elicit restricted-call violations that separate workflows whose restrictions hold from those with concrete misrouting.
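
As a minimal sketch of the derivation described above, the following Python models a workflow as a typed coordination graph and enumerates obligations over the part reachable from an entry agent. The class and field names (`Workflow`, `allowed`, `restricted`, `delegates`), the obligation labels, and the toy workflow are illustrative assumptions, not the paper's prototype or the OpenAI Agents SDK API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical normalized specification: one entry agent plus typed edges."""
    entry: str
    allowed: dict[str, set[str]] = field(default_factory=dict)     # agent -> tools it may call
    restricted: dict[str, set[str]] = field(default_factory=dict)  # agent -> tools it must not call
    delegates: dict[str, set[str]] = field(default_factory=dict)   # agent -> agents it may hand off to


def reachable_agents(wf: Workflow) -> set[str]:
    """Breadth-first search from the entry agent along delegation edges."""
    seen, queue = {wf.entry}, deque([wf.entry])
    while queue:
        agent = queue.popleft()
        for target in wf.delegates.get(agent, set()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def derive_obligations(wf: Workflow) -> set[tuple[str, str, str]]:
    """One obligation per reachable agent and per typed edge leaving a reachable agent."""
    obligations = set()
    for agent in reachable_agents(wf):
        obligations.add(("Agent", agent, agent))
        obligations |= {("Allowed", agent, t) for t in wf.allowed.get(agent, set())}
        obligations |= {("Restricted", agent, t) for t in wf.restricted.get(agent, set())}
        obligations |= {("Delegate", agent, a) for a in wf.delegates.get(agent, set())}
    return obligations


# Toy workflow (invented for illustration): "orphan" is declared but unreachable.
wf = Workflow(
    entry="triage",
    allowed={"triage": {"lookup_order"}, "billing": {"refund"}, "orphan": {"debug"}},
    restricted={"triage": {"refund"}},
    delegates={"triage": {"billing"}},
)
obligations = derive_obligations(wf)
for ob in sorted(obligations):
    print(ob)
```

Restricting enumeration to reachable agents keeps unreachable declarations, such as the `orphan` agent here, from producing obligations that no test could ever witness.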

What carries the argument

The typed coordination graph, which encodes the declared agents, tool-access rules, restrictions, and delegation paths as nodes and edges and supplies the coverage obligations that generated tests must satisfy.

If this is right

  • Test suites acquire an independent structural adequacy measure that can be checked without reference to task success.
  • Restricted-tool obligations can surface concrete misrouting failures that end-to-end evaluation may miss.
  • Gaps in delegation coverage become detectable before deployment.
  • Workflows can be compared by how many of their declared structural obligations their test suites actually meet.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph could be used to prioritize which parts of a workflow need additional semantic tests once structural coverage is reached.
  • Coverage gaps identified early could guide refinement of the workflow specification itself rather than only its tests.
  • The approach might apply to any workflow system that exposes explicit agent, tool, and delegation declarations.

Load-bearing premise

The coordination graph extracted from the specification accurately and completely represents the runtime coordination behavior of the workflow.

What would settle it

A workflow whose generated test suite achieves full graph coverage while its runtime traces show tool calls or delegations that violate the graph's declared edges, or leave declared edges unexercised.
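
The check this test calls for can be sketched as a set difference between declared edges and edges observed in runtime traces. The edge tuples and the observed data below are assumptions made for illustration; the paper's trace schema is not given here.

```python
# Declared edges of a hypothetical coordination graph: (kind, source, target).
declared = {
    ("tool", "triage", "lookup_order"),
    ("tool", "billing", "refund"),
    ("delegate", "triage", "billing"),
}

# Edges observed across all runtime traces of a test suite (invented data).
observed = {
    ("tool", "triage", "lookup_order"),
    ("delegate", "triage", "billing"),
    ("tool", "triage", "refund"),  # not declared: the graph missed this behavior
}

undeclared = observed - declared   # runtime behavior absent from the graph
unexercised = declared - observed  # declared structure never seen at runtime

print("undeclared:", sorted(undeclared))
print("unexercised:", sorted(unexercised))
```

If a suite reports full coverage and either set is non-empty, the graph and the runtime disagree, which is the counter-example the premise needs to rule out.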

Figures

Figures reproduced from arXiv: 2605.26521 by Mojtaba Bagherzadeh, Nafiseh Kahani.

Figure 1
Figure 1 shows the corresponding coordination graph. Rounded nodes denote agents, rectangular nodes denote tools, solid edges denote allowed tool access, and dotted edges denote delegation. For readability, the figure shows only the two allowed-tool edges and the four delegation edges; the four restricted-tool edges are listed in the caption. Listing 1 shows the normalized specification used by the prototype. The l…
Figure 3
Figure 3 shows a real RealizeDelegate run on the oai_message_filter workflow, for the obligation Delegate(assistant_2 → spanish_assistant). Although oai_customer_service remains the paper’s main running example, we use oai_message_filter here because it provides a clearer illustration of multi-attempt runtime-grounded refinement. Early attempts ask for Spanish Objective. Delegate(assistant_2 → spanish_assistant) on…
Original abstract

Multi-agent systems increasingly expose explicit workflow structure: agents, tools, tool-access rules, restrictions, and delegation paths. Existing evaluations rely largely on end-to-end task success, benchmark scores, final-response quality, or prompt-level checks, which provide limited evidence that this declared coordination structure has actually been exercised. This makes it difficult to assess test-suite adequacy or detect structural regressions in tool access, restrictions, and inter-agent delegation. We address this gap with a structural testing approach for multi-agent workflow specifications. The approach represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and uses coverage-driven generation with DSPy-based scenario realization to produce executable tests. The graph fixes what must be covered; DSPy realizes those obligations as natural-language scenarios whose witnesses are checked at runtime. We implement the approach for OpenAI Agents SDK-style workflows and evaluate it on ten SDK-derived benchmarks comprising 49 reachable agents, 47 tools, and 403 structural obligations. Generated scenarios witness 54/75 allowed-tool obligations and 36/48 delegation obligations within a bounded refinement budget. The adversarial restricted-tool criterion elicits 23/248 restricted-call violations, separating workflows whose restrictions hold under probing from workflows with concrete misrouting failures. These results show that structural coverage provides a useful adequacy layer for multi-agent workflow testing: it does not replace semantic or end-to-end evaluation, but reveals whether declared agents, tool-access rules, restrictions, and delegation paths have been exercised.
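
The division of labor in the abstract, where the graph fixes the obligation and a generator proposes scenarios until a runtime witness appears or the budget runs out, can be sketched as a bounded loop. Everything here is a stand-in: `propose` and `run_workflow` are hypothetical stubs replacing the DSPy module and the SDK execution, and the budget value is arbitrary.

```python
from typing import Callable, Optional

Edge = tuple[str, str, str]  # (kind, source, target)


def realize(
    obligation: Edge,
    propose: Callable[[Edge, list[str]], str],
    run_workflow: Callable[[str], set[Edge]],
    budget: int = 3,
) -> tuple[Optional[str], int]:
    """Return (witnessing scenario or None, attempts used) within a bounded budget."""
    feedback: list[str] = []
    for attempt in range(1, budget + 1):
        scenario = propose(obligation, feedback)
        observed = run_workflow(scenario)
        if obligation in observed:
            return scenario, attempt  # runtime witness found
        # Ground the next attempt in what actually happened at runtime.
        feedback.append(f"attempt {attempt}: observed {sorted(observed)}")
    return None, budget


# Stubs for illustration only: the fake workflow delegates when the prompt is in Spanish.
def propose(obligation: Edge, feedback: list[str]) -> str:
    return "Hola, necesito ayuda" if feedback else "Please answer in Spanish"


def run_workflow(scenario: str) -> set[Edge]:
    if scenario.startswith("Hola"):
        return {("delegate", "assistant_2", "spanish_assistant")}
    return set()


target = ("delegate", "assistant_2", "spanish_assistant")
scenario, attempts = realize(target, propose, run_workflow)
print(scenario, attempts)
```

Keeping the witness check outside the generator means a scenario counts only when the runtime exhibits the edge, regardless of what the generator claims about it.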

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a structural testing approach for multi-agent workflows. Workflows are modeled as typed coordination graphs from which coverage obligations are derived for reachable agents, allowed/restricted tool edges, and delegation edges. DSPy-based generation produces executable scenarios to satisfy these obligations, and the method is evaluated on ten OpenAI Agents SDK-derived benchmarks (49 agents, 47 tools, 403 obligations), reporting 54/75 allowed-tool coverage, 36/48 delegation coverage, and 23/248 restricted violations detected.
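
For reference, the reported counts convert to rates as follows. The snippet only restates the figures above and computes simple ratios; the dictionary keys are labels chosen for this illustration.

```python
from fractions import Fraction

# Counts as reported in the abstract: (witnessed or elicited, total obligations).
reported = {
    "allowed-tool": (54, 75),
    "delegation": (36, 48),
    "restricted-tool violations": (23, 248),
}

rates = {name: Fraction(hit, total) for name, (hit, total) in reported.items()}
for name, rate in rates.items():
    print(f"{name}: {float(rate):.1%}")

# Share of the 403 structural obligations covered by the three edge criteria.
edge_total = sum(total for _, total in reported.values())
print(edge_total, 403 - edge_total)
```

The three edge criteria account for 371 of the 403 obligations; the abstract does not break out the remaining 32.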

Significance. If the graph-to-runtime mapping holds, the work supplies a useful complementary adequacy criterion for agentic systems that focuses on exercise of declared coordination structure rather than only end-to-end task success. The separation of obligation derivation from scenario realization and the concrete numerical results on restriction probing are positive features.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the reported figures (54/75 allowed-tool, 36/48 delegation, 23/248 restricted) are given without any description of how obligations were enumerated from the graphs, how runtime witnesses were validated, or the procedure used to select the ten benchmarks independently of the proposed method. These details are load-bearing for interpreting the empirical support for the adequacy claim.
  2. [Approach / Evaluation] Approach and Evaluation sections: the central claim that satisfying graph-derived obligations exercises the actual coordination structure rests on the unvalidated assumption that the extracted typed coordination graph faithfully and completely represents runtime behavior; no dynamic trace comparison or counter-example analysis is supplied to support or bound this mapping.
minor comments (1)
  1. Define 'reachable agents' and the precise counting of structural obligations more explicitly (e.g., in a dedicated subsection or table) to support independent reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight areas where additional clarity would strengthen the presentation of the empirical results and the scope of the central claim. We address each major comment below, indicating planned revisions.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the reported figures (54/75 allowed-tool, 36/48 delegation, 23/248 restricted) are given without any description of how obligations were enumerated from the graphs, how runtime witnesses were validated, or the procedure used to select the ten benchmarks independently of the proposed method. These details are load-bearing for interpreting the empirical support for the adequacy claim.

    Authors: We agree that the abstract and Evaluation section would benefit from explicit descriptions of these procedures. In the revised manuscript we will add a dedicated subsection in Evaluation that (1) details obligation enumeration by graph traversal over reachable agents, allowed/restricted tool edges, and delegation edges; (2) describes the runtime witness validation process, which inspects SDK execution logs for the presence of the required edges and agents; and (3) states that the ten benchmarks were drawn from publicly available OpenAI Agents SDK examples chosen for diversity in agent count and tool usage, without reference to the coverage criteria. These additions will make the reported coverage numbers interpretable. revision: yes

  2. Referee: [Approach / Evaluation] Approach and Evaluation sections: the central claim that satisfying graph-derived obligations exercises the actual coordination structure rests on the unvalidated assumption that the extracted typed coordination graph faithfully and completely represents runtime behavior; no dynamic trace comparison or counter-example analysis is supplied to support or bound this mapping.

    Authors: The coordination graph is extracted directly from the workflow specification (agent declarations, tool-access rules, and delegation paths). The adequacy claim concerns whether the declared structure is exercised, not whether the graph captures every possible runtime behavior. We acknowledge that the manuscript provides no dynamic trace comparison or counter-example analysis to bound the fidelity of the extraction. In revision we will add an explicit limitations paragraph in the Approach section clarifying the scope of the claim and noting that full runtime validation of the graph-to-execution mapping remains future work. revision: partial
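
The witness validation described in the first response, which inspects execution logs for the required edges and agents, might look like the following sketch. The log record shape (`agent`, `event`, `target`) and the obligation labels are assumptions made for illustration and are not the OpenAI Agents SDK's actual log format.

```python
# Hypothetical execution log: one dict per runtime event.
log = [
    {"agent": "triage", "event": "tool_call", "target": "lookup_order"},
    {"agent": "triage", "event": "handoff", "target": "billing"},
    {"agent": "billing", "event": "tool_call", "target": "refund"},
]

obligations = [
    ("Agent", "billing", "billing"),
    ("Allowed", "triage", "lookup_order"),
    ("Delegate", "triage", "billing"),
    ("Restricted", "triage", "refund"),
]


def witnessed(obligation, log):
    """True if some log record exhibits the agent or edge named by the obligation."""
    kind, source, target = obligation
    if kind == "Agent":
        return any(rec["agent"] == source for rec in log)
    event = "handoff" if kind == "Delegate" else "tool_call"
    return any(
        rec["agent"] == source and rec["event"] == event and rec["target"] == target
        for rec in log
    )


# For Restricted obligations a witness is a violation, not a success.
report = {ob: witnessed(ob, log) for ob in obligations}
for ob, hit in report.items():
    print(ob, "witnessed" if hit else "not witnessed")
```

In this toy log the `refund` call comes from `billing`, so the restricted edge from `triage` is not witnessed and no violation is recorded.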

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives coverage obligations directly from the typed coordination graph representation of the workflow specification, with no equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations. The central claims rest on the graph extraction and DSPy-based realization steps, which are defined independently without renaming known results or importing uniqueness theorems from prior author work. The evaluation numbers (e.g., 54/75) are presented as empirical outcomes under the stated assumptions rather than forced by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes the workflow specification is a faithful static model of runtime behavior.

pith-pipeline@v0.9.1-grok · 5803 in / 1078 out tokens · 31402 ms · 2026-06-29T16:14:15.641228+00:00 · methodology

