arxiv: 2604.13346 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang , Jerry Huang , Jiarui Yao , Rui Pan , Peizhi Niu , Yaowenqi Liu , Ruida Wang , Renhao Lu

show 2 more authors

Yuwei Guo Tong Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agentsworkflow specificationagent control flowmodular workflowsvisual editoragent harnessexplicit state management

0 comments

The pith

AgentSPEX supplies a dedicated language for defining LLM-agent workflows with explicit branching, loops, parallelism, and state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language-model agents often rely on loose instructions that leave control flow and state implicit or on code frameworks that entangle logic with implementation details. AgentSPEX introduces a specification language that declares typed steps, branching, loops, parallel branches, reusable modules, and state variables in plain form. These definitions execute inside a harness that supplies tool access, a sandboxed environment, checkpointing, verification, and logging, while a visual editor shows both textual and graph representations. The paper supplies ready agents for research tasks and reports results from seven benchmarks plus a user study indicating clearer authoring and inspection than prior methods.

Core claim

AgentSPEX is an Agent Specification and Execution Language that lets users write LLM-agent workflows with typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management. These workflows run inside a customizable agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging, accompanied by a visual editor with synchronized graph and workflow views and sample agents for deep and scientific research.

What carries the argument

The AgentSPEX language for explicit workflow specification together with its execution harness, which together separate control flow and state from low-level implementation so that structure becomes inspectable and reusable.

Load-bearing premise

That moving from implicit prompting or tightly coupled code frameworks to an explicit specification language plus harness will produce better practical control, maintainability, and interpretability in agent workflows.

What would settle it

A side-by-side user study or benchmark run in which participants cannot maintain or understand AgentSPEX workflows more readily than alternative approaches, or in which the reported benchmark gains disappear on re-execution.

Figures

Figures reproduced from arXiv: 2604.13346 by Jerry Huang, Jiarui Yao, Peizhi Niu, Pengcheng Wang, Renhao Lu, Ruida Wang, Rui Pan, Tong Zhang, Yaowenqi Liu, Yuwei Guo.

**Figure 2.** Figure 2: An example of an AgentSPEX workflow for topic research and summarization. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Visual editor interface for a deep research agent implemented with AgentSPEX, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Example of the log viewer for a SWE-Bench Verified instance. [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Example extract_single_citation_module YAML plan for formal verification 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Properties of variables inferred from the YAML task plan [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Example extract_single_citation_module YAML plan for formal verification 17 [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Example of formal verification of trajectory [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

read the original abstract

Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgentSPEX gives a clean DSL plus visual editor for explicit agent workflows, but the claimed benchmark wins and user-study gains are stated without numbers or methods.

read the letter

The main contribution is a specification language that lets you write agent workflows with typed steps, explicit branches/loops/parallelism, reusable submodules, and state, then run them in a harness that adds sandboxing, checkpointing, logging, and verification. A synchronized visual editor is included so you can switch between graph and text views. This directly targets the implicit control flow in plain prompting and the Python entanglement in LangGraph, DSPy, and CrewAI. Ready-made agents for deep research and scientific tasks are a nice practical addition. The design choices around modularity and explicit state look thoughtful and could reduce the usual maintenance headaches once people learn the syntax. The soft spot is the evaluation. The abstract says the system was tested on seven benchmarks plus a user study and came out ahead on interpretability and accessibility, yet no scores, baselines, task details, or statistical results appear. Without those, it is impossible to tell whether the language itself drove any gains or whether the harness and prompting simply differed. The claim that moving to a dedicated language improves real-world control and maintainability therefore rests on an unshown empirical step. This paper is aimed at people who already build or study multi-step LLM agents and want a more declarative alternative to current frameworks. A reader working on agent reliability or tooling would get concrete design ideas from it. The work deserves peer review because the feature set is concrete and the motivation is sound, even though the results section will need careful referee scrutiny to confirm the practical advantages.

Referee Report

2 major / 0 minor

Summary. The paper introduces AgentSPEX, an explicit specification and execution language for LLM-agent workflows that supports typed steps, branching, loops, parallel execution, reusable submodules, and state management. These workflows run in a customizable harness providing tool access, sandboxing, checkpointing, verification, and logging, accompanied by a visual editor with synchronized graph and text views. The authors supply ready-to-use agents for deep and scientific research and claim that evaluation on seven benchmarks plus a user study demonstrates that AgentSPEX offers a more interpretable and accessible authoring paradigm than popular Python-coupled frameworks such as LangGraph.

Significance. If the benchmark and user-study results hold, the work would provide a concrete, maintainable alternative to reactive prompting and tightly coupled orchestration frameworks by separating declarative workflow specification from execution details. The combination of standard control-flow primitives, modularity, explicit state, and a visual editor could improve debuggability and accessibility for complex agent systems. The ready-to-use research agents add immediate practical value.

major comments (2)

[Evaluation section] Evaluation section: the manuscript states that AgentSPEX was evaluated on seven benchmarks and shows superiority, yet no quantitative results, tables, baseline comparisons, metrics, or statistical analysis are presented. This absence prevents verification of the central empirical claim.
[User study section] User study section: the paper asserts that a user study demonstrates greater interpretability and accessibility than an existing framework, but supplies no details on participant count, tasks, protocol, or outcome measures. This evidence is load-bearing for the accessibility claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and for identifying the gaps in our empirical sections. We agree that the current manuscript does not contain the quantitative results, tables, or study details needed to support the central claims, and we will make the requested additions in revision.

read point-by-point responses

Referee: [Evaluation section] Evaluation section: the manuscript states that AgentSPEX was evaluated on seven benchmarks and shows superiority, yet no quantitative results, tables, baseline comparisons, metrics, or statistical analysis are presented. This absence prevents verification of the central empirical claim.

Authors: We agree that the evaluation section as currently written lacks all quantitative results, tables, baseline comparisons, metrics, and statistical analysis. Although the experiments on the seven benchmarks were performed, these data were omitted from the submitted draft. In the revised manuscript we will add a complete evaluation section that reports per-benchmark scores, direct comparisons against LangGraph and other frameworks, the exact metrics employed, and any statistical tests performed, so that readers can verify the superiority claims. revision: yes
Referee: [User study section] User study section: the paper asserts that a user study demonstrates greater interpretability and accessibility than an existing framework, but supplies no details on participant count, tasks, protocol, or outcome measures. This evidence is load-bearing for the accessibility claim.

Authors: We acknowledge that the user-study section currently provides no information on participant count, tasks, protocol, or outcome measures. In the revision we will expand the section to describe the full study design, the number of participants, the concrete authoring tasks assigned, the experimental protocol, and both quantitative (e.g., task-completion time, error rates) and qualitative outcome measures that support the interpretability and accessibility claims relative to the compared framework. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely descriptive system design

full rationale

The paper introduces AgentSPEX as a workflow specification language and harness with standard control-flow features (typed steps, branching, loops, parallelism, submodules, state management) plus a visual editor and ready agents. It evaluates the system on 7 benchmarks and a user study for interpretability and accessibility. No equations, derivations, fitted parameters, predictions, or first-principles claims appear anywhere in the provided text or abstract. The contribution is a descriptive engineering artifact whose correctness rests on external empirical evaluation rather than any internal reduction to its own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The argument structure is self-contained against external benchmarks and user feedback.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The paper's central contribution is the invention of a new specification language and harness rather than a derivation from prior axioms or data fits; no free parameters or mathematical axioms are invoked.

invented entities (1)

AgentSPEX language and harness no independent evidence
purpose: To provide explicit control flow and modular structure for LLM agent workflows
The language itself is the novel artifact introduced by the paper.

pith-pipeline@v0.9.0 · 5548 in / 1258 out tokens · 41631 ms · 2026-05-10T14:50:53.827873+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

21 extracted references · 17 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1]

How is chatgpt’s behav- ior changing over time? arXiv preprint arXiv:2307.09009, 2023

URL https:// artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time?arXiv preprint arXiv:2307.09009,

work page arXiv
[2]

URLhttps://arxiv.org/abs/2601.16206. CrewAI. Crewai,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, and Liang He

URLhttps://arxiv.org/abs/2510.10549. Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, and Liang He. A survey on the optimization of large language model-based agents.ACM Computing Surveys, 58(9):1–37, February

work page arXiv
[4]

doi: 10.1145/3789261

ISSN 1557-7341. doi: 10.1145/3789261. URL http: //dx.doi.org/10.1145/3789261. Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. In Christos Christodoulopoulos, Tanmoy Chakraborty, Car...

work page doi:10.1145/3789261 2025
[5]

ISBN 979-8-89176-335-7

Association for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/ v1/2025.findings-emnlp.1264. URL https://aclanthology.org/2025.findings-emnlp. 1264/. Google. Gemini deep research — your personal research assistant,

2025
[6]

ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

URLhttps://arxiv.org/abs/2510.00615. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. InThe Twelfth International...

work page arXiv
[7]

org/abs/2507.08870

URL https://arxiv. org/abs/2507.08870. Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292,

work page arXiv
[8]

Nature Chemistry , volume =

ISSN 1755-4349. doi: 10.1038/s41557-025-01815-x. URL https://doi.org/10.1038/s41557-025-01815-x. Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In André Platzer and Geoff Sutcliffe (eds.),Automated Deduction – CADE 28, pp. 625–635, Cham,

work page doi:10.1038/s41557-025-01815-x
[9]

MemGPT: Towards LLMs as Operating Systems

URL https: //arxiv.org/abs/2310.08560. Lawrence C Paulson.Isabelle: A generic theorem prover. Springer,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

URL https://arxiv.org/ abs/2205.01833. Ron F. Del Rosario, Klaudia Krawiecka, and Christian Schroeder de Witt. Architecting resilient llm agents: A guide to secure plan-then-execute implementations,

work page arXiv
[11]

Architecting resilient llm agents: A guide to secure plan-then-execute implementations,

URL https://arxiv.org/abs/2509.08646. Mandana Vaziri, Louis Mandel, Claudio Spiess, and Martin Hirzel. Pdl: A declarative prompt programming language,

work page arXiv
[12]

2024.PDL: A Declarative Prompt Programming Language

URLhttps://arxiv.org/abs/2410.19135. Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models, 2024a. URL https://arxiv.org/abs/2307.10635. Xingyao Wang, Boxuan Chen, Ziniu Li, Yu...

work page arXiv
[13]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

URLhttps://arxiv.org/abs/2308.08155. Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Wang Zi- jia, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang. Writingbench: A comprehen- sive benchmark for generative writing. InThe Thirty-ninth Annual Conference on Neu- ral Information Processing Systems Datasets and Benchmarks T rack,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

CoRR , volume =

URL https: //arxiv.org/abs/2511.13646. John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable auto- mated software engineering. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang (eds.),Advances in Neural Information Proc...

work page arXiv
[15]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf

doi: 10.52202/ 079017-1601. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations,

2024
[16]

org/abs/2602.23668

URL https://arxiv. org/abs/2602.23668. 12 Preprint. Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, and Jiaxuan You. TinyScientist: An interactive, extensible, and con- trollable framework for building research agents. In Ivan Habernal, Peter Schulam, and Jörg Tiedemann (eds.),Proceedings of the 2025 Co...

work page arXiv 2025
[17]

InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Association for Computational Linguistics. ISBN 979-8-89176-334-0. doi: 10.18653/v1/ 2025.emnlp-demos.41. URLhttps://aclanthology.org/2025.emnlp-demos.41/. Sirui Zeng and Xifeng Yan. Adl: A declarative language for agent-based chatbots,

work page doi:10.18653/v1/ 2025
[18]

A Evaluation Details We evaluate on seven diverse benchmarks spanning five domains, summarized in Table

URLhttps://arxiv.org/abs/2504.14787. A Evaluation Details We evaluate on seven diverse benchmarks spanning five domains, summarized in Table

work page arXiv
[19]

500 sb-cli Mathematics AIME 2025 (Art of Problem Solving,

2025
[20]

A.1 SWE-Bench Verified Chen et al

403 Exact Match Table 4: Summary of evaluation benchmarks. A.1 SWE-Bench Verified Chen et al. (2023) documented significant variation in LLM outputs over time, motivating the need for rapid agent iteration. Therefore, an important but often overlooked dimension of agent framework evaluation is model-version robustness: whether an agent system maintains or...

2023
[21]

e x t r a c t _ s i n g l e _ c i t a t i o n _ m o d u l e

Survey participants generally all had prior programming experience, but had varied levels of experience building agents. ID Type Question Background Background How much agent development experience do you have? Q0 Comprehension What task are the agent declarations doing? Q1 Preference Which implementation is easier to read and understand? Q2 Preference Wh...

work page arXiv