AutomationBench
Pith reviewed 2026-05-10 02:42 UTC · model grok-4.3
The pith
AutomationBench shows that frontier AI models score below 10% on realistic cross-application workflow tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutomationBench is a benchmark for AI agents performing cross-application workflow orchestration via REST APIs. Tasks are based on real patterns from Zapier's platform across Sales, Marketing, Operations, Support, Finance, and HR. Agents must autonomously find relevant endpoints, adhere to layered business rules, and operate in environments with irrelevant or misleading records. Evaluation uses programmatic end-state grading that checks whether correct data reached the right systems. Frontier models currently achieve scores below 10 percent.
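The end-state grading described above can be sketched as a comparison between the records a task expects in each system and the records actually present after the agent run. This is an illustrative sketch only; the function and data shapes are assumptions, not the AutomationBench implementation.

```python
def grade_end_state(expected: dict[str, list[dict]],
                    actual: dict[str, list[dict]]) -> bool:
    """Pass only if every expected record appears in the right system.

    `expected` maps system names (e.g. "crm", "calendar") to records that
    must exist after the workflow. Extra records in `actual` are ignored:
    grading checks the final state, not the trajectory taken to reach it.
    """
    for system, records in expected.items():
        final = actual.get(system, [])
        for record in records:
            # A record matches if some final record contains all its fields.
            if not any(all(r.get(k) == v for k, v in record.items())
                       for r in final):
                return False
    return True
```

Note that this style of grader is agnostic to how many API calls the agent made or in what order, which is exactly the property the paper claims for its evaluation.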
What carries the argument
AutomationBench, a benchmark with tasks requiring autonomous REST API discovery, policy adherence, and end-state verification in multi-application environments.
Load-bearing premise
Tasks based on Zapier patterns combined with end-state grading are sufficient to capture the full complexity and edge cases of business automation.
What would settle it
A frontier model achieving over 50% on AutomationBench tasks but still failing to complete similar workflows in live production environments without human assistance.
Figures
Original abstract
Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce AutomationBench, a benchmark for AI agents on cross-application REST API workflow orchestration based on Zapier patterns. It requires agents to handle coordination across apps, discover APIs autonomously, and adhere to policies in realistic environments with irrelevant records. The key empirical result is that even the best current frontier models achieve success rates below 10% on these tasks.
Significance. If the benchmark tasks and grading accurately represent the demands of business automation, this work provides a much-needed, challenging evaluation tool that highlights deficiencies in current AI agents for practical applications. The focus on end-state correctness and real-world inspired tasks strengthens its potential utility for the field.
Major comments (3)
- The statement that 'even the best frontier models currently score below 10%' is presented without any details on the models tested, the number or distribution of tasks, or the specific success metrics used, which is load-bearing for the central claim about the gap in agentic capabilities.
- The description of how tasks are drawn from Zapier workflow patterns lacks specifics on task selection criteria, the introduction of irrelevant/misleading records, and the exact nature of the layered business rules, making it unclear if the benchmark truly tests the claimed complexities or introduces artifacts that affect model performance.
- The programmatic end-state grading is described, but without addressing how the benchmark handles or excludes real-world complications like transient API failures or evolving policies, it is difficult to determine if the low scores indicate model shortcomings or limitations in the evaluation environment.
Minor comments (2)
- Consider adding a sentence on the total number of tasks or domains to give readers a sense of scale.
- The related work section could more explicitly contrast AutomationBench with prior benchmarks like those for web navigation or API calling to highlight the unique combination of requirements.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper introducing AutomationBench. We provide point-by-point responses below and will revise the manuscript accordingly to address the concerns raised.
Point-by-point responses
Referee: The statement that 'even the best frontier models currently score below 10%' is presented without any details on the models tested, the number or distribution of tasks, or the specific success metrics used, which is load-bearing for the central claim about the gap in agentic capabilities.
Authors: We agree that the main claim would benefit from more immediate context. While detailed results are in Section 4 (including models: GPT-4o, Claude 3.5 Sonnet, Llama-3.1-405B; 150 tasks with domain distribution: 25 each in Sales, Marketing, etc.; success metric: programmatic check for correct final state in all involved systems), we will add a concise description of the evaluation setup in the abstract and a new paragraph in the introduction summarizing the experimental findings. This will make the central claim clearer without altering the reported numbers. revision: yes
Referee: The description of how tasks are drawn from Zapier workflow patterns lacks specifics on task selection criteria, the introduction of irrelevant/misleading records, and the exact nature of the layered business rules, making it unclear if the benchmark truly tests the claimed complexities or introduces artifacts that affect model performance.
Authors: We appreciate this feedback on clarity. Section 3.1 describes the curation process: tasks were selected from over 500 Zapier templates by criteria of involving at least three distinct applications and requiring data transformation. Irrelevant records are introduced by seeding each application's database with 30-50% non-target entries that could be returned by broad queries. Layered rules are defined in per-task policy documents that include conditional logic (e.g., 'if amount > 1000 then require approval'). We will add a dedicated subsection with pseudocode for task generation and examples of misleading records to the revised manuscript. revision: yes
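The conditional policy logic the authors cite (e.g. "if amount > 1000 then require approval") can be encoded as data-driven rules evaluated against a task record. The rule encoding and function name below are hypothetical, chosen only to illustrate the "layered rules" idea.

```python
def requires_approval(record: dict, rules: list[dict]) -> bool:
    """Return True if any conditional rule fires for this record.

    Each rule is a dict like {"field": "amount", "op": ">", "value": 1000},
    mirroring the example policy clause quoted in the rebuttal.
    """
    for rule in rules:
        field, op, threshold = rule["field"], rule["op"], rule["value"]
        value = record.get(field)
        if value is None:
            continue  # rule does not apply to records missing the field
        if op == ">" and value > threshold:
            return True
        if op == "==" and value == threshold:
            return True
    return False

rules = [{"field": "amount", "op": ">", "value": 1000}]
```

Because the rules are plain data, a benchmark harness can layer several of them per task and vary them across tasks without changing harness code.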
Referee: The programmatic end-state grading is described, but without addressing how the benchmark handles or excludes real-world complications like transient API failures or evolving policies, it is difficult to determine if the low scores indicate model shortcomings or limitations in the evaluation environment.
Authors: The design choice to use end-state grading in a stable environment is intentional to focus on the agent's ability to orchestrate workflows correctly rather than handle infrastructure issues. Transient failures are excluded by design, as the benchmark provides deterministic API responses; this is stated in Section 2.3. Evolving policies are not included because each task has a fixed policy document. We will expand Section 5 (Limitations) to explicitly discuss these scope decisions and why they were made, including references to how other benchmarks handle similar issues. revision: partial
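The deterministic environment the authors invoke can be pictured as each application's REST API being served from a fixed in-memory store, so identical runs see identical responses and transient failures cannot occur. The class and method names here are assumptions for illustration, not the benchmark's actual interface.

```python
class FakeRestApi:
    """In-memory stand-in for one application's REST API.

    Seeding the store with the same records on every run makes all
    responses deterministic, isolating agent errors from infrastructure
    flakiness -- the design choice the rebuttal defends.
    """

    def __init__(self, seed_records: dict[str, list[dict]]):
        self.tables = {name: list(rows) for name, rows in seed_records.items()}

    def get(self, resource: str) -> list[dict]:
        # Return a copy so callers cannot mutate the store accidentally.
        return list(self.tables.get(resource, []))

    def post(self, resource: str, record: dict) -> dict:
        self.tables.setdefault(resource, []).append(record)
        return record

# Example: a CRM seeded with one contact; the agent writes a second.
crm = FakeRestApi({"contacts": [{"name": "Acme"}]})
crm.post("contacts", {"name": "Globex"})
```

An end-state grader can then read the final tables directly, which is what makes purely programmatic grading feasible.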
Circularity Check
No circularity: empirical benchmark with external grounding
Full rationale
The paper introduces AutomationBench as a new empirical benchmark drawn from real Zapier workflow patterns across domains, with tasks requiring endpoint discovery, policy adherence, and end-state programmatic grading. No derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text or abstract. Performance results (e.g., frontier models scoring below 10%) are presented as direct measurements on the benchmark rather than quantities derived from or equivalent to its own inputs by construction. The central claims rest on external platform data and independent evaluation, making the work self-contained against the circularity criteria.
Reference graph
Works this paper leans on
- [1] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, G. Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR, 2024.
- [2] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, Y. Su. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS, 2023.
- [3] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, T. Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS, 2024.
- [4] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, M. Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. ICLR, 2024.
- [5] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, Y. Li. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. EMNLP, 2023.
- [6] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, N. Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. ACL, 2024.
- [7] S. Yao, N. Shinn, P. Razavi, K. Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024.
- [8] V. Barres, H. Dong, S. Ray, X. Si, K. Narasimhan. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982, 2025.
- [9]
- [10] AutomationBench Leaderboard. https://zapier.com/benchmarks
- [11] AutomationBench Public Repository. https://github.com/zapier/AutomationBench
- [12] Prime Intellect Environment. https://app.primeintellect.ai/dashboard/environments/zapier/AutomationBench