AutomationBench
Pith reviewed 2026-05-10 02:42 UTC · model grok-4.3
The pith
AutomationBench shows that frontier AI models score below 10% on realistic cross-application workflow tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutomationBench is a benchmark for AI agents performing cross-application workflow orchestration via REST APIs. Tasks are based on real patterns from Zapier's platform across Sales, Marketing, Operations, Support, Finance, and HR. Agents must autonomously find relevant endpoints, adhere to layered business rules, and operate in environments with irrelevant or misleading records. Evaluation uses programmatic end-state grading that checks whether correct data reached the right systems. Frontier models currently achieve scores below 10 percent.
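The end-state grading described above can be sketched as a comparison between the records a task expects in each system and the records actually present after the agent run. This is an illustrative sketch only; the function and data shapes are assumptions, not the AutomationBench implementation.

```python
def grade_end_state(expected: dict[str, list[dict]],
                    actual: dict[str, list[dict]]) -> bool:
    """Pass only if every expected record appears in the right system.

    `expected` maps system names (e.g. "crm", "calendar") to records that
    must exist after the workflow. Extra records in `actual` are ignored:
    grading checks the final state, not the trajectory taken to reach it.
    """
    for system, records in expected.items():
        final = actual.get(system, [])
        for record in records:
            # A record matches if some final record contains all its fields.
            if not any(all(r.get(k) == v for k, v in record.items())
                       for r in final):
                return False
    return True
```

Note that this style of grader is agnostic to how many API calls the agent made or in what order, which is exactly the property the paper claims for its evaluation.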
What carries the argument
AutomationBench, a benchmark with tasks requiring autonomous REST API discovery, policy adherence, and end-state verification in multi-application environments.
Load-bearing premise
Tasks based on Zapier patterns combined with end-state grading are sufficient to capture the full complexity and edge cases of business automation.
What would settle it
A frontier model achieving over 50% on AutomationBench tasks but still failing to complete similar workflows in live production environments without human assistance.
Figures
Original abstract
Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce AutomationBench, a benchmark for AI agents on cross-application REST API workflow orchestration based on Zapier patterns. It requires agents to handle coordination across apps, discover APIs autonomously, and adhere to policies in realistic environments with irrelevant records. The key empirical result is that even the best current frontier models achieve success rates below 10% on these tasks.
Significance. If the benchmark tasks and grading accurately represent the demands of business automation, this work provides a much-needed, challenging evaluation tool that highlights deficiencies in current AI agents for practical applications. The focus on end-state correctness and real-world inspired tasks strengthens its potential utility for the field.
Major comments (3)
- The statement that 'even the best frontier models currently score below 10%' is presented without any details on the models tested, the number or distribution of tasks, or the specific success metrics used, which is load-bearing for the central claim about the gap in agentic capabilities.
- The description of how tasks are drawn from Zapier workflow patterns lacks specifics on task selection criteria, the introduction of irrelevant/misleading records, and the exact nature of the layered business rules, making it unclear if the benchmark truly tests the claimed complexities or introduces artifacts that affect model performance.
- The programmatic end-state grading is described, but without addressing how the benchmark handles or excludes real-world complications like transient API failures or evolving policies, it is difficult to determine if the low scores indicate model shortcomings or limitations in the evaluation environment.
Minor comments (2)
- Consider adding a sentence on the total number of tasks or domains to give readers a sense of scale.
- The related work section could more explicitly contrast AutomationBench with prior benchmarks like those for web navigation or API calling to highlight the unique combination of requirements.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper introducing AutomationBench. We provide point-by-point responses below and will revise the manuscript accordingly to address the concerns raised.
Point-by-point responses
Referee: The statement that 'even the best frontier models currently score below 10%' is presented without any details on the models tested, the number or distribution of tasks, or the specific success metrics used, which is load-bearing for the central claim about the gap in agentic capabilities.
Authors: We agree that the main claim would benefit from more immediate context. While detailed results are in Section 4 (including models: GPT-4o, Claude 3.5 Sonnet, Llama-3.1-405B; 150 tasks with domain distribution: 25 each in Sales, Marketing, etc.; success metric: programmatic check for correct final state in all involved systems), we will add a concise description of the evaluation setup in the abstract and a new paragraph in the introduction summarizing the experimental findings. This will make the central claim clearer without altering the reported numbers. revision: yes
Referee: The description of how tasks are drawn from Zapier workflow patterns lacks specifics on task selection criteria, the introduction of irrelevant/misleading records, and the exact nature of the layered business rules, making it unclear if the benchmark truly tests the claimed complexities or introduces artifacts that affect model performance.
Authors: We appreciate this feedback on clarity. Section 3.1 describes the curation process: tasks were selected from over 500 Zapier templates by criteria of involving at least three distinct applications and requiring data transformation. Irrelevant records are introduced by seeding each application's database with 30-50% non-target entries that could be returned by broad queries. Layered rules are defined in per-task policy documents that include conditional logic (e.g., 'if amount > 1000 then require approval'). We will add a dedicated subsection with pseudocode for task generation and examples of misleading records to the revised manuscript. revision: yes
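The conditional policy logic the authors cite (e.g. "if amount > 1000 then require approval") can be encoded as data-driven rules evaluated against a task record. The rule encoding and function name below are hypothetical, chosen only to illustrate the "layered rules" idea.

```python
def requires_approval(record: dict, rules: list[dict]) -> bool:
    """Return True if any conditional rule fires for this record.

    Each rule is a dict like {"field": "amount", "op": ">", "value": 1000},
    mirroring the example policy clause quoted in the rebuttal.
    """
    for rule in rules:
        field, op, threshold = rule["field"], rule["op"], rule["value"]
        value = record.get(field)
        if value is None:
            continue  # rule does not apply to records missing the field
        if op == ">" and value > threshold:
            return True
        if op == "==" and value == threshold:
            return True
    return False

rules = [{"field": "amount", "op": ">", "value": 1000}]
```

Because the rules are plain data, a benchmark harness can layer several of them per task and vary them across tasks without changing harness code.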
Referee: The programmatic end-state grading is described, but without addressing how the benchmark handles or excludes real-world complications like transient API failures or evolving policies, it is difficult to determine if the low scores indicate model shortcomings or limitations in the evaluation environment.
Authors: The design choice to use end-state grading in a stable environment is intentional to focus on the agent's ability to orchestrate workflows correctly rather than handle infrastructure issues. Transient failures are excluded by design, as the benchmark provides deterministic API responses; this is stated in Section 2.3. Evolving policies are not included because each task has a fixed policy document. We will expand Section 5 (Limitations) to explicitly discuss these scope decisions and why they were made, including references to how other benchmarks handle similar issues. revision: partial
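The deterministic environment the authors invoke can be pictured as each application's REST API being served from a fixed in-memory store, so identical runs see identical responses and transient failures cannot occur. The class and method names here are assumptions for illustration, not the benchmark's actual interface.

```python
class FakeRestApi:
    """In-memory stand-in for one application's REST API.

    Seeding the store with the same records on every run makes all
    responses deterministic, isolating agent errors from infrastructure
    flakiness -- the design choice the rebuttal defends.
    """

    def __init__(self, seed_records: dict[str, list[dict]]):
        self.tables = {name: list(rows) for name, rows in seed_records.items()}

    def get(self, resource: str) -> list[dict]:
        # Return a copy so callers cannot mutate the store accidentally.
        return list(self.tables.get(resource, []))

    def post(self, resource: str, record: dict) -> dict:
        self.tables.setdefault(resource, []).append(record)
        return record

# Example: a CRM seeded with one contact; the agent writes a second.
crm = FakeRestApi({"contacts": [{"name": "Acme"}]})
crm.post("contacts", {"name": "Globex"})
```

An end-state grader can then read the final tables directly, which is what makes purely programmatic grading feasible.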
Circularity Check
No circularity: empirical benchmark with external grounding
Full rationale
The paper introduces AutomationBench as a new empirical benchmark drawn from real Zapier workflow patterns across domains, with tasks requiring endpoint discovery, policy adherence, and end-state programmatic grading. No derivations, equations, fitted parameters, predictions, or self-citations appear in the provided text or abstract. Performance results (e.g., frontier models scoring below 10%) are presented as direct measurements on the benchmark rather than quantities derived from or equivalent to its own inputs by construction. The central claims rest on external platform data and independent evaluation, making the work self-contained against the circularity criteria.
Reference graph
Works this paper leans on
- [1] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, G. Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR, 2024.
- [2] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, Y. Su. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS, 2023.
- [3] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, T. Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS, 2024.
- [4] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, M. Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. ICLR, 2024.
- [5] M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, Y. Li. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. EMNLP, 2023.
- [6] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, N. Balasubramanian. AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. ACL, 2024.
- [7] S. Yao, N. Shinn, P. Razavi, K. Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045, 2024.
- [8] V. Barres, H. Dong, S. Ray, X. Si, K. Narasimhan. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982, 2025.
- [9]
- [10] AutomationBench Leaderboard. https://zapier.com/benchmarks
- [11] AutomationBench Public Repository. https://github.com/zapier/AutomationBench
- [12] Prime Intellect Environment. https://app.primeintellect.ai/dashboard/environments/zapier/AutomationBench