pith. machine review for the scientific record.

arxiv: 2604.15715 · v1 · submitted 2026-04-17 · 💻 cs.CL · cs.AI

Recognition: unknown

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords tool agents · AI benchmarks · workflow evaluation · atomic tool use · open-ended tasks · agent frameworks · capability assessment
0 comments

The pith

Frontier models drop from under 50% success on atomic tool tasks to just 14% on open-ended workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new benchmark that tests AI agents on realistic tool use, from quick single actions to full multi-step productivity tasks drawn from real user needs. It demonstrates a steep performance cliff: leading models that already struggle with basic operations largely fail when tasks demand sustained coordination and end-to-end delivery. A checkpoint system breaks complex goals into smaller, verifiable parts so that success can be scored consistently for both the underlying model and the surrounding execution setup. If accurate, this shows that current agents remain limited for autonomous work and that execution-framework design matters in addition to raw model power.

Core claim

GTA-2 establishes that frontier models score below 50 percent on short, atomic tool-use tests and reach only 14.39 percent success on long-horizon, open-ended workflows. Its recursive checkpoint mechanism decomposes deliverables into checkable sub-goals, allowing direct comparison of model ability and agent-harness performance. Experiments further show that checkpoint feedback during runs lifts results, and that dedicated frameworks such as Manus and OpenClaw substantially improve workflow completion beyond what model capacity alone provides.

What carries the argument

The recursive checkpoint-based evaluation mechanism that decomposes open-ended objectives into verifiable sub-goals for objective measurement of both models and execution frameworks.
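As a rough illustration of how such a decomposition can be scored, the sketch below builds a small recursive checkpoint tree and averages leaf verdicts upward. The class name, the equal weighting, and the toy task are assumptions for exposition, not the paper's released implementation, which may weight or threshold checkpoints differently.

# Hypothetical sketch of a recursive checkpoint tree: each leaf is a
# verifiable sub-goal, each internal node averages its children.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Checkpoint:
    description: str                       # e.g. "XLSX contains per-month totals"
    passed: Optional[bool] = None          # leaf verdict from a judge or verifier
    children: List["Checkpoint"] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of verifiable sub-goals satisfied beneath this node."""
        if not self.children:              # leaf: judged directly
            return 1.0 if self.passed else 0.0
        return sum(c.score() for c in self.children) / len(self.children)

# Toy workflow: "produce a quarterly sales report"
root = Checkpoint("Deliver quarterly sales report", children=[
    Checkpoint("Collect data", children=[
        Checkpoint("Query sales API for Q1", passed=True),
        Checkpoint("Save raw records to CSV", passed=True),
    ]),
    Checkpoint("Produce deliverables", children=[
        Checkpoint("XLSX with per-month totals", passed=True),
        Checkpoint("Bar chart image of monthly revenue", passed=False),
    ]),
])

print(f"Root score: {root.score():.2f}")   # 0.75 under equal weighting

A binary root success rate of the kind reported in the paper could then be obtained by thresholding this score, which is presumably what the threshold sensitivity analysis in Figure 8 varies.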

If this is right

  • Checkpoint-guided feedback during execution measurably improves workflow completion rates (a minimal loop illustrating this is sketched after this list).
  • Specialized agent frameworks boost performance on complex tasks beyond what model capacity alone provides.
  • The gap between atomic and workflow results underscores the need for stronger long-horizon planning in agent systems.
  • Real user queries and live tools expose coordination failures that synthetic benchmarks miss.
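The first bullet's mechanism can be made concrete with a minimal, hedged sketch: after each agent step, pending checkpoints are re-verified and the still-unmet ones are fed back into the next prompt. run_with_checkpoint_feedback, run_agent_step, and the per-checkpoint verifier callables are hypothetical stand-ins, not the benchmark's actual API.

# Sketch of checkpoint-guided feedback during execution (assumed interfaces).
from typing import Callable, Dict, List

def run_with_checkpoint_feedback(
    task: str,
    checkpoints: Dict[str, Callable[[dict], bool]],  # name -> verifier over workspace state
    run_agent_step: Callable[[str, dict], dict],      # (prompt, state) -> updated state
    max_steps: int = 20,
) -> dict:
    state: dict = {"task": task, "artifacts": {}}
    unmet: List[str] = list(checkpoints)

    for _ in range(max_steps):
        # Surface unmet sub-goals so the model can plan toward them.
        prompt = f"{task}\nUnmet checkpoints: {', '.join(unmet) or 'none'}"
        state = run_agent_step(prompt, state)

        # Re-verify after the step; keep only the checkpoints still unmet.
        unmet = [name for name in unmet if not checkpoints[name](state)]
        if not unmet:
            break  # all sub-goals satisfied

    state["unmet_checkpoints"] = unmet
    return state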

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training focused only on isolated tool calls may leave agents unprepared for sequential real-world demands.
  • Execution harness design emerges as a high-leverage research direction separate from scaling foundation models.
  • The benchmark implies current systems function best as supervised assistants rather than fully autonomous workers.

Load-bearing premise

The checkpoint decomposition fully captures what matters for real-world task quality without missing key aspects or adding evaluator bias.

What would settle it

Independent human review of the same workflow outputs that produces success rates substantially different from the checkpoint scores.
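One hedged way to run that test: score the same workflow outputs with the checkpoint mechanism and with independent human reviewers, then compare the per-model rates and rankings. All numbers below are placeholders, and scipy's spearmanr is just one reasonable statistic for the ranking comparison.

# Sketch of the settling experiment: checkpoint-derived vs. human-judged success.
from scipy.stats import spearmanr

models             = ["model_a", "model_b", "model_c", "model_d"]
checkpoint_success = [0.14, 0.11, 0.08, 0.05]   # per-model Root SR (placeholder values)
human_success      = [0.18, 0.10, 0.09, 0.04]   # independent human review (placeholder values)

rho, p_value = spearmanr(checkpoint_success, human_success)
gap = sum(abs(c - h) for c, h in zip(checkpoint_success, human_success)) / len(models)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f}), mean absolute gap = {gap:.3f}")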

Figures

Figures reproduced from arXiv: 2604.15715 by Cailian Chen, Dacheng Tao, Jize Wang, Songyang Zhang, Xinping Guan, Xinyi Le, Xuanxuan Liu, Yijun Wang, Yining Li, Zifei Shan.

Figure 1
Figure 1: The hierarchical framework of GTA-2. GTA-Atomic (top) evaluates foundational tool-use precision through short-horizon, closed-ended tasks. GTA-Workflow (bottom) introduces long-horizon, open-ended productivity tasks. The benchmark utilizes real user queries, real deployed tools, and multimodal context inputs. For open-ended workflows, a recursive checkpoint-based mechanism is proposed for verifiable evaluation. view at source ↗
Figure 3
Figure 3: Statistics of GTA-Atomic and GTA-Workflow. view at source ↗
Figure 5
Figure 5: Task difficulty analysis of GTA-Workflow w.r.t. deliverable types. Struct. Data is short for Structured Data, which contains CSV, XLSX, and JSON; Multimedia includes the modalities of image, audio, and video. view at source ↗
Figure 6
Figure 6: Model performance breakdown across 6 real-world categories in GTA-Workflow. view at source ↗
Figure 7
Figure 7: Model efficiency comparison in GTA-Workflow tasks. view at source ↗
Figure 8
Figure 8: Sensitivity of Root SR to different thresholds. view at source ↗
Figure 9
Figure 9: The ReAct-style prompt template for the agent system. view at source ↗
Figure 10
Figure 10: The prompt for raw query and checkpoint generation in GTA-Workflow. view at source ↗
Figure 11
Figure 11: Query classification prompt in GTA-Workflow task construction. view at source ↗
Figure 12
Figure 12: Refinement prompt in GTA-Workflow task construction (1/2). view at source ↗
Figure 13
Figure 13: Refinement prompt in GTA-Workflow task construction (2/2). view at source ↗
Figure 14
Figure 14: Augmentation prompt in GTA-Workflow task construction (1/2). view at source ↗
Figure 15
Figure 15: Augmentation prompt in GTA-Workflow task construction (2/2). view at source ↗
Figure 16
Figure 16: Task rewriting prompt of GTA-Workflow. view at source ↗
Figure 17
Figure 17: Checkpoint regeneration prompt of GTA-Workflow. view at source ↗
Figure 18
Figure 18: Human quality control guidelines for GTA-Workflow tasks. view at source ↗
Figure 19
Figure 19: The prompt for the LLM judge in GTA-Workflow. view at source ↗
Figure 20
Figure 20: The classification prompt used for three-level failure decomposition. view at source ↗
Figure 21
Figure 21: An example of an objective query in GTA-Atomic. The final answer is a uniquely determined number or phrase. view at source ↗
Figure 22
Figure 22: An example of a subjective query in GTA-Atomic. The final answer is usually descriptive text and is not unique. view at source ↗
Figure 23
Figure 23: An example of an image generation query in GTA-Atomic. The final answer is none, since the generated image is not evaluated. view at source ↗
Figure 24
Figure 24: Task examples of GTA-Workflow from different categories. For presentation purposes, both the queries and checkpoints are condensed. view at source ↗
read the original abstract

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GTA-2, a hierarchical benchmark for general tool agents spanning GTA-Atomic (short-horizon, closed-ended tool-use precision inherited from prior work) and GTA-Workflow (long-horizon, open-ended tasks using real user queries, deployed tools, and multimodal contexts). It proposes a recursive checkpoint-based evaluation that decomposes objectives into verifiable sub-goals for unified assessment of models and execution frameworks. Experiments report frontier models below 50% success on atomic tasks and only 14.39% on workflows, with checkpoint-guided feedback and frameworks such as Manus and OpenClaw improving results, highlighting the role of execution harnesses.

Significance. If the evaluation methodology proves reliable, the work is significant for exposing a capability gap between atomic tool use and realistic productivity workflows, providing empirical guidance for agent and framework development beyond model scaling. The open release of dataset and code supports reproducibility and further research in tool-agent benchmarks.

major comments (3)
  1. [§3.2] §3.2 (GTA-Workflow and recursive checkpoint evaluation): The central claim of a 'pronounced capability cliff' (14.39% workflow success) depends on the checkpoint mechanism accurately measuring open-ended deliverables, yet the manuscript provides no inter-annotator agreement scores, no details of the checkpoint creation process, and no validation against missed task aspects, leaving the weakest assumption unaddressed (an illustrative agreement check is sketched after these major comments).
  2. [§2.1] §2.1 (task and tool construction): Real-world authenticity is asserted via real user queries and deployed tools, yet no quantitative metrics are given for tool authenticity validation, query representativeness, or coverage of real-world scenarios; this directly affects the generalizability of the reported performance gaps.
  3. [Table 2] Table 2 and §4.3 (framework comparisons): Improvements from Manus and OpenClaw are presented as evidence for harness importance, but without ablations isolating checkpoint feedback effects from framework design or controls for evaluator bias in the recursive process, the attribution to execution harnesses beyond model capacity remains under-supported.
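To make the first major comment concrete, the kind of agreement evidence being requested could be reported as below. The two verdict vectors are invented placeholders, and scikit-learn's cohen_kappa_score is just one convenient choice of statistic for per-checkpoint pass/fail labels from two independent judges.

# Illustrative inter-judge agreement check on per-checkpoint verdicts.
from sklearn.metrics import cohen_kappa_score

judge_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]  # 1 = checkpoint judged passed
judge_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa: {kappa:.2f}")       # values near 1.0 indicate strong agreement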
minor comments (2)
  1. [Figure 1] Figure 1: The workflow diagram would benefit from clearer labeling of the recursive checkpoint loop and how it interfaces with the execution harness.
  2. [§4.1] §4.1: Specific success rates for individual frontier models on GTA-Atomic (beyond the 'below 50%' aggregate) should be reported in the main text or a dedicated table for direct comparison with workflow results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that real user queries and deployed tools constitute authentic evaluation contexts, plus standard assumptions about checkpoint verifiability; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Real user queries and deployed tools provide authentic contexts for evaluating agent performance
    Invoked to justify the benchmark's design as addressing misalignment with real-world requirements.

pith-pipeline@v0.9.0 · 5605 in / 1249 out tokens · 28979 ms · 2026-05-10T08:37:24.548370+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.
