GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Pith reviewed 2026-05-10 08:37 UTC · model grok-4.3
The pith
Frontier models already score under 50% on atomic tool tasks and manage just 14.39% success on open-ended workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GTA-2 shows that frontier models already score below 50 percent on short, atomic tool-use tasks and reach only 14.39 percent success on long-horizon, open-ended workflows. Its recursive checkpoint mechanism decomposes deliverables into checkable sub-goals, allowing direct comparison of model capability and agent-harness performance. Experiments further show that checkpoint feedback during runs lifts results, and that dedicated frameworks such as Manus and OpenClaw deliver larger gains than model upgrades alone.
What carries the argument
The recursive checkpoint-based evaluation mechanism that decomposes open-ended objectives into verifiable sub-goals for objective measurement of both models and execution frameworks.
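The paper does not publish the mechanism's internals here, but the idea it describes — decompose an open-ended objective into a tree of verifiable sub-goals and aggregate upward — can be sketched minimally. The class names, the example task, and the aggregation rule (unweighted mean of children) are all assumptions for illustration, not the paper's specification:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    """One node in the checkpoint tree: a leaf is a verifiable sub-goal,
    an internal node groups nested sub-goals (names are hypothetical)."""
    description: str
    passed: bool = False                       # verdict for a leaf sub-goal
    children: list["Checkpoint"] = field(default_factory=list)

def score(cp: Checkpoint) -> float:
    """Leaf: 1.0 if the sub-goal passed, else 0.0.
    Internal node: mean of its children's recursive scores."""
    if not cp.children:
        return 1.0 if cp.passed else 0.0
    return sum(score(c) for c in cp.children) / len(cp.children)

# Invented example: a report-writing workflow split into nested sub-goals.
task = Checkpoint("write market report", children=[
    Checkpoint("collect data", children=[
        Checkpoint("fetch prices", passed=True),
        Checkpoint("fetch news", passed=False),
    ]),
    Checkpoint("render charts", passed=True),
])
print(score(task))  # 0.75
```

Because both a bare model and a full execution harness produce the same kind of deliverable, the same tree can score either — which is what makes the unified comparison possible.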
If this is right
- Checkpoint-guided feedback during execution measurably improves workflow completion rates.
- Specialized agent frameworks boost performance on complex tasks beyond what model capacity alone provides.
- The gap between atomic and workflow results underscores the need for stronger long-horizon planning in agent systems.
- Real user queries and live tools expose coordination failures that synthetic benchmarks miss.
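The first consequence above — checkpoint-guided feedback improving completion — can be made mechanical with a toy loop: after each agent step, the still-unmet checkpoints are fed back as a hint for the next step. Everything here (`run_step`, the checkpoint names, one-goal-per-step progress) is invented for illustration, not taken from the paper:

```python
def run_with_feedback(run_step, checkpoints, max_steps=10):
    """run_step(hint) returns the set of checkpoints satisfied that step."""
    done = set()
    for _ in range(max_steps):
        missing = [c for c in checkpoints if c not in done]
        if not missing:
            break  # every sub-goal verified; stop early
        # Feedback: tell the agent exactly which sub-goals remain.
        hint = "Still unmet: " + ", ".join(missing)
        done |= run_step(hint)
    return done

# Toy "agent" that knocks out one checkpoint per step, in order.
cps = ["fetch data", "clean data", "plot chart"]
steps = iter(cps)
progress = run_with_feedback(lambda hint: {next(steps)}, cps)
print(sorted(progress))  # all three sub-goals eventually met
```

The design point is that the checkpoints serve double duty: the same sub-goals used for scoring become an execution-time progress signal.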
Where Pith is reading between the lines
- Training focused only on isolated tool calls may leave agents unprepared for sequential real-world demands.
- Execution harness design emerges as a high-leverage research direction separate from scaling foundation models.
- The benchmark implies current systems function best as supervised assistants rather than fully autonomous workers.
Load-bearing premise
The checkpoint decomposition fully captures what matters for real-world task quality without missing key aspects or adding evaluator bias.
What would settle it
Independent human review of the same workflow outputs that produces success rates substantially different from the checkpoint scores.
Original abstract
The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GTA-2, a hierarchical benchmark for general tool agents spanning GTA-Atomic (short-horizon, closed-ended tool-use precision inherited from prior work) and GTA-Workflow (long-horizon, open-ended tasks using real user queries, deployed tools, and multimodal contexts). It proposes a recursive checkpoint-based evaluation that decomposes objectives into verifiable sub-goals for unified assessment of models and execution frameworks. Experiments report frontier models below 50% success on atomic tasks and only 14.39% on workflows, with checkpoint-guided feedback and frameworks such as Manus and OpenClaw improving results, highlighting the role of execution harnesses.
Significance. If the evaluation methodology proves reliable, the work is significant for exposing a capability gap between atomic tool use and realistic productivity workflows, providing empirical guidance for agent and framework development beyond model scaling. The open release of dataset and code supports reproducibility and further research in tool-agent benchmarks.
major comments (3)
- §3.2 (GTA-Workflow and recursive checkpoint evaluation): The central claim of a 'pronounced capability cliff' (14.39% workflow success) depends on the checkpoint mechanism accurately measuring open-ended deliverables, yet the manuscript reports no inter-annotator agreement, no details of the checkpoint-creation process, and no validation against missed task aspects, leaving its weakest assumption unaddressed.
- §2.1 (task and tool construction): Real-world authenticity is asserted via real user queries and deployed tools, yet no quantitative metrics are given for tool-authenticity validation, query representativeness, or coverage of real-world scenarios; this directly affects the generalizability of the reported performance gaps.
- Table 2 and §4.3 (framework comparisons): Improvements from Manus and OpenClaw are presented as evidence of harness importance, but without ablations isolating checkpoint-feedback effects from framework design, or controls for evaluator bias in the recursive process, the attribution to execution harnesses rather than model capacity remains under-supported.
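The missing validation flagged in the first comment would typically be reported as an inter-annotator agreement statistic over checkpoint verdicts. A minimal Cohen's kappa over two annotators' pass/fail labels (the labels below are invented) shows what such a number would look like:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement under independent labeling with each
    # annotator's marginal label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts on eight checkpoints.
ann1 = [1, 1, 0, 1, 0, 0, 1, 1]
ann2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.467
```

Reporting kappa per checkpoint level would let readers judge whether the recursive decomposition is measuring deliverable quality or annotator idiosyncrasy.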
minor comments (2)
- Figure 1: The workflow diagram would benefit from clearer labeling of the recursive checkpoint loop and how it interfaces with the execution harness.
- §4.1: Specific success rates for individual frontier models on GTA-Atomic (beyond the 'below 50%' aggregate) should be reported in the main text or a dedicated table for direct comparison with workflow results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real user queries and deployed tools provide authentic contexts for evaluating agent performance.