TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Chenyang Si; Haochen Yin; Wenhui Dong; Yilang Tan; Yuwen Qu; Zhiqiang Liu

arxiv: 2605.16909 · v1 · pith:T2YNBZM6new · submitted 2026-05-16 · 💻 cs.AI

Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si

Pith reviewed 2026-05-19 20:39 UTC · model grok-4.3

classification 💻 cs.AI

keywords tool-using agents · omni-modal benchmark · closed-loop evaluation · multimodal verification · MCP servers · customer service tasks · intelligent creation · agent performance gap

The pith

Current AI agents succeed on only 32 percent of realistic omni-modal tool tasks while humans reach 94 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MM-ToolBench as a benchmark for end-to-end omni-modal tool use in professional settings. It argues that prior evaluations test tool use, computer use, and multimodal reasoning in isolation and therefore miss the integrated requirements of real workflows. The benchmark supplies 100 executable tasks across customer service and intelligent creation, backed by 27 servers offering 324 tools and by task-specific grounded evaluators. Agents must run tools, inspect intermediate artifacts such as rendered outputs, and revise their actions when those artifacts fail the requirements. Experiments show that even strong models fall far short of human performance under this closed-loop protocol.

Core claim

MM-ToolBench evaluates agents on 100 tasks drawn from two macro families and twenty subcategory slices, each supported by grounded evaluators that verify multimodal artifacts after tool execution. Agents operate in a closed loop: they receive multimodal inputs, invoke tools via MCP servers, inspect the resulting artifacts, and self-correct before producing a final result. The benchmark construction uses a semi-automated pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. On this suite the strongest tested model reaches 32.0 percent task success against a 94.0 percent human baseline.

What carries the argument

Closed-loop multimodal verification, in which agents must execute tools through MCP servers, inspect rendered or transformed artifacts, and self-correct according to task-specific grounded evaluators.
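The act, inspect, revise cycle described here can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's harness: `closed_loop`, `Verdict`, `toy_act`, and `toy_inspect` are hypothetical names, the "tool" is a stand-in function rather than an MCP server call, and the artifact is a plain string instead of a rendered file.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the inspection found wrong; empty when it passed


def closed_loop(
    act: Callable[[Optional[str]], str],
    inspect: Callable[[str], Verdict],
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    """Repeat act -> inspect until the artifact passes or rounds run out."""
    feedback: Optional[str] = None
    history: List[Verdict] = []
    artifact = ""
    for _ in range(max_rounds):
        artifact = act(feedback)      # hypothetical tool call producing an artifact
        verdict = inspect(artifact)   # look at the produced artifact
        history.append(verdict)
        if verdict.passed:
            break
        feedback = verdict.feedback   # carry the failure into the next attempt
    return artifact, history


# Toy stand-ins (assumptions for illustration only).
def toy_act(feedback: Optional[str]) -> str:
    # First attempt uses the wrong size; any feedback triggers the fix.
    return "title_pt=32" if feedback else "title_pt=28"


def toy_inspect(artifact: str) -> Verdict:
    ok = artifact == "title_pt=32"
    return Verdict(ok, "" if ok else "title must be 32 pt")


artifact, history = closed_loop(toy_act, toy_inspect)
print(artifact, [v.passed for v in history])
```

In this toy run the first attempt fails inspection and the second passes, which is the behavior the protocol is meant to require: an agent that never inspects its artifact would stop at the first, incorrect output.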

If this is right

Agents must coordinate multimodal perception, tool invocation, and iterative revision inside a single workflow rather than in separate stages.
Evaluation harnesses can scale through MCP-based execution paired with automated grounded evaluators and light human audit.
Performance gaps between models and humans remain large even for coding-strong models, indicating that closed-loop inspection is a distinct capability bottleneck.
The two macro task families and twenty slices provide structured coverage for measuring progress across professional domains.
Future agents developed against this benchmark will need explicit mechanisms for artifact inspection and correction to approach human-level results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the benchmark to additional professional domains such as software engineering or scientific data analysis would test whether the observed gaps generalize.
Comparing agent scores with and without the artifact-inspection step would isolate how much the closed-loop requirement drives the performance drop.
Linking MM-ToolBench tasks to existing isolated tool-use or multimodal benchmarks could reveal which component skills transfer and which do not.
Iterating the semi-automated construction pipeline on new task families could produce larger or more diverse test sets without proportional human effort.
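The with-and-without-inspection comparison suggested above could be tallied per task, as in the sketch below. Everything here is an assumption for illustration: `paired_ablation` is a hypothetical helper, and the task IDs and outcomes are placeholder values, not results from the paper.

```python
from typing import Dict


def paired_ablation(
    with_inspection: Dict[str, bool],
    without_inspection: Dict[str, bool],
) -> Dict[str, float]:
    """Compare per-task success for the same agent run with and without inspection."""
    shared = sorted(set(with_inspection) & set(without_inspection))
    if not shared:
        raise ValueError("the two runs share no task IDs")
    n = len(shared)
    rate_with = sum(with_inspection[t] for t in shared) / n
    rate_without = sum(without_inspection[t] for t in shared) / n
    # Discordant tasks show where inspection changed the outcome.
    only_with = sum(with_inspection[t] and not without_inspection[t] for t in shared)
    only_without = sum(without_inspection[t] and not with_inspection[t] for t in shared)
    return {
        "rate_with": rate_with,
        "rate_without": rate_without,
        "delta": rate_with - rate_without,
        "solved_only_with": only_with,
        "solved_only_without": only_without,
    }


# Placeholder outcomes for four hypothetical tasks.
run_with = {"t1": True, "t2": True, "t3": False, "t4": False}
run_without = {"t1": True, "t2": False, "t3": False, "t4": False}
summary = paired_ablation(run_with, run_without)
print(summary)
```

Reporting the discordant counts alongside the rate difference shows whether inspection rescues tasks or merely reshuffles which ones succeed.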

Load-bearing premise

The selected 100 tasks and their evaluators accurately represent the essential demands of real-world professional omni-modal tool use without artificial simplifications or selection biases.

What would settle it

A contemporary agent reaching above 80 percent task success on the full set of 100 tasks while still performing explicit artifact inspection and self-correction would indicate the benchmark no longer exposes the claimed limitations.
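With only 100 tasks, each percentage point corresponds to one task, so it helps to see how much sampling uncertainty sits around the reported rates. The sketch below computes a Wilson score interval under two assumptions that the source does not state: that each task is an independent pass/fail trial, and that 32.0% and 94.0% correspond to 32 and 94 solved tasks in a single pass over the suite. `wilson_interval` is an illustrative helper, not part of the benchmark.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


model_lo, model_hi = wilson_interval(32, 100)   # assumed 32 of 100 tasks solved
human_lo, human_hi = wilson_interval(94, 100)   # assumed 94 of 100 tasks solved

# "Above 80 percent" on 100 tasks means at least 81 tasks solved.
threshold_tasks = 81

print(f"model: [{model_lo:.3f}, {model_hi:.3f}]")
print(f"human: [{human_lo:.3f}, {human_hi:.3f}]")
print("intervals overlap:", model_hi >= human_lo)
```

Under these assumptions the two intervals do not overlap, so the model-human gap is not a small-sample artifact, although differences of a few points between models on a 100-task suite would be much harder to separate.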

Figures

Figures reproduced from arXiv: 2605.16909 by Chenyang Si, Haochen Yin, Wenhui Dong, Yilang Tan, Yuwen Qu, Zhiqiang Liu.

**Figure 2.** Overview of the TOBench construction pipeline. TOBench organizes tasks into two macro …

**Figure 3.** Distribution of tools, task difficulty, and task scenarios in TOBench. Left: the number of …

**Figure 4.** Overview of the evaluation pipeline in TOBench. Rather than using one benchmark-wide …

**Figure 5.** Heatmap of model-specific error counts across fine-grained subcategories, grouped into tool call, tool parameter, multimodal ability, and self-verification failures. Tool call and parameter errors remain the most pervasive execution bottleneck. Many trajectories fail before high-level reasoning becomes relevant: models choose the wrong tool, omit a required operation, hallucinate unsupported actions, or …

**Figure 6.** Two examples of failed tasks with initial and final states are presented: the first agent …

**Figure 7.** Rough floor plan for the Johnson-Smith wedding

**Figure 8.** Comparison of output results: (a) Ground truth for the signature placement task; (b) Final …

**Figure 9.** Initial and output files for the signature placement task: (a) Original lease agreement image …

**Figure 10.** Ground truth screenshot displaying the work hours record interface.

Original abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MM-ToolBench adds a workable closed-loop verification harness for omni-modal tool agents and shows clear model-human gaps, but the narrow task families make the real-world claims hard to trust without more validation.

The main point is that this paper builds a benchmark requiring agents to run tools, inspect multimodal outputs like rendered artifacts, and self-correct in a loop. It reports Claude Opus 4.6 at 32% success against a 94% human baseline across 100 tasks, which gives a concrete signal that current agents still struggle with integrated professional workflows. The setup uses 27 MCP servers and grounded evaluators, which is a step toward scalable testing of end-to-end tool use rather than isolated calls or static reasoning tests. The semi-automated pipeline for task discovery and evaluator synthesis is practical and helps make the whole thing reproducible enough to check against. That part is useful for anyone building or evaluating agent systems that need feedback from tool results. The paper does a decent job making the evaluation executable and tied to specific requirements instead of loose scoring.

On the soft spots, the task selection from only two macro families and 20 subcategories is the clearest limitation. Without coverage stats or external checks on how well these scenarios match broader professional omni-modal work, the performance gap could partly reflect the chosen constraints or evaluator tolerances rather than pure agent shortcomings. The human audit is mentioned but lacks detail on inter-rater reliability or realism validation, so the central claim about closing the gap to real-world use rests on thinner ground than the numbers suggest. This is not a deal-breaker for a benchmark paper, but it does mean readers should treat the difficulty ranking as provisional until the task set expands or gets independent review.

The work is aimed at researchers testing multimodal agents and tool-use frameworks. People running agent experiments or designing new evaluations will get value from the harness and protocol even if they adapt the tasks.

It deserves peer review because the methods are specific enough to critique and the results are falsifiable with the described setup.

Referee Report

1 major / 1 minor

Summary. The paper introduces MM-ToolBench, a benchmark of 100 executable tasks drawn from two macro families (Customer Service and Intelligent Creation) that span 20 subcategory slices and are backed by 27 MCP servers containing 324 tools. The central contribution is a closed-loop multimodal verification harness that requires agents to execute tools, inspect rendered artifacts, and self-correct against task-specific grounded evaluators constructed via a semi-automated pipeline. Experiments on 15 contemporary agentic models report that even Claude Opus 4.6 reaches only 32.0% task success, well below the 94.0% human baseline, and position the benchmark as a practical foundation for advancing real-world omni-modal tool use.

Significance. If the task set and evaluators prove representative, the work supplies a concrete, scalable evaluation framework that directly targets the gap between isolated tool-use or multimodal benchmarks and end-to-end professional workflows. The provision of executable tasks, grounded evaluators, and human-audited scenarios offers a reproducible testbed that could accelerate progress on closed-loop agents capable of artifact inspection and iterative correction.

major comments (1)

[Abstract] The headline claim that current models remain highly challenged (Claude Opus 4.6 at 32.0% vs. 94.0% human) rests on the assumption that the 100 tasks faithfully proxy real-world closed-loop omni-modal requirements. The semi-automated pipeline covers only two macro families and 20 subcategory slices, and the paper reports no coverage statistics, external workflow validation, or inter-rater checks on scenario realism; without these, the observed gap could arise from unrepresentative task constraints or evaluator tolerances rather than intrinsic agent limitations.
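To make the concern about evaluator tolerances concrete, the sketch below shows how one and the same output can pass or fail depending on a single threshold. It is a hypothetical placement check, loosely modeled on a task where an element must land on a target region; the `Box` type, `placement_ok` function, coordinates, and tolerance values are all assumptions and do not reproduce any evaluator from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # width
    h: float  # height

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def placement_ok(pred: Box, target: Box, tol_px: float) -> bool:
    """Pass when the predicted center lies within tol_px of the target on both axes."""
    (px, py), (tx, ty) = pred.center, target.center
    return abs(px - tx) <= tol_px and abs(py - ty) <= tol_px


target = Box(100, 500, 200, 40)   # hypothetical target region, center (200, 520)
pred = Box(112, 506, 200, 40)     # hypothetical agent output, center (212, 526)

strict = placement_ok(pred, target, tol_px=5)     # 12 px horizontal offset exceeds 5
lenient = placement_ok(pred, target, tol_px=20)   # same output is within 20
print(strict, lenient)
```

Because the verdict flips with the threshold alone, reporting the tolerances used by each evaluator, or success rates at several tolerance levels, would help readers judge how much of the gap depends on those choices.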

minor comments (1)

[Abstract] The acronym MCP appears without expansion on first use; define it explicitly in the abstract and introduction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential of MM-ToolBench as an evaluation framework. We address the single major comment below and clarify how the manuscript supports the headline performance claims while committing to targeted revisions for greater transparency.

Referee: [Abstract] The headline claim that current models remain highly challenged (Claude Opus 4.6 at 32.0% vs. 94.0% human) rests on the assumption that the 100 tasks faithfully proxy real-world closed-loop omni-modal requirements. The semi-automated pipeline covers only two macro families and 20 subcategory slices, and the paper reports no coverage statistics, external workflow validation, or inter-rater checks on scenario realism; without these, the observed gap could arise from unrepresentative task constraints or evaluator tolerances rather than intrinsic agent limitations.

Authors: We agree that explicit documentation of task coverage and validation procedures strengthens the interpretation of the performance gap. The manuscript already describes a semi-automated pipeline that includes human audit by domain experts for each of the 100 tasks to ensure alignment with realistic Customer Service and Intelligent Creation workflows. In the revised manuscript we will add a dedicated subsection detailing (i) the distribution of the 20 subcategory slices across the two macro families, (ii) the number of candidate scenarios reviewed during construction, and (iii) the audit criteria and pass/fail outcomes from the human review step. These additions will make the representativeness arguments more quantitative without altering the core experimental results. We maintain that the 32.0% vs. 94.0% gap is not an artifact of overly narrow constraints, because the same grounded evaluators and execution harness were used for both agents and humans, and the human ceiling was reached only after iterative artifact inspection and correction, which is precisely the closed-loop behavior the benchmark targets.

revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivational circularity

The paper introduces MM-ToolBench as a new empirical evaluation harness consisting of 100 tasks, grounded evaluators, and a semi-automated construction pipeline over MCP servers. No mathematical derivations, first-principles results, or predictions are claimed; the reported performance figures (such as 32.0% success for Claude Opus 4.6 versus 94.0% human) are direct measurements on the defined task set rather than outputs that reduce to fitted inputs or self-citations by construction. The central premise addresses a gap between existing isolated benchmarks and closed-loop omni-modal use, but this positioning rests on the explicit task families, subcategory slices, and human audit steps described, without any load-bearing reduction to prior author work or tautological redefinition. The paper is therefore self-contained as an evaluation artifact whose validity is assessed externally against the realism of its scenarios.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the chosen tasks and automatic evaluators faithfully represent real-world omni-modal tool use; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The 100 tasks across customer service and intelligent creation domains, supported by 27 MCP servers, sufficiently cover realistic professional workflows.
This premise is required for the performance gap to generalize beyond the benchmark itself.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

TOBench contains 100 executable tasks... closed-loop multimodal verification... agents must execute tools, inspect rendered or transformed artifacts, and self-correct

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · 7 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Y ao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ -bench: A benchmark for tool-agent-user interaction in real-world domains. ArXiv, abs/2406.12045, 2024. URL https://api.semanticscholar.org/CorpusID:270562578

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Y ujia Qin, Shihao Liang, Yining Y e, Kunlun Zhu, Lan Y an, Y axi Lu, Y ankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. In The twelfth international conference on learning representations , 2023

work page 2023
[5]

The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

Shishir G Patil, Huanzhi Mao, Fanjia Y an, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In F orty-second International Conference on Machine Learning, 2025

work page 2025
[6]

Tooltalk: Evaluating tool-usage in a conversational setting

Nicholas Farn and Richard Shin. Tooltalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775, 2023

work page arXiv 2023
[7]

The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution.arXiv preprint arXiv:2510.25726, 2025

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Y uxuan Cao, Y uzhen Huang, Wei Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025

work page arXiv 2025
[8]

Mcp-radar: A multi- dimensional benchmark for evaluating tool use capabilities in large language models

Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, and Chao Shen. Mcp-radar: A multi- dimensional benchmark for evaluating tool use capabilities in large language models. arXiv preprint arXiv:2505.16700, 2025

work page arXiv 2025
[9]

Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers

Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Y ujia Bao, et al. Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers. arXiv preprint arXiv:2508.20453 , 2025

work page arXiv 2025
[10]

Mcp-universe: Benchmark- ing large language models with real-world model context protocol servers

Ziyang Luo, Zhiqi Shen, Wenzhuo Y ang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. Mcp-universe: Benchmark- ing large language models with real-world model context protocol servers. arXiv preprint arXiv:2508.14704, 2025

work page arXiv 2025
[11]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024
[12]

Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications

Wei He, Y ueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. Vitabench: Benchmarking llm agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025

work page arXiv 2025
[13]

Mˆ 3-bench: Multi-modal, multi-hop, multi-threaded tool-using mllm agent benchmark

Y ang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Y e, Ligong Han, Can Jin, and Dimitris N Metaxas. Mˆ 3-bench: Multi-modal, multi-hop, multi-threaded tool-using mllm agent benchmark. arXiv preprint arXiv:2511.17729, 2025

work page arXiv 2025
[14]

Omnigaia: Towards native omni-modal ai agents

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Y uan Lu, et al. Omnigaia: Towards native omni-modal ai agents. arXiv preprint arXiv:2602.22897, 2026. 10

work page arXiv 2026
[15]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Y u, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems , 36:68539– 68551, 2023

work page 2023
[16]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Y ao, Jeffrey Zhao, Dian Y u, Nan Du, Izhak Shafran, Karthik Narasimhan, and Y uan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Gorilla: Large language model connected with massive apis

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems , 37: 126544–126565, 2024

work page 2024
[18]

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face

Y ongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Y ueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

work page 2023
[19]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Y ann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learn- ing Representations, 2023

work page 2023
[20]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2-bench: Eval- uating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark

Shiqing Fan, Xichen Ding, Liang Zhang, and Linjian Mo. Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark. arXiv preprint arXiv:2508.07575 , 2025

work page arXiv 2025
[22]

Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents

Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Y an, Si Liu, Wei Y e, and Fei Huang. Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents. arXiv preprint arXiv:2510.24563, 2025

work page arXiv 2025
[23]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Y u Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Y u Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 881–905, 2024

work page 2024
[25]

τ -voice: Bench- marking full-duplex voice agents on real-world domains

Soham Ray, Keshav Dhandhania, Victor Barres, and Karthik Narasimhan. τ -voice: Bench- marking full-duplex voice agents on real-world domains. arXiv preprint arXiv:2603.13686 , 2026

work page arXiv 2026
[26]

Mmdeepresearch-bench: A benchmark for multimodal deep research agents

Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, et al. Mmdeepresearch-bench: A benchmark for multimodal deep research agents. arXiv preprint arXiv:2601.12346, 2026

work page arXiv 2026
[27]

Visualagent bench: Towards large multimodal models as visual foundation agents

Xiao Liu, Tianjie Zhang, Y u Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Li, Hanlin Zhao, et al. Visualagent bench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327, 2024

work page arXiv 2024
[28]

Prosoftarena: Benchmarking hierarchical capabilities of multimodal agents in professional software environments

Jiaxin Ai, Y ukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li, Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, and Kaipeng Zhang. Prosoftarena: Benchmarking hierarchical capabilities of multimodal agents in professional software environments. arXiv preprint arXiv:2601.02399, 2025. 11

work page arXiv 2025
[29]

Mllm-tool: A multimodal large language model for tool agent learning

Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. Mllm-tool: A multimodal large language model for tool agent learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6678–6687. IEEE, 2025

work page 2025
[30]

Univa: Universal video agent towards open-source next-generation video generalist.arXiv preprint arXiv:2511.08521, 2025

Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Y uechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, et al. Univa: Universal video agent towards open-source next-generation video generalist. arXiv preprint arXiv:2511.08521, 2025

work page arXiv 2025
[31]

Less is more: Focus attention for efﬁcient detr

Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, and Y unhe Wang. Less is more: Focus attention for efﬁcient detr. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6674–6683, 2023

work page 2023
[32]

Fila-video: Spatio-temporal compression for ﬁne- grained long video understanding

Y anan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Y ang, Yingbo Wang, Y ang Du, Xianing Chen, and Bo Zheng. Fila-video: Spatio-temporal compression for ﬁne- grained long video understanding. arXiv preprint arXiv:2504.20384, 2025

work page arXiv 2025
[33]

Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Y u, Wei Peng, Xinbin Y uan, Yifei Bi, Ming Zhao, Zian Zhou, et al. Ivy-fake: A uniﬁed explainable framework and benchmark for image and video aigc detection. arXiv preprint arXiv:2506.00979, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Fila: Fine- grained vision language models

Shiding Zhu, Wenhui Dong, Jun Song, Yingbo Wang, Y anan Guo, and Bo Zheng. Fila: Fine- grained vision language models. arXiv preprint arXiv:2412.08378, 2024

work page arXiv 2024
[35]

Spinebench: A clinically salient, level- aware benchmark powered by the spinemed-450k corpus

Ming Zhao, Wenhui Dong, Y ang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Y unzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, et al. Spinebench: A clinically salient, level- aware benchmark powered by the spinemed-450k corpus. arXiv preprint arXiv:2510.03160 , 2025

work page arXiv 2025
[36]

Judging llm-as-a-judge with mt- bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Y onghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023. 12 A Catalog of MCP Servers We show all the MCP servers used in the TOBench in Table 3. M...

work page 2023
[37]

After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed

Visual Consistency: When designing the hotel’s welcome-themed PPTX, you should ﬁrst review the template ﬁle example.pptx and then design the PPTX according to the template’s layout. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed

work page
[39]

The Greeting section text should be set to 24 pt

PPTx details: The text for the Welcome Title should be set to 32 pt. The Greeting section text should be set to 24 pt. The provided hotel_logo.png must be inserted at the bottom of the slide, and the image should not be overly eye-catching. ## Operational Guidelines

work page
[40]

• Iterative Execution : Execute your plan step-by-step

Task Processing Protocol • Analyze & Plan : Upon receiving a request, explicitly reason about the requirements and formulate a structured preliminary plan. • Iterative Execution : Execute your plan step-by-step. After each tool call, analyze the result to decide the next step. If a step fails or produces unexpected results, reﬂect on the cause and adjust ...

work page
[41]

Multimodal Data Handling
• Selective Inspection: You have access to multimodal inputs (text, images, audio, video). Inspect these assets only when essential for task comprehension or result verification using available viewer tools.
• Document Standards: For document generation tasks (PPT, LaTeX, Word), you are responsible for ensuring professional forma...

[42]

#### STOP

Tool Usage & File Management
• Parameter Alignment: When using generation tools (e.g., for media creation), carefully select parameters that align with the specific context, style, and constraints of the user’s request.
• Absolute Paths Mandatory: You must use absolute paths for all file references, whether writing input_file_path or output_file_path pa...

[43]

Faithful to the Persona: Speak and act according to the identity, tone, and context defined in User Persona.

[44]

Background Description + Specific Request

Faithful to the Instructions: Your requests must strictly align with the Task Configuration. Do not hallucinate or deviate from the given task details.

## Task Configuration

### User Persona

I am the General Manager of Nebula Heights Resort.

### Instructions

We have a very important guest checking in today: a boy named Leo who is celebrating his 10th bi...

[45]

Inventory & Pricing: The hotel’s standard rates are $25 for round tables, $30 for rectangular tables, and $5 per chair
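
The quoted rates are enough to sketch the budget arithmetic this task implies. The snippet below is a minimal illustration only: the function name `rental_cost`, the item keys, and the example quantities are assumptions made for demonstration and do not come from the benchmark task.

```python
# Standard rental rates quoted in the task text (USD per item).
RATES = {"round_table": 25, "rectangular_table": 30, "chair": 5}


def rental_cost(counts):
    """Return the total rental cost for a dict of item -> quantity."""
    unknown = set(counts) - set(RATES)
    if unknown:
        raise ValueError(f"no rate for: {sorted(unknown)}")
    return sum(RATES[item] * qty for item, qty in counts.items())


# Hypothetical quantities, not taken from the benchmark task.
example = {"round_table": 10, "rectangular_table": 2, "chair": 100}
print(rental_cost(example))  # 10*25 + 2*30 + 100*5 = 810
```

Keeping the rates in one table makes it easy to cross-check the totals an agent writes into the budget worksheet.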

[46]

Excel Format: The Excel workbook should contain only one worksheet named budget, and all calculations and summaries must be performed on this sheet.

Figure 7: Rough floor plan for the Johnson-Smith wedding layout_sketch.jpg.

Figure 8: Comparison of output results: (a) Ground truth for the signature placement task; (b) Final image generated by Gem...

[47]

Output Scope: Only generate the final composited image file. Do not create Word, PDF, or other document formats.

[48]

Placement Requirement: The landlord’s signature must be placed on the line labeled LANDLORD’S SIGNATURE at the bottom of the contract. The tenant’s signature must be placed on the line labeled TENANT’S SIGNATURE.

[49]

Size and Proportion: Each signature must be proportionally scaled according to the length of the signature line, preserving the original aspect ratio without stretching. Signature strokes must remain clear and legible.

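
The proportional-scaling requirement can be illustrated with a short calculation. This is a sketch under stated assumptions: the helper name `fit_signature` and the `fill` fraction of 0.8 are hypothetical choices, since the task text only asks for scaling relative to the signature-line length with the aspect ratio preserved.

```python
def fit_signature(sig_w, sig_h, line_len, fill=0.8):
    """Scale a signature to span `fill` of the signature line.

    A single scale factor is applied to both dimensions, so the
    original aspect ratio is preserved (no stretching).
    `fill` is an assumed parameter, not part of the task text.
    """
    if min(sig_w, sig_h, line_len) <= 0 or not 0 < fill <= 1:
        raise ValueError("sizes must be positive and fill must be in (0, 1]")
    scale = (line_len * fill) / sig_w
    return round(sig_w * scale), round(sig_h * scale)


# Hypothetical pixel sizes: a 600x200 signature on a 300 px line.
print(fit_signature(600, 200, 300))  # (240, 80)
```

The resulting width and height can then be passed to whatever image-resizing tool the agent has available.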
[50]

No Obstruction: The signatures must not cover the labels (LANDLORD’S SIGNATURE / TENANT’S SIGNATURE) or any critical contract text. They should only overlap the signature line area.

[51]

Output Naming: Save the final output file as signed_rental_agreement.png.

Figure 9: Initial and output files for the signature placement task: (a) Original lease agreement image (rental_agreement.jpg); (b) Output generated by Qwen 3.5 Plus showing spatial misalignment.

In the task Customer_Service/Real_Estate-rental_agreement, agent (Qwen 3.5 Plus) n...

[52]

Order and Courier Verification: Based on the order number provided by the user, the agent must search the Excel order table to identify the corresponding courier information (such as name or courier ID). If the order cannot be found in the table, this must be explicitly stated in the report, and the agent must not speculate about the courier’s identity.

[53]

Video Evidence Review Principle: All judgments regarding the courier’s behavior (including but not limited to damaging food, consuming customer items, or engaging in a conflict) must be strictly based on what is directly observable in the video. The agent must not infer actions that are not clearly shown. If key details are unclear or not visible, the repo...

[54]

Report Output Requirement: The agent must generate a formal investigation report in Word format. The file name should follow a generic incident-based naming convention and must not hard-code the order number.

[55]

Report Structure Specification: The report must contain the following top-level sections in order: (1) Incident Summary (2) Courier Information (sourced from order data) (3) Video Evidence Observations (objective, factual description only) (4) Conflict Cause Analysis (5) Handling Recommendations and Conclusion.

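
A section-order requirement like this one lends itself to a simple programmatic check. The sketch below is illustrative only and is not the benchmark's evaluator: `has_required_sections` is a hypothetical helper, and it assumes the parenthetical notes in the specification are guidance rather than part of the heading text.

```python
# Required top-level sections, in the order given by the specification.
REQUIRED_SECTIONS = [
    "Incident Summary",
    "Courier Information",
    "Video Evidence Observations",
    "Conflict Cause Analysis",
    "Handling Recommendations and Conclusion",
]


def has_required_sections(headings):
    """Return True if every required section appears, in the required order."""
    remaining = iter(h.strip() for h in headings)
    # `req in remaining` consumes the iterator up to the match, so each
    # later section can only match a heading that follows the earlier ones.
    return all(req in remaining for req in REQUIRED_SECTIONS)


print(has_required_sections(REQUIRED_SECTIONS))        # True
print(has_required_sections(REQUIRED_SECTIONS[::-1]))  # False
```

The list of headings would come from whatever document reader is available; extracting them from a Word file is outside this sketch.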
[56]

Penalty Recommendation Rules: A fine or disciplinary action may only be recommended if the video evidence clearly supports that the courier engaged in improper behavior that harmed the customer’s interests. If a dispute is visible but malicious behavior cannot be confirmed, the recommendation should focus on further investigation, warning, or training, wi...

[57]

Compensation and Redelivery Decision: If it can be confirmed that the food was damaged or consumed, the conclusion must clearly state whether redelivery or refund/compensation is recommended, and the decision must be justified based on verifiable evidence.

[58]

the final report must be named as incident_report.docx

turn8: read_media_file "path": workspace/video_frames/sec_0000.jpg
Tool Response
turn9: read_media_file "path": workspace/video_frames/sec_0010.jpg
Tool Response
turn10: read_media_file "path": workspace/video_frames/sec_0020.jpg
Tool Response
turn11: read_media_file "path": workspace/video_frames/sec_0...

[59]

Video Understanding: The agent must fully watch the input video and identify exactly three reasons stated by the speaker explaining why working for UberEats is not recommended. No additional reasons may be inferred or added.

[60]

Reason Condensation: Each reason must be summarized into one concise, impactful sentence suitable for social media consumption

[61]

Image Extraction: When the first reason is being discussed, if the video displays any work-hour or income-related statistics interface, the agent must capture that frame and save it as a standalone image file named UberEats_WorkHours_Screenshot.png.

[62]

Final Output: The agent must generate a complete Instagram caption and save all written content into a Word document named UberEats_Instagram_Content.docx. The image file must not be embedded in the Word document.

turn4: audio_to_text "path": workspace/UberEats_audio.mp3
Tool Response "srt_text": "Here are the top three reasons not to work for food deliver...

[63]

Glitch Moment Definition: The glitch moment refers to the earliest timestamp where duplicated enchanted items spontaneously appear in inventory or container interface without legitimate crafting, trading, or natural loot acquisition behavior.

[64]

Timestamp Identification: The agent must review the full video footage to confirm the exact first trigger time of the duplication glitch.

[65]

Keyframes: Extract exactly three PNG keyframes from the original video: (a) 2 seconds before the glitch trigger moment, (b) the exact glitch trigger moment, (c) 2 seconds after the glitch trigger moment. Save them as glitch_pre.png, glitch_trigger.png, glitch_post.png.

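
The keyframe rule reduces to simple timestamp arithmetic once the trigger time is known. The following is a minimal sketch, not the benchmark's tooling: `keyframe_times` and `mmss` are hypothetical helpers, the 75-second trigger is an invented example, and clamping to the start or end of the video is an assumption the task text does not state.

```python
def mmss(seconds):
    """Format a non-negative number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def keyframe_times(trigger_s, offset_s=2, duration_s=None):
    """Map each required output file name to its capture time in seconds."""
    pre = max(0, trigger_s - offset_s)  # assumed clamp at the video start
    post = trigger_s + offset_s
    if duration_s is not None:
        post = min(post, duration_s)    # assumed clamp at the video end
    return {
        "glitch_pre.png": pre,
        "glitch_trigger.png": trigger_s,
        "glitch_post.png": post,
    }


# Hypothetical trigger at 75 s into the recording.
times = keyframe_times(75)
print({name: mmss(t) for name, t in times.items()})
# {'glitch_pre.png': '01:13', 'glitch_trigger.png': '01:15', 'glitch_post.png': '01:17'}
```

The same `mmss` formatting could be reused wherever a report needs timestamps in MM:SS form.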
[66]

video_path

Report: Create mc_item_duplication_bug_report.docx with fixed sections: Title, Test Environment (mark unknown items as Not provided), Numbered Reproduction Steps, Expected Behavior, Actual Behavior, and Evidence Attachment (list screenshot files with corresponding MM:SS timestamps). All timestamps must follow MM:SS standard format.

turn4: export_frames_e...

[67]

Visual Consistency: When designing the hotel’s welcome-themed PPTX, you should first review the template file example.pptx and then design the PPTX according to the template’s layout. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed.

[68]

If the generated image does not meet the user’s actual requirements, you need to re-generate the image until it meets the user’s requirements

[69]

PPTx details: The text for the Welcome Title should be set to 32 pt. The Greeting section text should be set to 24 pt. The provided hotel_logo.png must be inserted at the bottom of the slide, and the image should not be overly eye-catching.

Thinking

Let me break down this task:

[70]

First, I need to generate an image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky as the background. The image should be named "Leo_astronaut.png".

[71]

Then, I need to create a welcome PPTX ("welcome_tv.pptx") using the template "example.pptx":
- Welcome Title: "Space Explorer"
- Insert the generated image ("Leo_astronaut.png") into the position specified by the template
- Greeting section: "Happy 10th Birthday, Astronaut Leo!"
- Insert "hotel_logo.png" at the bottom of the slide

Let me start by: 1. R...

[72]

**Leo.png**: Shows a young boy in a blue striped shirt holding a yellow soccer ball, running on grass with a playground in the background

[73]

**hotel_logo.png**: The Hilton Hotels & Resorts logo

[74]

**Template (example.pptx)**: Has 11 layouts. I need to understand the structure better. The template has layouts like "Title Slide", "Title and Content", etc.

Now let me:
1. First generate the astronaut image of Leo on the moon with a spaceship and blue starry sky
2. Then create the PPTX using the template

Let me start by generating the astronaut image. i...

[75]

1. Generated "Leo_astronaut.png" - An image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky background.
2. Created "welcome_tv.pptx" from the template.

[76]

3. Added "Space Explorer" as the Welcome Title (32 pt font).
4. Added "Happy 10th Birthday, Astronaut Leo!" as the Greeting (24 pt font).
5. Inserted the "Leo_astronaut.png" image into the slide.
6. Inserted the "hotel_logo.png" at the bottom of the slide (smaller size so it’s not overly eye-catching).

The task is complete!

local-claim_done
Tool Response you...

[77]

After taking the screenshot, please check that the screenshot content is complete and correct

[78]

Please use Markdown syntax for inserting images and writing their captions.

Thinking

The user wants me to:

[79]

Listen to a video (`Notebook.mp4`) about using pandas to read an Excel table

[80]

Identify two main parts: `Data Preparation` and `Code Example`

[81]

Create a `tutorial.md` file with second-level headers for each part


[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[3] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. ArXiv, abs/2406.12045, 2024. URL https://api.semanticscholar.org/CorpusID:270562578

[4] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In The Twelfth International Conference on Learning Representations, 2023.

[5] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.

[6] Nicholas Farn and Richard Shin. ToolTalk: Evaluating tool-usage in a conversational setting. arXiv preprint arXiv:2311.10775, 2023.

[7] Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025.

[8] Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, and Chao Shen. MCP-Radar: A multi-dimensional benchmark for evaluating tool use capabilities in large language models. arXiv preprint arXiv:2505.16700, 2025.

[9] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453, 2025.

[10] Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. MCP-Universe: Benchmarking large language models with real-world model context protocol servers. arXiv preprint arXiv:2508.14704, 2025.

[11] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.

[12] Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025.

[13] Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, and Dimitris N Metaxas. M^3-bench: Multi-modal, multi-hop, multi-threaded tool-using MLLM agent benchmark. arXiv preprint arXiv:2511.17729, 2025.

[14] Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, et al. OmniGAIA: Towards native omni-modal AI agents. arXiv preprint arXiv:2602.22897, 2026.

[15] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.

[16] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

[17] Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs. Advances in Neural Information Processing Systems, 37:126544–126565, 2024.

[18] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.

[19] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2023.

[20] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.

[21] Shiqing Fan, Xichen Ding, Liang Zhang, and Linjian Mo. MCPToolBench++: A large scale AI agent model context protocol MCP tool use benchmark. arXiv preprint arXiv:2508.07575, 2025.

[22] Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents. arXiv preprint arXiv:2510.24563, 2025.

[23] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024.

[24] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024.

[25] Soham Ray, Keshav Dhandhania, Victor Barres, and Karthik Narasimhan. τ-voice: Benchmarking full-duplex voice agents on real-world domains. arXiv preprint arXiv:2603.13686, 2026.

[26] Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, et al. MMDeepResearch-Bench: A benchmark for multimodal deep research agents. arXiv preprint arXiv:2601.12346, 2026.

[27] Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Li, Hanlin Zhao, et al. VisualAgentBench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327, 2024.

[28] Jiaxin Ai, Yukang Feng, Fanrui Zhang, Jianwen Sun, Zizhen Li, Chuanhao Li, Yifan Chang, Wenxiao Wu, Ruoxi Wang, Mingliang Zhai, and Kaipeng Zhang. ProSoftArena: Benchmarking hierarchical capabilities of multimodal agents in professional software environments. arXiv preprint arXiv:2601.02399, 2025.

[29] Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, and Shenghua Gao. MLLM-Tool: A multimodal large language model for tool agent learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6678–6687. IEEE, 2025.

[30] Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, et al. UniVA: Universal video agent towards open-source next-generation video generalist. arXiv preprint arXiv:2511.08521, 2025.

[31] Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, and Yunhe Wang. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6674–6683, 2023.

[32] Yanan Guo, Wenhui Dong, Jun Song, Shiding Zhu, Xuan Zhang, Hanqing Yang, Yingbo Wang, Yang Du, Xianing Chen, and Bo Zheng. FiLA-Video: Spatio-temporal compression for fine-grained long video understanding. arXiv preprint arXiv:2504.20384, 2025.

[33] Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, et al. Ivy-Fake: A unified explainable framework and benchmark for image and video AIGC detection. arXiv preprint arXiv:2506.00979, 2025.

[34] Shiding Zhu, Wenhui Dong, Jun Song, Yingbo Wang, Yanan Guo, and Bo Zheng. FiLA: Fine-grained vision language models. arXiv preprint arXiv:2412.08378, 2024.

[35] Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, et al. SpineBench: A clinically salient, level-aware benchmark powered by the SpineMed-450k corpus. arXiv preprint arXiv:2510.03160, 2025.

[36] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[37] [37]

After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed

Visual Consistency: When designing the hotel’s welcome-themed PPTX, you should ﬁrst review the template ﬁle example.pptx and then design the PPTX according to the template’s layout. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed

work page

[38] [39]

The Greeting section text should be set to 24 pt

PPTx details: The text for the Welcome Title should be set to 32 pt. The Greeting section text should be set to 24 pt. The provided hotel_logo.png must be inserted at the bottom of the slide, and the image should not be overly eye-catching. ## Operational Guidelines

work page

[39] [40]

• Iterative Execution : Execute your plan step-by-step

Task Processing Protocol • Analyze & Plan : Upon receiving a request, explicitly reason about the requirements and formulate a structured preliminary plan. • Iterative Execution : Execute your plan step-by-step. After each tool call, analyze the result to decide the next step. If a step fails or produces unexpected results, reﬂect on the cause and adjust ...

work page

[40] [41]

Inspect these assets only when essential for task comprehension or result veriﬁcation using available viewer tools

Multimodal Data Handling • Selective Inspection : Y ou have access to multimodal inputs (text, images, audio, video). Inspect these assets only when essential for task comprehension or result veriﬁcation using available viewer tools. • Document Standards: For document generation tasks (PPT, LaTeX, Word), you are responsible for ensuring professional forma...

work page

[41] [42]

#### STOP

Tool Usage & File Management • Parameter Alignment: When using generation tools (e.g., for media creation), carefully select pa- rameters that align with the speciﬁc context, style, and constraints of the user’s request. • Absolute Paths Mandatory : Y ou must use absolute paths for all ﬁle references, whether writing input_file_path or output_file_path pa...

work page

[42] [43]

Faithful to the Persona: Speak and act according to the identity, tone, and context deﬁned in User Persona

work page

[43] [44]

Background Description + Speciﬁc Request

Faithful to the Instructions : Y our requests must strictly align with the Task Configuration . Do not hallucinate or deviate from the given task details. ## Task Conﬁguration ### User Persona I am the General Manager of Nebula Heights Resort. ### Instructions We have a very important guest checking in today: a boy named Leo who is celebrating his 10th bi...

work page

[44] [45]

Inventory & Pricing: The hotel’s standard rates are $25 for round tables, $30 for rectangular tables, and $5 per chair

work page

[45] [46]

Figure 7: Rough ﬂoor plan for the Johnson-Smith wedding layout_sketch.jpg

Excel Format: The Excel workbook should contain only one worksheet named budget, and all calcu- lations and summaries must be performed on this sheet. Figure 7: Rough ﬂoor plan for the Johnson-Smith wedding layout_sketch.jpg. (a) (b) Figure 8: Comparison of output results: (a) Ground truth for the signature placement task; (b) Final image generated by Gem...

work page

[46] [47]

Do not create Word, PDF, or other document formats

Output Scope: Only generate the ﬁnal composited image ﬁle. Do not create Word, PDF, or other document formats

work page

[47] [48]

The tenant’s signature must be placed on the line labeled TENANT’S SIGNATURE

Placement Requirement: The landlord’s signature must be placed on the line labeled LANDLORD’S SIGNATURE at the bottom of the contract. The tenant’s signature must be placed on the line labeled TENANT’S SIGNATURE

work page

[48] [49]

Signature strokes must remain clear and legible

Size and Proportion: Each signature must be proportionally scaled according to the length of the signature line, preserving the original aspect ratio without stretching. Signature strokes must remain clear and legible

work page

[49] [50]

They should only overlap the signature line area

No Obstruction: The signatures must not cover the labels ( LANDLORD’S SIGNATURE / TENANT’S SIGNATURE) or any critical contract text. They should only overlap the signature line area

work page

[50] [51]

Output Naming: Save the ﬁnal output ﬁle as signed_rental_agreement.png. (a) (b) Figure 9: Initial and output ﬁles for the signature placement task: (a) Original lease agreement image (rental_agreement.jpg); (b) Output generated by Qwen 3.5 Plus showing spatial misalignment. In the task Customer_Service/Real_Estate-rental_agreement, agent (Qwen 3.5 Plus) n...

work page

[51] [52]

If the order cannot be found in the table, this must be explicitly stated in the report, and the agent must not speculate about the courier’s identity

Order and Courier V eriﬁcation: Based on the order number provided by the user, the agent must search the Excel order table to identify the corresponding courier information (such as name or courier ID). If the order cannot be found in the table, this must be explicitly stated in the report, and the agent must not speculate about the courier’s identity

work page

[52] [53]

The agent must not infer actions that are not clearly shown

Video Evidence Review Principle: All judgments regarding the courier’s behavior (including but not limited to damaging food, consuming customer items, or engaging in a conﬂict) must be strictly based on what is directly observable in the video. The agent must not infer actions that are not clearly shown. If key details are unclear or not visible, the repo...

work page

[53] [54]

The ﬁle name should follow a generic incident-based naming convention and must not hard-code the order number

Report Output Requirement: The agent must generate a formal investigation report in Word format. The ﬁle name should follow a generic incident-based naming convention and must not hard-code the order number

work page

[54] [55]

Report Structure Speciﬁcation: The report must contain the following top-level sections in order: (1) Incident Summary (2) Courier Information (sourced from order data) (3) Video Evidence Observations (objective, factual description only) (4) Conﬂict Cause Analysis (5) Handling Recommendations and Con- clusion

work page

[55] [56]

Penalty Recommendation Rules: A ﬁne or disciplinary action may only be recommended if the video evidence clearly supports that the courier engaged in improper behavior that harmed the customer’s inter- ests. If a dispute is visible but malicious behavior cannot be conﬁrmed, the recommendation should focus on further investigation, warning, or training, wi...

work page

[56] [57]

Compensation and Redelivery Decision: If it can be conﬁrmed that the food was damaged or consumed, the conclusion must clearly state whether redelivery or refund/compensation is recommended, and the de- cision must be justiﬁed based on veriﬁable evidence

work page

[57] [58]

path": workspace/video_frames/sec_0000.jpg Tool Response 22 turn9: read_media_ﬁle

the ﬁnal report must be named as incident_report.docx turn8: read_media_ﬁle "path": workspace/video_frames/sec_0000.jpg Tool Response 22 turn9: read_media_ﬁle "path": workspace/video_frames/sec_0010.jpg Tool Response turn10: read_media_ﬁle "path": workspace/video_frames/sec_0020.jpg Tool Response turn11: read_media_ﬁle "path": workspace/video_frames/sec_0...

work page

[58] [59]

No additional reasons may be inferred or added

Video Understanding: The agent must fully watch the input video and identify exactly three reasons stated by the speaker explaining why working for UberEats is not recommended. No additional reasons may be inferred or added


Reason Condensation: Each reason must be summarized into one concise, impactful sentence suitable for social media consumption


Image Extraction: When the first reason is being discussed, if the video displays any work-hour or income-related statistics interface, the agent must capture that frame and save it as a standalone image file named UberEats_WorkHours_Screenshot.png


Final Output: The agent must generate a complete Instagram caption and save all written content into a Word document named UberEats_Instagram_Content.docx. The image file must not be embedded in the Word document.

turn4: audio_to_text "path": workspace/UberEats_audio.mp3 Tool Response "srt_text": "Here are the top three reasons not to work for food deliver...
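The Final Output rule above, that the image must not be embedded in the Word document, can be approximated with the standard library alone. A .docx file is a ZIP package, and embedded pictures are normally stored under `word/media/`. The sketch below is an illustrative check, not the paper's evaluator. The function name `docx_has_embedded_media` is hypothetical, and the check would not detect externally linked images.

```python
import io
import zipfile


def docx_has_embedded_media(docx_bytes: bytes) -> bool:
    """Return True if the .docx package contains any part under word/media/."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return any(name.startswith("word/media/") for name in package.namelist())


def build_package(part_names: list[str]) -> bytes:
    """Build an in-memory ZIP with the given part names (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name in part_names:
            package.writestr(name, b"placeholder")
    return buffer.getvalue()
```

In an evaluator, `docx_bytes` would be read from UberEats_Instagram_Content.docx, and a `True` result would count as a violation of the rule.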


Glitch Moment Definition: The glitch moment refers to the earliest timestamp where duplicated enchanted items spontaneously appear in inventory or container interface without legitimate crafting, trading, or natural loot acquisition behavior


Timestamp Identification: The agent must review the full video footage to confirm the exact first trigger time of the duplication glitch


Keyframes: Extract exactly three PNG keyframes from the original video: (a) 2 seconds before the glitch trigger moment, (b) the exact glitch trigger moment, (c) 2 seconds after the glitch trigger moment. Save them as glitch_pre.png, glitch_trigger.png, glitch_post.png
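The keyframe rule reduces to simple timestamp arithmetic around the trigger moment. The sketch below maps each required file name to the second at which its frame should be captured. The function name `keyframe_plan` is hypothetical. Clamping to the video bounds is an assumption for triggers near the start or end of the footage, a case the rule does not address.

```python
def keyframe_plan(
    trigger_s: float, duration_s: float, offset_s: float = 2.0
) -> dict[str, float]:
    """Map each required PNG name to its capture time in seconds.

    Assumption: times that fall outside the video are clamped to [0, duration_s].
    """
    if not 0.0 <= trigger_s <= duration_s:
        raise ValueError("trigger time must lie within the video")

    def clamp(t: float) -> float:
        return min(max(t, 0.0), duration_s)

    return {
        "glitch_pre.png": clamp(trigger_s - offset_s),
        "glitch_trigger.png": trigger_s,
        "glitch_post.png": clamp(trigger_s + offset_s),
    }
```

The resulting times would then be passed to whichever frame-export tool the agent has available.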


Report: Create mc_item_duplication_bug_report.docx with fixed sections: Title, Test Environment (mark unknown items as Not provided), Numbered Reproduction Steps, Expected Behavior, Actual Behavior, and Evidence Attachment (list screenshot files with corresponding MM:SS timestamps). All timestamps must follow MM:SS standard format.

turn4: export_frames_e...
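The MM:SS requirement in the report rule can be illustrated with a small formatter. This is a sketch under assumptions: fractional seconds are truncated rather than rounded, and the helper names `to_mmss` and `evidence_lines` and the line layout of the evidence list are illustrative, not taken from the benchmark.

```python
def to_mmss(seconds: float) -> str:
    """Format a non-negative time in seconds as zero-padded MM:SS.

    Assumption: fractional seconds are truncated, not rounded.
    """
    if seconds < 0:
        raise ValueError("time must be non-negative")
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def evidence_lines(screenshots: dict[str, float]) -> list[str]:
    """Render one 'file name (MM:SS)' line per screenshot, in insertion order."""
    return [f"{name} ({to_mmss(t)})" for name, t in screenshots.items()]
```

For example, a frame captured 75 seconds into the video is listed with the timestamp 01:15.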


Visual Consistency: When designing the hotel’s welcome-themed PPTX, you should first review the template file example.pptx and then design the PPTX according to the template’s layout. After completing the PPT, you should check that the layout is consistent and the design is visually appealing, and adjust the PPT as needed


If the generated image does not meet the user’s actual requirements, you need to re-generate the image until it meets the user’s requirements


PPTX details: The text for the Welcome Title should be set to 32 pt. The Greeting section text should be set to 24 pt. The provided hotel_logo.png must be inserted at the bottom of the slide, and the image should not be overly eye-catching.

Thinking
Let me break down this task:


First, I need to generate an image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky as the background. The image should be named "Leo_astronaut.png"


Then, I need to create a welcome PPTX ("welcome_tv.pptx") using the template "example.pptx":
- Welcome Title: "Space Explorer"
- Insert the generated image ("Leo_astronaut.png") into the position specified by the template
- Greeting section: "Happy 10th Birthday, Astronaut Leo!"
- Insert "hotel_logo.png" at the bottom of the slide

Let me start by: 1. R...


**Leo.png**: Shows a young boy in a blue striped shirt holding a yellow soccer ball, running on grass with a playground in the background


**hotel_logo.png**: The Hilton Hotels & Resorts logo


**Template (example.pptx)**: Has 11 layouts. I need to understand the structure better. The template has layouts like "Title Slide", "Title and Content", etc.

Now let me:
1. First generate the astronaut image of Leo on the moon with a spaceship and blue starry sky
2. Then create the PPTX using the template

Let me start by generating the astronaut image. i...


1. Generated "Leo_astronaut.png" - An image of Leo wearing a spacesuit, standing on the moon, with a spaceship next to him and a blue starry sky background.
2. Created "welcome_tv.pptx" from the template


3. Added "Space Explorer" as the Welcome Title (32 pt font).
4. Added "Happy 10th Birthday, Astronaut Leo!" as the Greeting (24 pt font).
5. Inserted the "Leo_astronaut.png" image into the slide.
6. Inserted the "hotel_logo.png" at the bottom of the slide (smaller size so it’s not overly eye-catching).

The task is complete!

local-claim_done Tool Response you...


After taking the screenshot, please check that the screenshot content is complete and correct


Please use Markdown syntax for inserting images and writing their captions.

Thinking
The user wants me to:


Listen to a video (`Notebook.mp4`) about using pandas to read an Excel table


Identify two main parts: `Data Preparation` and `Code Example`


Create a `tutorial.md` file with second-level headers for each part
