InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Ali Payani; Bin Lei; Caiwen Ding; Mimi Xie; Mingyi Hong; Shan Zuo; Weitai Kang; Winson Chen; Xi Xie; Yan Yan; Zijian Zhang

arxiv: 2505.10887 · v3 · submitted 2025-05-16 · 💻 cs.AI

Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding

Pith reviewed 2026-05-22 15:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords multimodal agent · computer interaction · tool-based agents · vision agents · modular architecture · OSWorld · generalist agent · SWE-Bench

The pith

InfantAgent-Next reaches 7.27 percent accuracy on OSWorld by letting tool and vision agents collaborate in a modular setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InfantAgent-Next as a multimodal generalist agent for computer interaction that handles text, images, audio, and video inputs. It builds a highly modular architecture that combines tool-based agents with pure vision agents so different models can address separate parts of a task one step at a time. This setup is shown to work on vision-heavy benchmarks such as OSWorld and on tool-intensive ones such as GAIA and SWE-Bench. A reader would care because the design aims to avoid both single-model rigidity and complex fixed workflows, opening a route toward more flexible agents that can switch between perception and action modes as needed.

Core claim

InfantAgent-Next achieves 7.27 percent accuracy on OSWorld, higher than Claude-Computer-Use, by integrating tool-based and pure vision agents within a highly modular architecture that enables different models to collaboratively solve decoupled tasks in a step-by-step manner. The same architecture supports evaluation on GAIA and SWE-Bench to demonstrate broader applicability across vision-based and tool-intensive computer interaction benchmarks.

What carries the argument

The highly modular architecture that integrates tool-based agents and pure vision agents to allow collaborative, step-by-step solving of decoupled tasks.
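
As a rough illustration of what such a modular loop could look like, the sketch below wires a planner, a tool selector, and per-toolkit executors together. The toolkit names and the `<toolkit> name </toolkit>` reply format are taken from the paper's tool-selection prompt; everything else (the `Agent` class, `parse_toolkit`, the stub functions) is hypothetical and is not the authors' implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Toolkit names as they appear in the paper's tool-selection prompt.
TOOLKITS = {"file_edit", "code_exec", "computer_interaction",
            "web_browse", "file_understand"}
_TAG = re.compile(r"<\s*toolkit\s*>\s*(\w+)\s*<\s*/\s*toolkit\s*>")


def parse_toolkit(reply: str) -> Optional[str]:
    """Extract a known toolkit name from a '<toolkit> name </toolkit>' reply."""
    match = _TAG.search(reply)
    if match and match.group(1) in TOOLKITS:
        return match.group(1)
    return None


@dataclass
class Agent:
    """Hypothetical planner -> selector -> executor loop (illustration only)."""
    planner: Callable[[str, List[str]], Optional[str]]  # returns next sub-task or None
    selector: Callable[[str], str]                       # returns a raw tagged reply
    executors: Dict[str, Callable[[str], str]]           # one executor per toolkit
    history: List[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 10) -> List[str]:
        for _ in range(max_steps):
            step = self.planner(task, self.history)
            if step is None:  # planner signals that the task is finished
                break
            toolkit = parse_toolkit(self.selector(step))
            if toolkit is None or toolkit not in self.executors:
                self.history.append(f"skip: {step}")
                continue
            self.history.append(self.executors[toolkit](step))
        return self.history


# Stub models standing in for real planning, selection, and execution models.
def stub_planner(task: str, history: List[str]) -> Optional[str]:
    steps = ["open report.txt", "click the Save button"]
    return steps[len(history)] if len(history) < len(steps) else None


def stub_selector(step: str) -> str:
    name = "computer_interaction" if "click" in step else "file_edit"
    return f"<toolkit> {name} </toolkit>"


agent = Agent(stub_planner, stub_selector, {
    "file_edit": lambda s: f"file_edit did: {s}",
    "computer_interaction": lambda s: f"vision did: {s}",
})
result = agent.run("edit and save the report")
print(result)
```

Because each role is just a callable, a tool-based model and a vision model can be swapped in per toolkit without touching the loop, which is the kind of decoupling the paper describes.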

If this is right

The agent can be evaluated on pure vision-based real-world benchmarks such as OSWorld.
It also performs on general or tool-intensive benchmarks including GAIA and SWE-Bench.
Different models can be plugged into the architecture to handle specific aspects of the overall task.
The open-source code and evaluation scripts allow direct replication and extension to new benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same splitting principle might reduce the size of any single model required for complex computer tasks by distributing perception and action across specialized components.
Similar modularity could be tested in non-computer domains that mix visual observation with tool use, such as robotic manipulation or document processing.
Future measurements could track whether coordination between the two agent types adds latency or error accumulation as task length increases.

Load-bearing premise

The modular split between tool-based and vision agents produces reliable step-by-step collaboration without hidden coordination costs or benchmark-specific tuning that would invalidate cross-benchmark comparisons.

What would settle it

A controlled test on OSWorld that runs the same underlying models both with and without the modular split and finds that the split version does not exceed single-model baselines would show the architecture does not deliver the claimed benefit.
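
One way to score such a controlled test is a paired comparison over the same task list. The sketch below is an assumption about how this could be done, not a procedure from the paper: it computes both accuracies and an exact two-sided sign test on the tasks where the two configurations disagree. The function name and the example outcomes are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def paired_comparison(split: Sequence[bool],
                      single: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (accuracy_split, accuracy_single, two-sided exact sign-test p-value).

    Both sequences hold per-task success flags for the same tasks in the same order.
    """
    if len(split) != len(single) or not split:
        raise ValueError("both runs must cover the same non-empty task list")
    n = len(split)
    acc_split = sum(split) / n
    acc_single = sum(single) / n
    # Only tasks where the two configurations disagree carry information.
    wins = sum(1 for a, b in zip(split, single) if a and not b)
    losses = sum(1 for a, b in zip(split, single) if b and not a)
    discordant = wins + losses
    if discordant == 0:
        return acc_split, acc_single, 1.0
    k = min(wins, losses)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return acc_split, acc_single, min(1.0, 2 * tail)


# Placeholder outcomes for eight tasks; these are not results from the paper.
split_run = [True, True, True, False, True, False, True, True]
single_run = [True, False, False, False, True, False, False, True]
acc_split, acc_single, p_value = paired_comparison(split_run, single_run)
print(acc_split, acc_single, p_value)
```

Pairing by task matters here: comparing two headline accuracies alone cannot separate the effect of the split from differences in which tasks each configuration happens to solve.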

Figures

Figures reproduced from arXiv: 2505.10887 by Ali Payani, Bin Lei, Caiwen Ding, Mimi Xie, Mingyi Hong, Shan Zuo, Weitai Kang, Winson Chen, Xi Xie, Yan Yan, Zijian Zhang.

**Figure 2.** INFANTAGENT-NEXT architecture overview (legend icons omitted). Labels in the legend: user input argument and request; environment: agent interaction environment, terminal interface, GNOME desktop, Jupyter; models: load workflow models, planning model, tool selection model, execution model, Planner, Tool Selector, Executor; tools: load toolsets, tool models argument Vision_model_name.…

**Figure 3.** We conduct an ablation study on the Iterative Region Cropping setup from four perspectives.

**Figure 4.** Evaluation on a subset of SWE-Bench-Verified.

**Figure 5.** Cases analysis. Zoom in to view the detailed content in the screenshot.

Original abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InfantAgent-Next, a multimodal generalist agent for automated computer interaction that integrates tool-based and pure-vision agents within a highly modular architecture. This design is claimed to enable different models to collaboratively solve decoupled tasks in a step-by-step manner. The approach is evaluated on OSWorld (reporting 7.27% accuracy, exceeding Claude-Computer-Use), as well as GAIA and SWE-Bench, with code and evaluation scripts open-sourced.

Significance. If the modular collaboration demonstrably improves performance beyond single-agent baselines or model selection effects, the work could advance generalist agents by showing how decoupled tool and vision components can be orchestrated without heavy workflow engineering. The open-sourcing of code and multi-benchmark evaluation are positive for reproducibility and generality claims.

major comments (2)

[Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.
[Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.

minor comments (2)

[Architecture] The abstract and evaluation sections could more explicitly define the coordination protocol between tool-based and vision agents (e.g., message passing format or decision criteria for handoff).
[Abstract] Minor notation inconsistency: the paper uses both 'InfantAgent-Next' and 'InfantAgent' in the abstract; standardize throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will incorporate additional experimental details and analyses in the revised manuscript to better substantiate the claims.

Referee: [Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.

Authors: We agree that explicit ablations would more rigorously isolate the contribution of the modular handoff mechanism. The manuscript does compare against Claude-Computer-Use, which operates as a single-model agent without the described tool-vision decoupling, and reports higher accuracy on OSWorld while also showing results on GAIA and SWE-Bench. However, we did not run dedicated single-model or no-handoff variants on the identical OSWorld task subset. In the revision we will add these controls: a unified-model baseline using the same component models without inter-agent collaboration, and a version that disables handoff, to clarify whether the step-by-step modular orchestration provides gains beyond model selection or prompting.
revision: yes

Referee: [Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.

Authors: We acknowledge that the experimental reporting was insufficiently detailed. The original manuscript presented the headline accuracy but omitted variance measures and protocol specifics. In the revised version we will add error bars computed over multiple runs, state the number of runs and temperature settings (e.g., 0 for deterministic decoding), provide explicit exclusion criteria for task instances, and include a failure-mode breakdown. The complete protocol will be documented in an expanded appendix to support reproducibility and robustness evaluation.
revision: yes
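
The reporting promised here (error bars over multiple runs plus a failure-mode breakdown) could be computed along the following lines. This is a minimal sketch under assumptions: the function names, the failure labels, and the numbers are placeholders for illustration and do not come from the paper.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Iterable, Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean accuracy, standard error of the mean) over repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return mean(accuracies), stdev(accuracies) / sqrt(len(accuracies))


def failure_breakdown(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of failed tasks that fall into each failure category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Placeholder values, not results from the paper.
run_accuracies = [0.30, 0.32, 0.34]
avg, stderr = summarize_runs(run_accuracies)
shares = failure_breakdown(["grounding", "grounding", "planning", "timeout"])
print(f"accuracy = {avg:.3f} +/- {stderr:.3f}")
print(shares)
```

With decoding temperature set to 0 the run-to-run spread may be small, so the standard error mainly captures environment nondeterminism rather than sampling noise.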

Circularity Check

0 steps flagged

No circularity: results are direct empirical evaluations on public benchmarks

The manuscript presents an agent architecture and reports accuracy figures (e.g., 7.27% on OSWorld) obtained via direct evaluation on external public benchmarks. No equations, fitted parameters, or first-principles derivations are described that would reduce to the inputs by construction. The modular integration of tool-based and vision agents is an engineering choice whose contribution is asserted through benchmark outcomes rather than any self-referential definition or self-citation chain that bears the central load. This is a standard empirical systems paper whose claims remain independent of the circularity patterns enumerated.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions from the AI-agent literature about modularity and benchmark validity; no new mathematical axioms or invented physical entities are introduced.

axioms (1)

domain assumption: Modular decomposition of tasks between tool and vision agents enables effective collaboration without prohibitive integration overhead.
Invoked in the description of the architecture that allows different models to solve decoupled tasks step-by-step.

pith-pipeline@v0.9.0 · 5706 in / 1199 out tokens · 44840 ms · 2026-05-22T15:14:08.462960+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: Iterative Region Cropping and Mouse Click Logic (Algorithm 1)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 14 internal anchors

[1]

Cortexa: Enhancing llm agents for software engineering tasks via improved localization and solution diversity. https://research.nvidia.com/labs/adlr/cortexa/

work page
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025

work page internal anchor Pith review arXiv 2025
[4]

Agentfarm. https://aide.dev/

AgentFarm. Agentfarm. https://aide.dev/

work page
[5]

Amazon q developer. https://aws.amazon.com/q/developer/

Amazon. Amazon q developer. https://aws.amazon.com/q/developer/

work page
[6]

Claude 3.7 sonnet

Anthropic. Claude 3.7 sonnet. Available at https://www.anthropic.com/claude/sonnet

work page
[7]

Claude computer use

Anthropic. Claude computer use. Available at https://www.anthropic.com/news/3-5-models-and-computer-use

work page
[8]

Appmap navie v2. https://appmap.io/product/appmap-navie.html

AppMap. Appmap navie v2. https://appmap.io/product/appmap-navie.html

work page
[9]

Autocoderover. https://www.autocoderover.net/

AutoCodeRover. Autocoderover. https://www.autocoderover.net/

work page
[10]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

work page arXiv 2023
[12]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

https://devlo.ai/

devlo. https://devlo.ai/

work page
[14]

GAIA benchmark leaderboard

Hugging Face. GAIA benchmark leaderboard. https://huggingface.co/spaces/gaia-benchmark/leaderboard, 2025. Accessed: 2025-05-15

work page 2025
[15]

Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034, 2024

Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, et al. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034, 2024

work page arXiv 2024
[16]

Google. Langfun. GitHub repository, 2025. https://github.com/google/langfun

work page 2025
[17]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Significant Gravitas. Autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2025. Accessed: 2025-05-15

work page 2025
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

H2o.ai. https://id.public.h2o.ai/

H2O.ai. H2o.ai. https://id.public.h2o.ai/

work page
[21]

Auto-deep-research

HKUDS. Auto-deep-research. GitHub repository. https://github.com/HKUDS/Auto-Deep-Research

work page
[22]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Huggingface agents

huggingface. Huggingface agents. https://huggingface.co/docs/transformers/v4.51.3/agents

work page
[24]

open deep research. https://huggingface.co/blog/open-deep-research

huggingface. open deep research. https://huggingface.co/blog/open-deep-research

work page
[25]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[27]

Langchain

LangChain AI. Langchain. GitHub repository, 2025. Accessed: 2025-05-14

work page 2025
[28]

Infant agent: A tool-integrated, logic-driven agent with cost-effective api usage. arXiv preprint arXiv:2411.01114, 2024

Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, and Qiuwu Chen. Infant agent: A tool-integrated, logic-driven agent with cost-effective api usage. arXiv preprint arXiv:2411.01114, 2024

work page arXiv 2024
[29]

Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

work page 2023
[30]

Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

work page arXiv 2025
[31]

Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024

work page arXiv 2024
[32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[33]

Gaia: a benchmark for general ai assistants, 2023

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants, 2023

work page 2023
[34]

Yohei Nakajima. Babyagi. GitHub repository, 2024. Accessed: 2025-05-14

work page 2024
[35]

A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. arXiv preprint arXiv:2503.23350, 2025

Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. arXiv preprint arXiv:2503.23350, 2025

work page arXiv 2025
[36]

Chatgpt. https://openai.com/gpt-4

OpenAI. Chatgpt. https://openai.com/gpt-4

work page
[37]

Computer-using agent

OpenAI. Computer-using agent. Available at https://openai.com/index/computer-using-agent/

work page
[38]

Introducing openai o3 and o4-mini

OpenAI. Introducing openai o3 and o4-mini. Available at https://openai.com/index/introducing-o3-and-o4-mini/

work page
[39]

Ormind. https://ormind.ai/

Ormind. Ormind. https://ormind.ai/

work page
[40]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadelmann. Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants. arXiv preprint arXiv:2501.16150, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Tapeagents

ServiceNow. Tapeagents. GitHub repository. https://github.com/ServiceNow/TapeAgents/tree/ui_demo/examples/gaia_agent

work page
[43]

SIMA. Sima. https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o

work page
[44]

Swe-agent. https://github.com/SWE-agent/SWE-agent

SWE-agent. Swe-agent. https://github.com/SWE-agent/SWE-agent

work page
[45]

Trase. https://www.trasesystems.com/

TRASE. Trase. https://www.trasesystems.com/

work page
[46]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2024

work page 2024
[48]

codeshell. https://github.com/WisdomShell/codeshell

WisdomShell. codeshell. https://github.com/WisdomShell/codeshell

work page
[49]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024

work page arXiv 2024
[51]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024

work page 2024
[53]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

work page 2024
[54]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Aria-ui: Visual grounding for gui instructions

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. arXiv preprint arXiv:2412.16256, 2024

work page arXiv 2024
[56]

If you want to select this set of commands, please return: <toolkit> file_edit </toolkit>

File editing related commands: This set of commands can be used to view file content, as well as perform additions, deletions, searches, and modifications on files. If you want to select this set of commands, please return: <toolkit> file_edit </toolkit>

work page
[57]

If you want to select this set of commands, please return: <toolkit> code_exec </toolkit>

Code execution related commands: This set of commands can be used to execute code snippets. If you want to select this set of commands, please return: <toolkit> code_exec </toolkit>

work page
[58]

If you want to select this set of commands, please return: <toolkit> computer_interaction </toolkit>

Computer interaction commands: These commands can be used to interact with the computer via the keyboard and mouse. If you want to select this set of commands, please return: <toolkit> computer_interaction </toolkit>

work page
[59]

If you want to select this set of commands, please return: <toolkit> web_browse </toolkit>

Web browsing related commands: This set of commands can be used to interact with web pages. If you want to select this set of commands, please return: <toolkit> web_browse </toolkit>

work page
[60]

Search Google or type a URL

File understanding related commands: This set of commands can be used to understand the content of files, such as reading files, viewing images, listening to audio, watching videos, etc. If you want to select this set of commands, please return: <toolkit> file_understand </toolkit> If you want to select multiple sets of commands ...

work page 1921
[61]

Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025

[1] [1]

Cortexa: Enhancing llm agents for software engineering tasks via improved localization and solution diversity.https://research.nvidia.com/labs/adlr/cortexa/

work page

[2] [2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

work page internal anchor Pith review arXiv 2025

[4] [4]

Agentfarm.https://aide.dev/

AgentFarm. Agentfarm.https://aide.dev/

work page

[5] [5]

Amazon q developer.https://aws.amazon.com/q/developer/

Amazon. Amazon q developer.https://aws.amazon.com/q/developer/

work page

[6] [6]

Claude 3.7 sonnet

Anthropic. Claude 3.7 sonnet. Available athttps://www.anthropic.com/claude/sonnet

work page

[7] [7]

Claude computer use

Anthropic. Claude computer use. Available at https://www.anthropic.com/news/ 3-5-models-and-computer-use

work page

[8] [8]

Appmap navie v2.https://appmap.io/product/appmap-navie.html

AppMap. Appmap navie v2.https://appmap.io/product/appmap-navie.html

work page

[9] [9]

Autocoderover.https://www.autocoderover.net/

AutoCodeRover. Autocoderover.https://www.autocoderover.net/

work page

[10] [10]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

work page arXiv 2023

[12] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024

[13] devlo. https://devlo.ai/. https://devlo.ai/

[14] Hugging Face. GAIA benchmark leaderboard. https://huggingface.co/spaces/gaia-benchmark/leaderboard, 2025. Accessed: 2025-05-15

[15] Dawei Gao, Zitao Li, Xuchen Pan, Weirui Kuang, Zhijian Ma, Bingchen Qian, Fei Wei, Wenhao Zhang, Yuexiang Xie, Daoyuan Chen, et al. Agentscope: A flexible yet robust multi-agent platform. arXiv preprint arXiv:2402.14034, 2024

[16] Google. Langfun. GitHub repository, 2025. https://github.com/google/langfun

[17] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

[18] Significant Gravitas. Autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2025. Accessed: 2025-05-15

[19] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[20] H2O.ai. H2o.ai. https://id.public.h2o.ai/

[21] HKUDS. Auto-deep-research. GitHub repository. https://github.com/HKUDS/Auto-Deep-Research

[22] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023

[23] huggingface. Huggingface agents. https://huggingface.co/docs/transformers/v4.51.3/agents

[24] huggingface. open deep research. https://huggingface.co/blog/open-deep-research

[25] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

[26] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024

[27] LangChain AI. Langchain. GitHub repository, 2025. Accessed: 2025-05-14

[28] Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, and Qiuwu Chen. Infant agent: A tool-integrated, logic-driven agent with cost-effective api usage. arXiv preprint arXiv:2411.01114, 2024

[29] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023

[30] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

[31] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024

[32] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

[33] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants, 2023

[34] Yohei Nakajima. Babyagi. GitHub repository, 2024. Accessed: 2025-05-14

[35] Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S Yu, et al. A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. arXiv preprint arXiv:2503.23350, 2025

[36] OpenAI. Chatgpt. https://openai.com/gpt-4

[37] OpenAI. Computer-using agent. Available at https://openai.com/index/computer-using-agent/

[38] OpenAI. Introducing openai o3 and o4-mini. Available at https://openai.com/index/introducing-o3-and-o4-mini/

[39] Ormind. Ormind. https://ormind.ai/

[40] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

[41] Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadelmann. Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants. arXiv preprint arXiv:2501.16150, 2025

[42] ServiceNow. Tapeagents. GitHub repository. https://github.com/ServiceNow/TapeAgents/tree/ui_demo/examples/gaia_agent

[43] SIMA. Sima. https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240706_sima_gpt4o

[44] SWE-agent. Swe-agent. https://github.com/SWE-agent/SWE-agent

[45] TRASE. Trase. https://www.trasesystems.com/

[46] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[47] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. In The Thirteenth International Conference on Learning Representations, 2024

[48] WisdomShell. codeshell. https://github.com/WisdomShell/codeshell

[49] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

[50] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024

[51] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

[52] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint, 2024

[53] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

[54] Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

[55] Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. arXiv preprint arXiv:2412.16256, 2024

A Case Analysis

Figure 5 illustrates the step-by-step process by which InfantAgent-Next solves a real-world query: “According to the World Bank, which countries had gross savings of over 35% of GDP for every year in the period 2001–2010?” ...

File editing related commands: This set of commands can be used to view file content, as well as perform additions, deletions, searches, and modifications on files. If you want to select this set of commands, please return: <toolkit>file_edit</toolkit>

Code execution related commands: This set of commands can be used to execute code snippets. If you want to select this set of commands, please return: <toolkit>code_exec</toolkit>

Computer interaction commands: These commands can be used to interact with the computer via the keyboard and mouse. If you want to select this set of commands, please return: <toolkit>computer_interaction</toolkit>

Web browsing related commands: This set of commands can be used to interact with web pages. If you want to select this set of commands, please return: <toolkit>web_browse</toolkit>

File understanding related commands: This set of commands can be used to understand the content of files. Such as reading files, view images, listen to audios, watch videos, etc. If you want to select this set of commands, please return: <toolkit>file_understand</toolkit>

If you want to select multiple sets of commands ...

Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects