arxiv: 2505.10887 · v3 · submitted 2025-05-16 · 💻 cs.AI

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Pith reviewed 2026-05-22 15:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multimodal agent · computer interaction · tool-based agents · vision agents · modular architecture · OSWorld · generalist agent · SWE-Bench

The pith

InfantAgent-Next reaches 7.27 percent accuracy on OSWorld by letting tool and vision agents collaborate in a modular setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InfantAgent-Next as a multimodal generalist agent for computer interaction that handles text, images, audio, and video inputs. It builds a highly modular architecture that combines tool-based agents with pure vision agents so different models can address separate parts of a task one step at a time. This setup is shown to work on vision-heavy benchmarks such as OSWorld and on tool-intensive ones such as GAIA and SWE-Bench. A reader would care because the design aims to avoid both single-model rigidity and complex fixed workflows, opening a route toward more flexible agents that can switch between perception and action modes as needed.

Core claim

InfantAgent-Next achieves 7.27 percent accuracy on OSWorld, higher than Claude-Computer-Use, by integrating tool-based and pure vision agents within a highly modular architecture that enables different models to collaboratively solve decoupled tasks in a step-by-step manner. The same architecture supports evaluation on GAIA and SWE-Bench to demonstrate broader applicability across vision-based and tool-intensive computer interaction benchmarks.

What carries the argument

The highly modular architecture that integrates tool-based agents and pure vision agents to allow collaborative, step-by-step solving of decoupled tasks.
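The step-by-step hand-off described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the InfantAgent-Next implementation: the class `ModularAgent`, the planner, selector, and executor callables, and the two toolkit names `"vision"` and `"tool"` are all hypothetical stand-ins, and the toy "models" are plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interfaces; none of these names come from the InfantAgent codebase.
Planner = Callable[[str, List[str]], str]   # (task, history) -> next sub-step or "DONE"
ToolSelector = Callable[[str], str]         # sub-step -> name of the agent to use
Executor = Callable[[str], str]             # sub-step -> observation text


@dataclass
class ModularAgent:
    planner: Planner
    selector: ToolSelector
    executors: Dict[str, Executor]
    max_steps: int = 10
    history: List[str] = field(default_factory=list)

    def run(self, task: str) -> List[str]:
        """Plan one sub-step, route it to one executor, record the result, repeat."""
        for _ in range(self.max_steps):
            step = self.planner(task, self.history)
            if step == "DONE":
                break
            agent_name = self.selector(step)
            observation = self.executors[agent_name](step)
            self.history.append(f"{agent_name}: {observation}")
        return self.history


# Toy stand-ins for the separate models (assumptions for illustration only).
def toy_planner(task: str, history: List[str]) -> str:
    plan = ["open settings window", "write result to file"]
    return plan[len(history)] if len(history) < len(plan) else "DONE"


def toy_selector(step: str) -> str:
    # Route GUI-looking steps to the vision agent, everything else to the tool agent.
    return "vision" if "window" in step else "tool"


agent = ModularAgent(
    planner=toy_planner,
    selector=toy_selector,
    executors={
        "vision": lambda s: f"clicked for '{s}'",
        "tool": lambda s: f"ran command for '{s}'",
    },
)
trace = agent.run("demo task")
print(trace)
```

Because each role is just a callable, a different model can be swapped into any slot without touching the loop, which is the property the review attributes to the architecture.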

If this is right

  • The agent can be evaluated on pure vision-based real-world benchmarks such as OSWorld.
  • It also performs on general or tool-intensive benchmarks including GAIA and SWE-Bench.
  • Different models can be plugged into the architecture to handle specific aspects of the overall task.
  • The open-source code and evaluation scripts allow direct replication and extension on new benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same splitting principle might reduce the size of any single model required for complex computer tasks by distributing perception and action across specialized components.
  • Similar modularity could be tested in non-computer domains that mix visual observation with tool use, such as robotic manipulation or document processing.
  • Future measurements could track whether coordination between the two agent types adds latency or error accumulation as task length increases.

Load-bearing premise

The modular split between tool-based and vision agents produces reliable step-by-step collaboration without hidden coordination costs or benchmark-specific tuning that would invalidate cross-benchmark comparisons.

What would settle it

A controlled test on OSWorld that runs the same underlying models both with and without the modular split and finds that the split version does not exceed single-model baselines would show the architecture does not deliver the claimed benefit.
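One way such a controlled test could be scored is a paired comparison over the same task instances. The sketch below is an assumption about how to analyse it, not a procedure from the paper: the function name `paired_comparison` is hypothetical, the per-task outcomes are invented toy data, and the significance check is an exact two-sided sign test on the tasks where the two variants disagree.

```python
from math import comb
from typing import Sequence, Tuple


def paired_comparison(
    modular: Sequence[bool], single: Sequence[bool]
) -> Tuple[float, int, int, float]:
    """Compare per-task success of two agent variants run on identical tasks.

    Returns (accuracy difference, tasks only modular solved,
             tasks only single-model solved, two-sided sign-test p-value).
    """
    if len(modular) != len(single):
        raise ValueError("both variants must be run on the same task list")
    n = len(modular)
    only_modular = sum(m and not s for m, s in zip(modular, single))
    only_single = sum(s and not m for m, s in zip(modular, single))
    diff = (sum(modular) - sum(single)) / n

    # Exact binomial test with p = 0.5 on the discordant pairs only.
    discordant = only_modular + only_single
    if discordant == 0:
        p_value = 1.0
    else:
        k = min(only_modular, only_single)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)
    return diff, only_modular, only_single, p_value


# Invented toy outcomes for six tasks; not results from the paper.
modular_runs = [True, True, False, True, False, True]
single_runs = [True, False, False, False, False, True]
diff, only_mod, only_single, p = paired_comparison(modular_runs, single_runs)
print(f"accuracy difference = {diff:+.3f}, discordant = {only_mod} vs {only_single}, p = {p:.3f}")
```

Pairing by task matters here because both variants share the same underlying models, so only the tasks where their outcomes differ carry information about the split itself.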

Figures

Figures reproduced from arXiv: 2505.10887 by Ali Payani, Bin Lei, Caiwen Ding, Mimi Xie, Mingyi Hong, Shan Zuo, Weitai Kang, Winson Chen, Xi Xie, Yan Yan, Zijian Zhang.

Figure 1: Three real-world task examples addressed by …
Figure 2: INFANTAGENT-NEXT architecture overview. The legend (icons not reproduced here) covers: user input argument and request; environment-related icons (agent interaction environment, terminal interface, GNOME desktop, Jupyter); model-related icons (load workflow models, planning model, tool selection model, execution model, planner, tool selector, executor); tool-related icons (load toolsets; tool model argument: Vision_model_name) …
Figure 3: We conduct an ablation study on the Iterative Region Cropping setup from four perspectives.
Figure 4: Evaluation on a subset of SWE-Bench-Verified.
Figure 5: Case analysis. Zoom in to view the detailed content in the screenshot.

Original abstract

This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve 7.27% accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InfantAgent-Next, a multimodal generalist agent for automated computer interaction that integrates tool-based and pure-vision agents within a highly modular architecture. This design is claimed to enable different models to collaboratively solve decoupled tasks in a step-by-step manner. The approach is evaluated on OSWorld (reporting 7.27% accuracy, exceeding Claude-Computer-Use), as well as GAIA and SWE-Bench, with code and evaluation scripts open-sourced.

Significance. If the modular collaboration demonstrably improves performance beyond single-agent baselines or model selection effects, the work could advance generalist agents by showing how decoupled tool and vision components can be orchestrated without heavy workflow engineering. The open-sourcing of code and multi-benchmark evaluation are positive for reproducibility and generality claims.

major comments (2)
  1. [Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.
  2. [Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.
minor comments (2)
  1. [Architecture] The abstract and evaluation sections could more explicitly define the coordination protocol between tool-based and vision agents (e.g., message passing format or decision criteria for handoff).
  2. [Abstract] Minor notation inconsistency: the paper uses both 'InfantAgent-Next' and 'InfantAgent' in the abstract; standardize throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will incorporate additional experimental details and analyses in the revised manuscript to better substantiate the claims.

Point-by-point responses
  1. Referee: [Experiments / Evaluation] The central claim attributes the 7.27% OSWorld accuracy to the modular integration of tool-based and pure-vision agents enabling reliable step-by-step collaboration. However, the manuscript provides no ablation studies that disable inter-agent handoff, force a unified model, or compare against single-agent variants on the same OSWorld subset. Without these controls, the performance gain cannot be isolated from component model choice or prompting.

    Authors: We agree that explicit ablations would more rigorously isolate the contribution of the modular handoff mechanism. The manuscript does compare against Claude-Computer-Use, which operates as a single-model agent without the described tool-vision decoupling, and reports higher accuracy on OSWorld while also showing results on GAIA and SWE-Bench. However, we did not run dedicated single-model or no-handoff variants on the identical OSWorld task subset. In the revision we will add these controls: a unified-model baseline using the same component models without inter-agent collaboration, and a version that disables handoff, to clarify whether the step-by-step modular orchestration provides gains beyond model selection or prompting. revision: yes

  2. Referee: [Experiments] § on OSWorld results: the reported accuracy lacks error bars, detailed exclusion rules for task instances, or a full experimental protocol (e.g., number of runs, temperature settings, or failure mode categorization). This makes it difficult to assess whether the result is robust or benchmark-specific.

    Authors: We acknowledge that the experimental reporting was insufficiently detailed. The original manuscript presented the headline accuracy but omitted variance measures and protocol specifics. In the revised version we will add error bars computed over multiple runs, state the number of runs and temperature settings (e.g., 0 for deterministic decoding), provide explicit exclusion criteria for task instances, and include a failure-mode breakdown. The complete protocol will be documented in an expanded appendix to support reproducibility and robustness evaluation. revision: yes
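The error bars promised in this response could be computed along the following lines. This is a minimal sketch, assuming the benchmark is run several times and each run yields one overall accuracy; the helper name `accuracy_with_error_bar` is hypothetical and the three run values are invented toy numbers, not figures from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def accuracy_with_error_bar(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean accuracy, standard error of the mean) across repeated runs."""
    if len(run_accuracies) < 2:
        raise ValueError("need at least two runs to estimate variability")
    avg = mean(run_accuracies)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = stdev(run_accuracies) / sqrt(len(run_accuracies))
    return avg, sem


# Invented toy accuracies from three hypothetical runs; not results from the paper.
toy_runs = [0.06, 0.08, 0.07]
avg, sem = accuracy_with_error_bar(toy_runs)
print(f"accuracy = {avg:.2%} ± {sem:.2%} (n = {len(toy_runs)} runs)")
```

With fully deterministic decoding the runs may be identical and the standard error collapses to zero, so the number of runs and the decoding settings need to be reported alongside the bar for it to be interpretable.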

Circularity Check

0 steps flagged

No circularity: results are direct empirical evaluations on public benchmarks

Full rationale

The manuscript presents an agent architecture and reports accuracy figures (e.g., 7.27% on OSWorld) obtained via direct evaluation on external public benchmarks. No equations, fitted parameters, or first-principles derivations are described that would reduce to the inputs by construction. The modular integration of tool-based and vision agents is an engineering choice whose contribution is asserted through benchmark outcomes rather than any self-referential definition or self-citation chain that bears the central load. This is a standard empirical systems paper whose claims remain independent of the circularity patterns enumerated.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on standard assumptions from the AI-agent literature about modularity and benchmark validity; no new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • Domain assumption: Modular decomposition of tasks between tool and vision agents enables effective collaboration without prohibitive integration overhead.
    Invoked in the description of the architecture that allows different models to solve decoupled tasks step-by-step.

pith-pipeline@v0.9.0 · 5706 in / 1199 out tokens · 44840 ms · 2026-05-22T15:14:08.462960+00:00 · methodology
