pith. machine review for the scientific record.

arXiv: 2509.02544 · v2 · submitted 2025-09-02 · 💻 cs.AI · cs.CL · cs.CV · cs.HC

Recognition: 2 theorem links

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Pith reviewed 2026-05-13 10:08 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.CV cs.HC
keywords GUI agents, multi-turn reinforcement learning, data flywheel, hybrid environments, unified sandbox, agent benchmarks, end-to-end agents

The pith

UI-TARS-2 reaches 88.2 on Online-Mind2Web and 59.8 mean game score by training a native GUI agent with multi-turn reinforcement learning and a data flywheel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UI-TARS-2 as a native GUI-centered agent that unifies perception, reasoning, action, and memory through end-to-end learning. It addresses open problems in data scalability, multi-turn RL stability, GUI-only limitations, and environment consistency by introducing a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox for large-scale rollouts. If these components work as described, the model produces measurable gains over its predecessor UI-TARS-1.5 and over strong baselines including Claude and OpenAI agents on GUI and game benchmarks while generalizing to long-horizon and software-engineering tasks.

Core claim

UI-TARS-2 achieves 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld, and a mean normalized score of 59.8 across a 15-game suite by applying a data flywheel, stabilized multi-turn RL, hybrid GUI environment, and unified sandbox; the same system also shows competitive results with frontier models on LMGame-Bench and extends to information-seeking and software-engineering benchmarks.

What carries the argument

The stabilized multi-turn reinforcement learning framework together with a data flywheel for scalable data generation and a hybrid GUI environment that adds file-system and terminal access inside a unified sandbox.

If this is right

  • Outperforms Claude and OpenAI agents on multiple GUI benchmarks while remaining competitive with OpenAI o3 on game suites.
  • Generalizes to long-horizon information-seeking tasks and software-engineering benchmarks without task-specific retraining.
  • Yields training-dynamics insights that support stable and efficient large-scale agent reinforcement learning.
  • Maintains roughly 60 percent of human-level performance across the 15-game evaluation suite.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hybrid environment may be the main factor that allows training signals from file and terminal actions to improve pure GUI performance.
  • Continued scaling of the data flywheel could close more of the remaining gap to human performance on long-horizon tasks.
  • The same combination of multi-turn RL and sandbox rollouts might transfer to non-GUI agent settings such as web navigation or code execution agents.

Load-bearing premise

The hybrid GUI environment and unified sandbox produce training and evaluation conditions that are stable and representative enough for the observed gains to transfer to real-world interactive scenarios outside the controlled benchmarks.

What would settle it

Running UI-TARS-2 on a fresh collection of real desktop and mobile tasks that lie outside the provided benchmarks and sandbox and measuring whether the reported score margins over prior models and proprietary agents are preserved.

Original abstract

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents UI-TARS-2, a native GUI-centered agent model trained via a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment integrating file systems and terminals, and a unified sandbox for large-scale rollouts. It reports substantial gains over UI-TARS-1.5, including 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld, and a mean normalized score of 59.8 across a 15-game suite, while claiming outperformance over baselines such as Claude and OpenAI agents and generalization to long-horizon and software engineering tasks.

Significance. If the performance gains prove attributable to the multi-turn RL stabilization and data flywheel rather than the expanded hybrid action space, the work would offer a useful empirical advance in scalable GUI agent training, providing benchmark numbers that can serve as reference points for future native agent models. The multi-benchmark evaluation and training dynamics analysis add value, though the hybrid setup's role requires clarification to support claims of GUI-centric generalization.

major comments (3)
  1. [Hybrid GUI environment and unified sandbox] Hybrid GUI environment description: the central claim of advancing GUI agents rests on the hybrid integration of file systems and terminals, yet no ablation isolates their contribution from pure GUI actions; without this, the reported deltas (e.g., 88.2 on Online-Mind2Web, 47.5 on OSWorld) may reflect richer action spaces rather than improved perception-reasoning loops, weakening the generalization assertion to real-world GUI scenarios.
  2. [Empirical evaluation] Empirical evaluation and results: benchmark scores are presented without error bars, standard deviations, number of evaluation runs, or data-exclusion criteria; this absence makes it impossible to assess statistical reliability of outperformance claims over UI-TARS-1.5 and proprietary baselines.
  3. [Training methodology] Multi-turn RL framework: the stabilized multi-turn RL is positioned as a core methodological advance, but the manuscript supplies no ablation on its components (e.g., reward shaping or turn-length handling) or concrete hyperparameters, leaving the source of training stability and the 59.8 game-suite score opaque.
minor comments (2)
  1. A consolidated table comparing all reported benchmarks against baselines (including UI-TARS-1.5, Claude, and OpenAI agents) would improve readability of the performance claims.
  2. The game-environment normalization procedure and the exact composition of the 15-game suite should be specified to allow direct replication of the 59.8 mean score.
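
The normalization question in minor comment 2 can be made concrete with a small sketch. The report does not state which normalization it uses, so the code below only illustrates one common convention (human-normalized scoring, where a random policy maps to 0 and a human reference maps to 100). The function names and all raw scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def human_normalized(agent: float, random_baseline: float, human: float) -> float:
    """Rescale a raw score so random play maps to 0 and the human reference to 100."""
    if human == random_baseline:
        raise ValueError("human and random baselines must differ")
    return 100.0 * (agent - random_baseline) / (human - random_baseline)


def mean_normalized(scores: dict[str, tuple[float, float, float]]) -> float:
    """Average per-game normalized scores; each value is (agent, random, human)."""
    return mean(human_normalized(a, r, h) for a, r, h in scores.values())


# Hypothetical raw scores for illustration only (not from the paper).
toy_suite = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (30.0, 10.0, 50.0),
    "game_c": (900.0, 100.0, 1100.0),
}

suite_score = mean_normalized(toy_suite)
```

Under this convention a suite mean near 60 reads as "about 60 percent of the human reference", but a different baseline choice (for example, no random-policy offset) would change the number, which is why the exact procedure matters for replication.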

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our technical report. We address each major point below with honest responses based on the current manuscript. Where the comments identify gaps, we commit to revisions that strengthen the paper without overstating what was originally presented.

Point-by-point responses
  1. Referee: Hybrid GUI environment description: the central claim of advancing GUI agents rests on the hybrid integration of file systems and terminals, yet no ablation isolates their contribution from pure GUI actions; without this, the reported deltas (e.g., 88.2 on Online-Mind2Web, 47.5 on OSWorld) may reflect richer action spaces rather than improved perception-reasoning loops, weakening the generalization assertion to real-world GUI scenarios.

    Authors: We agree that the manuscript does not contain an explicit ablation isolating the hybrid file-system and terminal components from pure GUI actions. The hybrid environment is presented as an integrated part of the unified sandbox to support realistic long-horizon tasks that require non-GUI operations, which aligns with our generalization claims. However, without dedicated ablations, attribution of the performance deltas remains correlational rather than causal. We will revise the manuscript to include a dedicated limitations paragraph and, where feasible, preliminary comparative runs that clarify the incremental value of the hybrid actions.
    Revision: partial

  2. Referee: Empirical evaluation and results: benchmark scores are presented without error bars, standard deviations, number of evaluation runs, or data-exclusion criteria; this absence makes it impossible to assess statistical reliability of outperformance claims over UI-TARS-1.5 and proprietary baselines.

    Authors: The referee is correct that the initial submission omitted error bars, standard deviations, run counts, and exclusion criteria. These statistics were collected during evaluation but not reported. We will add them to all main benchmark tables in the revision, along with explicit statements on the number of independent runs and any data filtering applied, to allow proper assessment of statistical reliability.
    Revision: yes

  3. Referee: Multi-turn RL framework: the stabilized multi-turn RL is positioned as a core methodological advance, but the manuscript supplies no ablation on its components (e.g., reward shaping or turn-length handling) or concrete hyperparameters, leaving the source of training stability and the 59.8 game-suite score opaque.

    Authors: We acknowledge that the manuscript describes the stabilized multi-turn RL framework at a high level but does not provide component ablations (e.g., on reward shaping or turn-length handling) or a full hyperparameter table. These details exist in our internal training logs but were not included in the submitted version. We will add both the requested ablations and a comprehensive hyperparameter appendix in the revision to make the sources of stability and the 59.8 score transparent.
    Revision: yes
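
The reliability statistics requested in major comment 2 (run count, standard deviation, error bars) could be reported along the following lines. This is a minimal sketch under stated assumptions: it treats each independent evaluation run as one success-rate value and uses a percentile bootstrap for the interval. The function name `summarize_runs` and the run values are hypothetical; the paper does not describe its evaluation protocol at this level.

```python
import random
from statistics import mean, stdev


def summarize_runs(scores, n_boot=2000, alpha=0.05, seed=0):
    """Return run count, mean, sample std, and a percentile-bootstrap CI for the mean."""
    scores = list(scores)
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two independent runs")
    rng = random.Random(seed)  # fixed seed so the summary is reproducible
    # Resample the runs with replacement and collect the mean of each resample.
    boot_means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = boot_means[round((alpha / 2) * n_boot)]
    hi = boot_means[round((1 - alpha / 2) * n_boot) - 1]
    return {"n_runs": n, "mean": mean(scores), "std": stdev(scores), "ci": (lo, hi)}


# Hypothetical success rates from five independent runs (not from the paper).
hypothetical_runs = [0.40, 0.45, 0.50, 0.55, 0.60]
summary = summarize_runs(hypothetical_runs)
```

Reporting the interval next to each benchmark score lets a reader judge whether a margin over a baseline exceeds run-to-run variation. With only a handful of runs the bootstrap interval is itself rough, so the run count should be stated alongside it.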

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are independent of internal definitions

full rationale

The paper presents a training methodology (data flywheel, multi-turn RL, hybrid environment, unified sandbox) and reports performance numbers on external public benchmarks (Online-Mind2Web 88.2, OSWorld 47.5, etc.). No equations, fitted parameters, or self-citations are shown that reduce these scores to quantities defined inside the training loop by construction. The central claims rest on measured deltas against independent baselines rather than renaming or self-referential derivations, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that the described training pipeline produces stable multi-turn behavior and that the chosen benchmarks measure general agent capability; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • multi-turn RL hyperparameters
    Learning rates, discount factors, and rollout lengths are chosen to stabilize training but are not enumerated in the abstract.
axioms (1)
  • domain assumption: Benchmark scores on Online-Mind2Web, OSWorld, WindowsAgentArena, AndroidWorld, and the 15-game suite accurately reflect real-world GUI agent performance.
    Invoked when the abstract equates higher benchmark numbers with advancement and generalization.
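
To show why the free parameters listed above matter, the sketch below demonstrates how a discount factor and a rollout-length cap shape the per-turn learning signal in a generic multi-turn RL setup. It is a textbook discounted-return computation, not the paper's algorithm. The `RLConfig` fields and their default values are placeholders chosen for illustration, since the report does not enumerate its hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    # All values are illustrative placeholders, not the paper's settings.
    learning_rate: float = 1e-6
    discount: float = 0.9
    max_turns: int = 4


def discounted_returns(rewards, cfg: RLConfig):
    """Compute the return-to-go for each turn, truncating at cfg.max_turns."""
    rewards = list(rewards)[: cfg.max_turns]
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + cfg.discount * running
        returns[t] = running
    return returns


# A sparse-reward trajectory: success is only rewarded on the third turn.
cfg = RLConfig(discount=0.5, max_turns=3)
example_returns = discounted_returns([0.0, 0.0, 1.0, 5.0], cfg)
```

In this toy trajectory the fourth turn is cut off by `max_turns`, and earlier turns receive geometrically smaller credit for the final reward. Both effects depend directly on the unreported settings, which is why the ledger counts them as a free parameter.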

pith-pipeline@v0.9.0 · 6064 in / 1331 out tokens · 30509 ms · 2026-05-13T10:08:29.204788+00:00 · methodology

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  2. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  3. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  4. Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

    cs.AI 2026-05 unverdicted novelty 7.0

    VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

  5. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  6. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  7. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  8. Faithful Mobile GUI Agents with Guided Advantage Estimator

    cs.AI 2026-05 unverdicted novelty 7.0

    Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

  9. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  10. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  11. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  12. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  13. How Mobile World Model Guides GUI Agents?

    cs.AI 2026-05 unverdicted novelty 6.0

    Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...

  14. SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.

  15. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  16. AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.

  17. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  18. Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.

  19. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  20. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 5.0

    An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  21. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  22. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  23. HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

  24. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

  25. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    cs.MA 2026-02 unverdicted novelty 4.0

    The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.

  26. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · cited by 23 Pith papers · 25 internal anchors

  1. [1]

    Introducing the model context protocol, 2024

    Anthropic. Introducing the model context protocol, 2024. URL https://www.anthropic.com/news/ model-context-protocol

  2. [2]

    Developing a computer use model.https://www.anthropic.com/news/developing-computer-use,

    Anthropic. Developing a computer use model.https://www.anthropic.com/news/developing-computer-use,

  3. [3]

    Product announcement

  4. [4]

    Claude 3.7 sonnet system card

    Anthropic. Claude 3.7 sonnet system card. 2025

  5. [5]

    Claude’s extended thinking, 2025

    anthropic. Claude’s extended thinking, 2025. URL https://www.anthropic.com/news/ visible-extended-thinking

  6. [6]

    Introducing claude 4, 2025

    anthropic. Introducing claude 4, 2025. URLhttps://www.anthropic.com/news/claude-4

  7. [7]

    Scaling data collection for training software engineering agents.Nebius blog, 2024

    Ibragim Badertdinov, Maria Trofimova, Yury Anapolskiy, Sergey Abramov, Karina Zainullina, Alexander Golubev, Sergey Polezhaev, Daria Litvintseva, Simon Karasik, Filipp Fisin, Sergey Skvortsov, Maxim Nekrashevich, Anton Shevtsov, and Boris Yangel. Scaling data collection for training software engineering agents.Nebius blog, 2024

  8. [8]

    SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents.arXiv preprint arXiv:2505.20411,

    Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URLhttps://arxiv.org/ abs/2505.20411

  9. [9]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35:24639–24654, 2022

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos.Advances in Neural Information Processing Systems, 35:24639–24654, 2022

  10. [10]

    The arcade learning environment: An evaluation platform for general agents.Journal of artificial intelligence research, 47:253–279, 2013

    Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.Journal of artificial intelligence research, 47:253–279, 2013

  11. [11]

    Windows agent arena: Evaluating multi-modal os agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows agent arena: Evaluating multi-modal os agents at scale. September 2024

  12. [12]

    Seed-thinking-1.6, 2025

    ByteDance. Seed-thinking-1.6, 2025. URLhttps://seed.bytedance.com/zh/seed1_6

  13. [13]

    Mindsearch: Mimicking human minds elicits deep ai searcher

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, and Feng Zhao. Mindsearch: Mimicking human minds elicits deep ai searcher, 2024. URLhttps://arxiv.org/abs/2407.20183

  14. [14]

    Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024

  15. [15]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  16. [16]

    Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv e-prints, pages arXiv–2409, 2024

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv e-prints, pages arXiv–2409, 2024

  17. [17]

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su

    Xiang Deng, Kelvin Guu, Panupong Pasupat, Afra Akyürek, Sheng Zhuang, Wenlong Chen, Tatsunori Hashimoto, Kelvin Guu, and Percy Liang. Mind2web: Towards a generalist agent for the web. InNeurIPS Datasets and Benchmarks, 2023. URLhttps://arxiv.org/abs/2306.06070

  18. [18]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge

    Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advancesin Neural Information Processing Systems, 35:18343–18362, 2022

  19. [19]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URLhttps: //arxiv.org/abs/2504.11536

  20. [20]

    Tora: A tool-integrated reasoning agent for mathematical problem solving.arXiv preprint arXiv:2309.17452,

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving, 2024. URLhttps://arxiv.org/abs/ 2309.17452. 24

  21. [21]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  23. [23]

    Owl: A large language model for it operations, 2024

    Hongcheng Guo, Jian Yang, Jiaheng Liu, Liqun Yang, Linzheng Chai, Jiaqi Bai, Junran Peng, Xiaorong Hu, Chao Chen, Dongfeng Zhang, Xu Shi, Tieqiao Zheng, Liangfan Zheng, Bo Zhang, Ke Xu, and Zhoujun Li. Owl: A large language model for it operations, 2024. URLhttps://arxiv.org/abs/2309.09298

  24. [24]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  25. [25]

    lmgame-bench: How good are llms at playing games?arXiv preprint arXiv:2505.15146,

    Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games?, 2025. URLhttps://arxiv.org/abs/ 2505.15146

  26. [26]

    Os agents: A survey on mllm-based agents for general computing devices use.arXiv preprint arXiv:2508.04482, 2025

    Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, et al. Os agents: A survey on mllm-based agents for general computing devices use.arXiv preprint arXiv:2508.04482, 2025

  27. [27]

    Manusearch: Democratizing deep search in large language models with a transparent and open multi-agent framework, 2025

    Lisheng Huang, Yichen Liu, Jinhao Jiang, Rongxiang Zhang, Jiahao Yan, Junyi Li, and Wayne Xin Zhao. Manusearch: Democratizing deep search in large language models with a transparent and open multi-agent framework, 2025. URLhttps://arxiv.org/abs/2505.18105

  28. [28]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  29. [29]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  30. [30]

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning.arXiv preprint arXiv:2205.00445, 2022

  31. [31]

    Websailor: Navigating super-human reasoning for web agent.arXiv preprint arXiv:2507.02592, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent, 2025. URL https://arxiv.org/abs/2507.02592

  32. [32]

    Jarvis-vla: Post-training large-scale vision language models to play visual games with keyboards and mouse. arXiv preprint arXiv:2503.16365, 2025

    Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, and Yitao Liang. Jarvis-vla: Post-training large-scale vision language models to play visual games with keyboards and mouse. arXiv preprint arXiv:2503.16365, 2025

  33. [33]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025. URL https://arxiv.org/abs/2501.05366

  34. [34]

    Torl: Scaling tool-integrated rl, 2025

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl, 2025. URL https://arxiv.org/abs/2503.23383

  35. [35]

    Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990, 2025

    Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990, 2025

  36. [36]

    Arpo: End-to-end policy optimization for gui agents with experience replay. arXiv preprint arXiv:2505.16282, 2025

    Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Arpo: End-to-end policy optimization for gui agents with experience replay. arXiv preprint arXiv:2505.16282, 2025

  37. [37]

    Repoagent: An llm-powered open-source framework for repository-level code documentation generation. arXiv preprint arXiv:2402.16667, 2024

    Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. Repoagent: An llm-powered open-source framework for repository-level code documentation generation. arXiv preprint arXiv:2402.16667, 2024. URL https://arxiv.org/abs/2402.16667

  38. [38]

    Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386–133442, 2024

    Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang. Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37:133386–133442, 2024

  39. [39]

    Human-level control through deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

  40. [40]

    Kimi-researcher: End-to-end rl training for emerging agentic capabilities

    MoonshotAI. Kimi-researcher: End-to-end rl training for emerging agentic capabilities. https://moonshotai.github.io/Kimi-Researcher/, 2025

  41. [41]

    Gui agents: A survey. arXiv preprint arXiv:2412.13501, 2024

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. arXiv preprint arXiv:2412.13501, 2024

  42. [42]

    OpenAI: Introducing ChatGPT, 2022

    OpenAI. OpenAI: Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt

  43. [43]

    Introducing gpt 5, 2025

    OpenAI. Introducing gpt 5, 2025. URL https://openai.com/index/introducing-gpt-5/

  44. [44]

    Introducing deep research - openai. https://openai.com/index/introducing-deep-research/, 2025

    OpenAI. Introducing deep research - openai. https://openai.com/index/introducing-deep-research/, 2025

  45. [45]

    Openai o3 and o4-mini system card

    OpenAI. OpenAI o3 and o4-mini system card. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf/, 2025

  46. [46]

    Computer-using agent (cua)

    OpenAI. Computer-using agent (cua). https://openai.com/index/computer-using-agent/, 2025. Research preview / blog

  47. [47]

    Operator, 2025

    OpenAI. Operator, 2025. URL https://openai.com/index/introducing-operator/

  48. [48]

    Training software engineering agents and verifiers with swe-gym, 2024

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https://arxiv.org/abs/2412.21139

  49. [49]

    Exploring mode connectivity for pre-trained language models

    Yujia Qin, Cheng Qian, Jing Yi, Weize Chen, Yankai Lin, Xu Han, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Exploring mode connectivity for pre-trained language models. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6726–6746, Abu Dhabi, United Arab Emir...

  50. [50]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  51. [51]

    Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

    Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025. URL https://arxiv...

  52. [52]

    Scaling Instructable Agents Across Many Simulated Worlds

    Maria Abi Raad, Arun Ahuja, Catarina Barros, Frederic Besse, Andrew Bolt, Adrian Bolton, Bethanie Brownfield, Gavin Buttimore, Max Cant, Sarah Chakera, et al. Scaling instructable agents across many simulated worlds. arXiv preprint arXiv:2404.10179, 2024

  53. [53]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2024. URL https://arxiv.org/abs/2405.14573

  54. [54]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  55. [55]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  56. [56]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  57. [57]

    Ui-tars-1.5. https://seed-tars.com/1.5, 2025

    ByteDance Seed. Ui-tars-1.5. https://seed-tars.com/1.5, 2025

  58. [58]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  59. [59]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025

  60. [60]

    Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  61. [61]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017

  62. [62]

    R1-searcher++: Incentivizing the dynamic knowledge acquisition of llms via reinforcement learning. arXiv preprint arXiv:2505.17005, 2025

    Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher++: Incentivizing the dynamic knowledge acquisition of llms via reinforcement learning. arXiv preprint arXiv:2505.17005, 2025

  63. [63]

    Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis. arXiv preprint arXiv:2505.16834, 2025

    Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, and Ji-Rong Wen. Simpledeepsearcher: Deep information seeking via web-powered reasoning trajectory synthesis. CoRR, abs/2505.16834, 2025. doi: 10.48550/ARXIV.2505.16834. URL https://doi.org/10.48550/arXiv.2505.16834

  64. [64]

    A survey on (m) llm-based gui agents. arXiv preprint arXiv:2504.13865, 2025

    Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, et al. A survey on (m) llm-based gui agents. arXiv preprint arXiv:2504.13865, 2025

  65. [65]

    Kimi K2: Open Agentic Intelligence

    Kimi Team. Kimi k2: Open agentic intelligence, 2025. URL https://arxiv.org/abs/2507.20534

  66. [66]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  67. [67]

    Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025

    The Terminal-Bench Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025. URL https://github.com/laude-institute/terminal-bench

  68. [68]

    Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

    Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

  69. [69]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

  70. [70]

    Acting less is reasoning more! teaching model to act efficiently, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently, 2025. URL https://arxiv.org/abs/2504.14870

  71. [71]

    Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024

  72. [72]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  73. [73]

    Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

    Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, et al. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  74. [74]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

  75. [75]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

  76. [76]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  77. [77]

    Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction. arXiv preprint arXiv:2412.04454, 2024

  78. [78]

    An illusion of progress? Assessing the current state of web agents

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. 2025. URL https://arxiv.org/abs/2504.01382

  79. [79]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024. URL https://arxiv.org/abs/2405.15793

  80. [80]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

Showing first 80 references.