arXiv: 2606.02031 · v2 · pith:HDDNTKDNnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Pith reviewed 2026-06-28 15:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV
keywords: visual web agents · online reinforcement learning · multi-turn RL · live-browser infrastructure · trajectory-level success judging · OpenWebRL · web agent training · dynamic websites

The pith

Online multi-turn RL on live websites trains a 4B visual web agent to 67.0% success on Online-Mind2Web, using only 0.4K initialization trajectories and 2.2K open-ended RL tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that online multi-turn reinforcement learning can train capable visual web agents directly on dynamic real-world websites without depending on large static collections of curated trajectories. It presents an open framework covering live-browser infrastructure, supervised initialization, multimodal context handling, trajectory-level success judging, and multi-turn policy optimization to generate usable reward signals. With this setup the resulting 4B model reaches 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger size and remaining competitive with proprietary systems. A reader would care because the approach removes the main scalability bottleneck of expensive demonstration collection and shows that modest numbers of open-ended tasks suffice for effective training on the live web.

Core claim

OpenWebRL is an open framework for training visual web agents via online multi-turn RL on real websites. It supplies the full pipeline of scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Trained with only 0.4K initialization trajectories and 2.2K open-ended RL tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, establishing new open-source state-of-the-art results while remaining competitive with proprietary systems such as OpenAI CUA and Gemini CUA. The work also examines the design choices that enable effective RL and analyzes how RL improves agentic reasoning.

What carries the argument

The OpenWebRL framework whose live-browser infrastructure and trajectory-level success judging supply the reward signals that support stable multi-turn policy optimization on changing websites.

If this is right

  • Visual web agents can be trained scalably without collecting large curated demonstration datasets.
  • Online RL directly on live sites improves agentic reasoning beyond what supervised post-training alone achieves.
  • Modest numbers of open-ended tasks suffice for effective multi-turn optimization when paired with trajectory-level rewards.
  • Open-source agents can reach performance levels competitive with proprietary systems through this training route.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same live-environment RL loop could be adapted to train agents for other interactive interfaces such as mobile apps.
  • The reported data efficiency suggests the method could lower the compute and annotation cost of building new web agents in resource-constrained settings.
  • Further experiments that vary the judging granularity might reveal whether finer-grained rewards would accelerate learning on complex sites.

Load-bearing premise

Trajectory-level success judging on live browsers supplies reward signals with low enough noise to support stable multi-turn policy optimization on dynamic real-world websites.
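
To make the premise concrete, here is a back-of-envelope model of judge noise. It is an illustration only, not something the paper states: assume the judge errs independently of the policy, with a false-positive rate \alpha and a false-negative rate \beta (both symbols are introduced here). If a policy's true success probability is p, the success rate the judge reports is

```latex
\tilde{p} \;=\; (1-\beta)\,p \;+\; \alpha\,(1-p) \;=\; \alpha + (1-\alpha-\beta)\,p ,
\qquad\text{so}\qquad
\tilde{p}_1 - \tilde{p}_2 \;=\; (1-\alpha-\beta)\,(p_1 - p_2).
```

Under these assumptions the judged gap between two policies shrinks by the factor (1 - \alpha - \beta), so the reward still ranks policies correctly whenever \alpha + \beta < 1, but the learning signal weakens as either error rate grows. Real judge errors may correlate with the policy's behavior, in which case this simple picture no longer holds.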

What would settle it

If the trained 4B agent shows success rates below 50% when evaluated on a fresh set of live websites whose interfaces were not encountered during the 2.2K RL tasks, the claim that the online RL pipeline produces effective and stable policies would be falsified.
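
Whether an observed success rate is really "below 50%" depends on how many tasks the fresh evaluation contains. The sketch below computes a Wilson score interval for a binomial success rate; the function name and the example counts are hypothetical and are not taken from the paper.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical fresh-site evaluation: 40 successes out of 100 tasks.
low, high = wilson_interval(40, 100)
print(f"observed 40.0%, interval [{low:.3f}, {high:.3f}]")
print("below 50% at this confidence level:", high < 0.5)
```

With a small task set the interval is wide, so a falsification test of this kind needs enough fresh tasks for the upper bound to fall clearly under the threshold.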

Original abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenWebRL, an open framework for training visual web agents via online multi-turn RL directly on live websites. It covers the full pipeline: scalable live-browser infrastructure, supervised initialization from 0.4K trajectories, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. The central empirical claim is that OpenWebRL-4B, after 2.2K open-ended RL tasks, reaches 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale while remaining competitive with proprietary systems such as OpenAI CUA and Gemini CUA. The work also examines key design choices and how RL improves agentic reasoning.

Significance. If the reported results hold under rigorous validation, the contribution would be significant: it demonstrates that online multi-turn RL can produce competitive visual web agents with far smaller data volumes than supervised post-training on curated trajectories, while releasing infrastructure, data, models, and code to enable reproducible open research. This directly addresses the scalability bottleneck highlighted in the abstract and provides concrete evidence on effective design choices for RL in dynamic web environments.

major comments (2)
  1. [§4 and §3.4] §4 (Experiments) and §3.4 (Success Judging): The headline performance numbers rest on the assumption that trajectory-level success judging supplies sufficiently low-noise rewards for stable multi-turn policy optimization. The manuscript must supply quantitative validation of the judge (e.g., agreement rate with human labels on a held-out set of trajectories, false-positive/false-negative rates on dynamic pages) to substantiate that credit assignment across long horizons is reliable; without this, the gains achieved with only 2.2K RL tasks remain difficult to attribute to effective RL rather than judge artifacts.
  2. [Table 1 / Results] Table 1 / Results: The reported success rates (67.0% and 64.0%) are presented without error bars, variance across seeds, or ablations isolating the contribution of the success judge versus other pipeline components. This omission is load-bearing because the central claim is that online RL succeeds with small data volumes; the absence of these controls leaves open the possibility that results are sensitive to judging noise or website state variability.
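
The judge validation requested in the first major comment could be computed along the following lines. This is a minimal sketch that assumes binary success labels from the automatic judge and from a human annotator on the same held-out trajectories. The `JudgeReport` and `evaluate_judge` names are hypothetical, and Cohen's kappa is added here as a chance-corrected agreement measure that the report itself does not ask for.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeReport:
    agreement: float            # fraction of trajectories where judge == human
    false_positive_rate: float  # judge says success, human says failure
    false_negative_rate: float  # judge says failure, human says success
    cohen_kappa: float          # agreement corrected for chance


def evaluate_judge(judge: list[bool], human: list[bool]) -> JudgeReport:
    """Compare automatic judge labels against human labels, pairwise."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    tp = sum(j and h for j, h in zip(judge, human))
    tn = sum((not j) and (not h) for j, h in zip(judge, human))
    fp = sum(j and (not h) for j, h in zip(judge, human))
    fn = sum((not j) and h for j, h in zip(judge, human))

    agreement = (tp + tn) / n
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")

    # Expected agreement if judge and human labelled independently.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (agreement - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return JudgeReport(agreement, fpr, fnr, kappa)


# Toy labels for eight trajectories (illustrative only).
judge_labels = [True, True, False, False, True, False, True, False]
human_labels = [True, False, False, False, True, True, True, False]
print(evaluate_judge(judge_labels, human_labels))
```

Stratifying the same computation by page type, for example static versus dynamic pages, would give the per-condition error rates the comment asks for.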
minor comments (2)
  1. [Abstract] Abstract: The phrase 'we systematically study the key design choices' is stated without enumerating them; a brief parenthetical list or forward reference to the relevant section would improve clarity.
  2. [§5] §5 (Analysis): The discussion of how RL improves agentic reasoning would benefit from explicit comparison of pre- and post-RL trajectories on the same tasks to illustrate the claimed improvements.
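
A quantitative complement to the pre- and post-RL comparison suggested in the second minor comment is a paired test over the same tasks. The sketch below applies an exact McNemar test to per-task success flags. The function name and the toy data are hypothetical, and the paper is not known to report such a test.

```python
from __future__ import annotations

from math import comb


def mcnemar_exact(pre: list[bool], post: list[bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-task outcomes.

    Returns (regressions, improvements, p_value), where regressions are tasks
    solved before RL but not after, and improvements are the reverse.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must cover the same tasks")
    regressions = sum(a and (not b) for a, b in zip(pre, post))
    improvements = sum((not a) and b for a, b in zip(pre, post))
    discordant = regressions + improvements
    if discordant == 0:
        return regressions, improvements, 1.0
    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(regressions, improvements)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return regressions, improvements, min(1.0, 2 * tail)


# Toy outcomes on six shared tasks (illustrative only).
pre_rl = [False, False, False, False, True, True]
post_rl = [True, True, True, True, True, False]
print(mcnemar_exact(pre_rl, post_rl))
```

Only tasks whose outcome changes carry information in this test, which is why the comparison needs to run both checkpoints on an identical task list.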

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for rigorous validation of the success judge and improved statistical reporting. We address each major comment below and will revise the manuscript accordingly to strengthen these aspects.

Point-by-point responses
  1. Referee: [§4 and §3.4] §4 (Experiments) and §3.4 (Success Judging): The headline performance numbers rest on the assumption that trajectory-level success judging supplies sufficiently low-noise rewards for stable multi-turn policy optimization. The manuscript must supply quantitative validation of the judge (e.g., agreement rate with human labels on a held-out set of trajectories, false-positive/false-negative rates on dynamic pages) to substantiate that credit assignment across long horizons is reliable; without this, the gains achieved with only 2.2K RL tasks remain difficult to attribute to effective RL rather than judge artifacts.

    Authors: We agree that quantitative validation of the judge is necessary to confidently attribute performance gains to the RL process. The manuscript describes the trajectory-level success judge in §3.4 but does not include human agreement metrics or error rate breakdowns. In the revised manuscript we will add a dedicated analysis in §3.4 reporting agreement rates with human labels on a held-out trajectory set together with false-positive and false-negative rates stratified by page dynamism. This addition will directly address concerns about reward noise and credit assignment reliability.

    Revision: yes

  2. Referee: [Table 1 / Results] Table 1 / Results: The reported success rates (67.0% and 64.0%) are presented without error bars, variance across seeds, or ablations isolating the contribution of the success judge versus other pipeline components. This omission is load-bearing because the central claim is that online RL succeeds with small data volumes; the absence of these controls leaves open the possibility that results are sensitive to judging noise or website state variability.

    Authors: We acknowledge that the absence of error bars, seed variance, and judge-specific ablations weakens the robustness claims. The current manuscript reports point estimates only. In the revision we will add error bars for the main results (computed from available repeated evaluations), report observed variance, and include an ablation that isolates the judge by comparing against alternative reward formulations. Due to the high cost of live-web RL runs, full multi-seed experiments are resource-intensive, so we will provide all feasible statistical controls and ablations rather than exhaustive ones.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training results with no derived quantities or self-referential fits

The paper reports direct empirical success rates (67.0% on Online-Mind2Web, 64.0% on DeepShop) obtained by running online multi-turn RL on live websites using 0.4K initialization trajectories and 2.2K RL tasks. No equations, parameter fits, or first-principles derivations are described; the central claims are measured benchmark outcomes rather than quantities obtained by construction from prior self-citations or normalizations. The trajectory-level success judge is presented as an engineering component of the framework, not as a fitted or self-defined predictor. This is a standard empirical RL paper whose performance numbers stand or fall on external replication, not on internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, no fitted constants, and no explicit background assumptions; full text needed to populate ledger entries.


Reference graph

Works this paper leans on

72 extracted references · 33 canonical work pages · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

    Mathieu Andreux, Breno Baldas Skuk, Hamza Benchekroun, Emilien Bir´ e, Antoine Bonnet, Riaz Bordie, Nathan Bout, Matthias Brunel, Pierre-Louis Cedoz, Antoine Chassang, et al. Surfer-h meets holo1: Cost-efficient web agent powered by open weights.arXiv preprint arXiv:2506.02865, 2025

  3. [3]

    Fara-7b: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025

    Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, and An- drew Zhao. Fara-7B: An efficient agentic model for computer use.arXiv preprint arXiv:2511.19663, 2025. 13 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for...

  4. [4]

    WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

    Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. WebGym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439, 2026

  5. [5]

    DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, 2024

  6. [6]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  7. [7]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation

    Hyungjoo Chae, Namyoung Kim, Kai Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. InInternational Conference on Learning Representations, volume 2025, pages 63707–63738, 2025

  8. [8]

    Era: Transforming vlms into embodied agents via embodied prior learning and online reinforcement learning.arXiv preprint arXiv:2510.12693, 2025

    Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, et al. Era: Transforming vlms into embodied agents via embodied prior learning and online reinforcement learning.arXiv preprint arXiv:2510.12693, 2025

  9. [9]

    CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training

    Yuxi Chen, Haoyu Zhai, Chenkai Wang, Rui Yang, Lingming Zhang, Gang Wang, and Huan Zhang. Captcha solving for native gui agents: Automated reasoning-action data generation and self-corrective training.arXiv preprint arXiv:2603.23559, 2026

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  11. [11]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

  12. [12]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  13. [13]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  14. [14]

    Navigating the digital world as humans do: Universal visual grounding for gui agents

    Boyu Gou, Demi Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. InInternational Conference on Learning Representations, volume 2025, pages 30851–30883, 2025

  15. [15]

    Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

  16. [16]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  17. [17]

    MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026

  18. [18]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In 14 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics...

  19. [19]

    Openwebvoyager: Building multimodal web agents via iterative real-world ex- ploration, feedback and optimization

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, and Dong Yu. Openwebvoyager: Building multimodal web agents via iterative real-world ex- ploration, feedback and optimization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27545–27564, 2025

  20. [20]

    Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

    Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

  21. [21]

    Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

  22. [22]

    Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.Advances in Neural Information Processing Systems, 38, 2026

    Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan Wang, and Kai-Wei Chang. Embodied web agents: Bridging physical-digital realms for integrated agent intelligence.Advances in Neural Information Processing Systems, 38, 2026

  23. [23]

    Rethinking memory mechanisms of foundation agents in the second half: A survey.arXiv preprint arXiv:2602.06052, 2026

    Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half: A survey.arXiv preprint arXiv:2602.06052, 2026

  24. [24]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  25. [25]

    Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

    Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

  26. [26]

    Visual-rft: Visual reinforcement fine-tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

  27. [27]

    AgentRewardBench: Evaluating automatic evaluations of web agent trajectories.arXiv preprint arXiv:2504.08942,

    Xing Han L` u, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Sta´ nczak, Peter Shaw, Christopher J Pal, and Siva Reddy. Agentrewardbench: Evaluating automatic evaluations of web agent trajectories.arXiv preprint arXiv:2504.08942, 2025

  28. [28]

    Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

  29. [29]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  30. [30]

    Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

    Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuying Chen. Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

  31. [31]

    Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling.Advances in Neural Information Processing Systems, 37:134387–134429, 2024

    Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling.Advances in Neural Information Processing Systems, 37:134387–134429, 2024

  32. [32]

    WebCanvas: Benchmarking Web Agents in Online Environments

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373, 2024. 15 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

  33. [33]

    Orchard: An Open-Source Agentic Modeling Framework

    Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandrio Sordoni, Xingdi Yuan, Yelong Shen, et al. Orchard: An open-source agentic modeling framework. arXiv preprint arXiv:2605.15040, 2026

  34. [34]

    Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, et al. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. InInternational Conference on Learning Representations, volume 2025, pages 79791–79821, 2025

  35. [35]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  36. [36]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  37. [37]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  38. [38]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  39. [39]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  40. [40]

    Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

    Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, and Ruslan Salakhutdinov. Insta: Towards internet-scale training for agents.arXiv preprint arXiv:2502.06776, 2025

  41. [41]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  42. [42]

    Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.Advances in Neural Information Processing Systems, 38:30865–30891, 2026

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.Advances in Neural Information Processing Systems, 38:30865–30891, 2026

  43. [43]

    Vagen: Reinforcing world model reasoning for multi-turn vlm agents.Advances in Neural Information Processing Systems, 38:172871–172933, 2026

    Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Yiping Lu, Zhengyuan Yang, Lijuan Wang, et al. Vagen: Reinforcing world model reasoning for multi-turn vlm agents.Advances in Neural Information Processing Systems, 38:172871–172933, 2026

  44. [44]

    WebXSkill: Skill Learning for Autonomous Web Agents

    Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, et al. Webxskill: Skill learning for autonomous web agents.arXiv preprint arXiv:2604.13318, 2026

  45. [45]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

  46. [46]

    Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning

    Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, et al. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7920–7939, 2025

  47. [47]

    Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143, 2025

    Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143, 2025. 16 OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

  48. [48]

    Os-atlas: Foundation action model for generalist gui agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: Foundation action model for generalist gui agents. InInternational Conference on Learning Representations, volume 2025, pages 5090–5108, 2025

  49. [49]

    Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction.arXiv preprint arXiv:2412.04454, 2024

  50. [50]

    An illusion of progress? assessing the current state of web agents

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. InSecond Conference on Language Modeling, 2025

  51. [51]

    Magma: A foundation model for multimodal ai agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. InProceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025

  52. [52]

    Agentoccam: A simple yet strong baseline for llm-based web agents

    Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, and Huzefa Rangwala. Agentoccam: A simple yet strong baseline for llm-based web agents. In International Conference on Learning Representations, volume 2025, pages 97533–97565, 2025

  53. [53]

    EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

    Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, et al. EmbodiedBench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv preprint arXiv:2502.09560, 2025

  54. [54]

    Regularizing hidden states enables learning generalizable reward model for LLMs

    Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for LLMs. Advances in Neural Information Processing Systems, 37:62279–62309, 2024

  55. [55]

    GUI-Libra: Training native GUI agents to reason and act with action-aware supervision and partially verifiable RL

    Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, and Tong Zhang. GUI-Libra: Training native GUI agents to reason and act with action-aware supervision and partially verifiable RL, 2026

  56. [56]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  57. [57]

    How do visual attributes influence web agents? A comprehensive evaluation of user interface design factors

    Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, and Huan Zhang. How do visual attributes influence web agents? A comprehensive evaluation of user interface design factors. arXiv preprint arXiv:2601.21961, 2026

  58. [58]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  59. [59]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

  60. [60]

    BEAT: Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning

    Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, and Daniel Kang. BEAT: Visual backdoor attacks on VLM-based embodied agents via contrastive trigger learning, 2026

  61. [61]

    AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework

    Hanchen Zhang, Xiao Liu, Bowen Lv, Xueqiao Sun, Bohao Jing, Iat Long Iong, Zhenyu Hou, Zehan Qi, Hanyu Lai, Yifan Xu, Rui Lu, Hongning Wang, Jie Tang, and Yuxiao Dong. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework, 2025

  62. [62]

    LlamaFactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Yixin Cao, Yang Feng, and Deyi Xiong, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics

  63. [63]

    DeepResearcher: Scaling deep research via reinforcement learning in real-world environments

    Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu. DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 414–431, 2025

  64. [64]

    Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents

    Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents. In International Conference on Machine Learning, 2025

  65. [65]

    WorkForceAgent-R1: Incentivizing reasoning capability in LLM-based web agents via reinforcement learning

    Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, and Chao Zhang. WorkForceAgent-R1: Incentivizing reasoning capability in LLM-based web agents via reinforcement learning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 34–49, 2026



