pith. sign in

arxiv: 2601.02439 · v6 · submitted 2026-01-05 · 💻 cs.LG · cs.CV

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Pith reviewed 2026-05-16 17:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords visual web agentsreinforcement learningvision-language modelsweb environmentstask scalingout-of-distribution generalizationagent rollouts
0
0 comments X

The pith

Fine-tuning on WebGym raises an open vision model's web agent success from 26% to 43% on unseen sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WebGym supplies nearly 300,000 rubric-scored tasks across real websites to train visual web agents. The authors apply a straightforward reinforcement learning method that improves the policy using the agent's own interaction traces and task rewards. They introduce an asynchronous rollout system that delivers 4-5x faster trajectory sampling to support large-scale training. Fine-tuning Qwen-3-VL-8B-Instruct on the expanded data set lifts success on an out-of-distribution test set of never-seen websites from 26.2% to 42.9%. The result exceeds agents built on GPT-4o and GPT-5-Thinking.

Core claim

WebGym is the largest open environment for realistic visual web agents, containing nearly 300,000 tasks with rubric-based evaluations on diverse real-world sites. A simple RL recipe that trains on the agent's own rollouts, accelerated by a high-throughput asynchronous sampling system, allows scaling task breadth and depth. Fine-tuning Qwen-3-VL-8B-Instruct on this data improves out-of-distribution success from 26.2% to 42.9%, outperforming proprietary-model agents at 27.1% and 29.8%.

What carries the argument

The WebGym environment with its rubric-evaluated tasks on real websites, combined with the asynchronous rollout system that speeds trajectory sampling.

Load-bearing premise

Rubric-based evaluations on real websites accurately capture meaningful task success and the out-of-distribution test set of never-seen sites is representative of broader web diversity.

What would settle it

Running the fine-tuned model on a new collection of real websites whose tasks are scored by human judges and finding success rates no higher than the base model or the GPT baselines.

Figures

Figures reproduced from arXiv: 2601.02439 by Alexey Taymanov, Aviral Kumar, Hao Bai, Spencer Whitehead, Tong Zhang.

Figure 1
Figure 1. Figure 1: Example rollouts from visual web agents trained on different training environments. Tasks in prior large-scale training setups were relatively simple, e.g., Test-Time-Interaction (TTI; Shen et al. (2025), Row 1), resulting in failures of trained agents on many held-out tasks (task shown: From ArXiv, access the website of the university that maintains and manages ArXiv. How many undergraduate students are c… view at source ↗
Figure 2
Figure 2. Figure 2: Task decomposition system. WebGym decomposes tasks by generating valid combinations of fact groups from the original task’s rubric. Decomposition requires ≥2 groups with at least one “large” group (≥3 facts). Each valid combination (excluding the full set) produces a new task with lower difficulty while maintaining consistency with the original objectives. on narrow domains; c) BrowseComp (Wei et al., 2025… view at source ↗
Figure 3
Figure 3. Figure 3: Statistics of the WebGym training task set. Left: website distribution as a function of the sorted index (from more tasks to less) of website. Right: difficulty distribution of the train and test task sets. The transparent bars over the original bar mean decomposed tasks, and the slashed bars means test set.] difficulty of 9. Since the decomposition conditions are met (3 groups ≥ 2, and G2 and G3 are “larg… view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of the WebGym task set. Left: distribution of tasks across domains according to the Mind2Web-2 taxonomy (Gou et al., 2025). Right: Distribution of trajectory lengths by task difficulty for answered trajectories over multiple iterations. For trajectories working on medium- and hard-difficulty tasks that exceeds 30 tasks during evaluation, we filter in only trajectories under 30 tasks to make the co… view at source ↗
Figure 5
Figure 5. Figure 5: Agreement between automated evaluators and human judgment. Rubric-based evaluation (with explicit criteria) consistently improves agreement over task-only evaluation, yielding higher accuracy and precision across LLM￾based evaluators. Among the evaluators, GPT-4o shows the largest shift after adding the rubric: precision increases the most while recall drops, indicating that the rubric makes GPT-4o apply s… view at source ↗
Figure 6
Figure 6. Figure 6: Asynchrony eliminates burst-idle behavior in web rollouts. Left: WebGym implements an asynchronous rollout system that (1) shortens single-trajectory collection duration by isolating rollout processes and (2) allows new rollout processes to join early by replacing batched inference with a process pool; the example shown is a toy case with 3 available environment buckets but 4 tasks waiting to roll out. Rig… view at source ↗
Figure 7
Figure 7. Figure 7: Benchmarking speed and throughput of the WebGym asynchronous rollout framework. Left: the WebGym framework boosts the rollout speed up significantly with a 4x-5x speedup. This figure shows time cost and average CPU utility percentage when reaching all 256 environments running w.r.t. different amounts of CPUs, while using the same amount of GPU resources for running inference on the VLM-based agent. The lim… view at source ↗
Figure 8
Figure 8. Figure 8: Ablations on base models, prompting, and action filtering. (left) test-set success rate curves of different models (Qwen3-VL￾Instruct-8B, Qwen3-VL-Thinking-8B, GPT-4o, and GPT-5-Thinking), and of Qwen3-VL-Instruct-8B under different constraints (either removing the memory prompt or removing the repetition penalty during RL). (right) test-set same-screenshot (repetitive inefficient action) rate and trajecto… view at source ↗
Figure 9
Figure 9. Figure 9: Exploring scaling dimensions of WebGym and difficulty-aware training. Test-set success rate curves under different variations to training: removing domains (“exclude domains”), tuning difficulty ratios (“uniform sampling”, “biased to hard”, “only easy”, “only medium”), and shortening the train-time step budget (“shorten horizon”). Results are reported for (a) Overall (all tasks), (b) Easy (1–3), (c) Medium… view at source ↗
Figure 10
Figure 10. Figure 10: WebGym implements an operation-specific queue system that balances the CPU and GPU machine usages. Here we illustrate the computational model of this pool system in the left subplot, and the comparison of the two queue designs on the right. We represent the centralized CPU queue with Queue, and the cascading queue system for with the initial letter of the request type, specifically, Navigating to a webpag… view at source ↗
Figure 11
Figure 11. Figure 11: Left: Number of trajectories successfully finished and crashed out under high CPU load when [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Trajectory for the comparison between vanilla evaluation and rubric-based evaluation. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
read the original abstract

We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WebGym, the largest open-source environment for training visual web agents, containing nearly 300,000 rubric-evaluated tasks across diverse real-world websites. It presents a simple RL recipe that trains on the agent's own rollouts with task rewards, accelerated by a custom high-throughput asynchronous rollout system achieving 4-5x speedup. Scaling task breadth and depth yields continued gains; fine-tuning Qwen-3-VL-8B-Instruct on WebGym raises success rate on an explicitly out-of-distribution test set of never-seen sites from 26.2% to 42.9%, outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%).

Significance. If the empirical claims hold, the work supplies the first large-scale, open, real-website training resource for visual web agents together with a reproducible RL scaling recipe and a concrete demonstration that such training can surpass closed proprietary models on generalization to unseen sites. The open release and the reported 4-5x rollout acceleration are concrete assets for the community.

major comments (3)
  1. [Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.
  2. [Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.
  3. [Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.
minor comments (2)
  1. [Abstract] The abstract states 'nearly 300,000 tasks'; the main text should give the exact count and a breakdown by difficulty level and website category.
  2. [Rollout System] The asynchronous rollout system is credited with a 4-5x speedup, but the paper would benefit from a brief pseudocode or timing table showing where the gains arise (e.g., parallel browser instances, caching).

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thorough review. We have carefully considered each comment and revised the manuscript accordingly to enhance the transparency and reproducibility of our results. Below we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.

    Authors: We agree that additional details on the evaluation rubrics are necessary to substantiate the central claim. In the revised manuscript, we have expanded the Evaluation section to include multiple example rubrics for representative tasks, a step-by-step description of the rubric construction process involving domain experts, and inter-rater reliability results from a pilot study with three annotators on 150 tasks yielding a Cohen's kappa of 0.87. We believe these additions address the concern regarding potential lenient scoring. revision: yes

  2. Referee: [Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.

    Authors: We acknowledge the importance of these experimental details for reproducibility. The revised paper now reports standard deviations computed over three independent training runs for the main results. We have also included the precise fine-tuning hyperparameters used: a learning rate of 2e-5 with cosine decay, batch size of 64, and training for 4 epochs. Furthermore, we added an ablation experiment showing performance as a function of rollout count (from 50k to 300k tasks), demonstrating that gains continue to scale with more data. revision: yes

  3. Referee: [Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.

    Authors: We thank the referee for this suggestion. To better support the out-of-distribution claim, we have incorporated quantitative metrics in the revised Task Set Construction section. This includes the coverage of site categories (with percentages for train and test), average DOM similarity scores using normalized tree-edit distance (0.12 for test vs train), and navigation pattern overlap via cosine similarity of action embedding vectors (average 0.18). These low similarity scores support that the test sites are indeed distinct from the training distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical results on OOD test set

full rationale

The paper reports direct empirical measurements: construction of WebGym (~300k tasks with rubric evaluations), RL training on agent rollouts, and success-rate evaluation on an explicitly out-of-distribution test set of never-seen websites (26.2% → 42.9%). No equations, parameter fits, or self-referential definitions reduce the reported success rates to inputs by construction. The pipeline is standard RL scaling plus held-out evaluation and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions plus the domain assumption that rubric scores on real websites provide reliable learning signals; no explicit free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Rubric-based task rewards provide sufficient and accurate feedback for policy improvement
    The RL recipe uses task rewards from rubrics to guide learning on real websites.

pith-pipeline@v0.9.0 · 5560 in / 1428 out tokens · 36973 ms · 2026-05-16T17:27:17.993095+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  2. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  3. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.

  4. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  5. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 5 Pith papers

  1. [1]

    Include the current step

    Memory: facts you would like to memorize for future actions in json format. Include the current step

  2. [2]

    Include progress of the current step

    Progress: Decompose the task into subtasks and what has been finished so far with json format. Include progress of the current step

  3. [3]

    Intention: clearly state which subtask you’re working on at this step with the json key

  4. [4]

    Action: a short sentence describing what to do in the UI to accomplish the next subtask

  5. [5]

    Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>

    A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>. − You MUST use json format for the Memory and Progress parts. − Example Task: ”Search and compare the prices and locations of product 1 and produc...

  6. [6]

    WebGym: Scaling Training Environments for Visual Web Agents30

    Action: a short sentence describing what to do in the UI. WebGym: Scaling Training Environments for Visual Web Agents30

  7. [7]

    Rules: − Output exactly in the order: Action,<tool call>

    A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Action,<tool call>. − Action describes the high−level intention of the tool call within a single sentence. − Do not output anything else outside those two parts. {{/if w/o Memory Prompt}} =======...

  8. [8]

    **Reasoning**: [One sentence explanation]

  9. [9]

    **Decision**: [YES or NO] User Prompt: **Task**:{task} **Key Points for Task Completion**: {eval rubric} The snapshot of the web page is shown in the image. Does this image contain relevant information for the task? (Answer YES unless it’s completely irrelevant) Where: −{task}= task.task name (the task description) −{eval rubric}= List of all criteria fro...

  10. [10]

    Task Instruction: The original task description (provided for CONTEXT ONLY)

  11. [11]

    Fact Group: A group of related facts decomposed from the task instruction (provided for CONTEXT ONLY)

  12. [12]

    Fact to Check: A specific fact that you need to verify (THIS IS YOUR PRIMARY FOCUS)

  13. [13]

    Trajectory: A complete list of observations and actions that were taken by the agent

  14. [14]

    You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact

    Result Screenshots: Visual representation of the screen showing the result or intermediate state CRITICAL: Your judgment should ONLY focus on whether the FACT TO CHECK can be verified by the screenshots. You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact. Guidelines for evaluation: −− Your primary respon...

  15. [15]

    Analysis: [Describe what evidence you see in the screenshots related to the fact to check]

  16. [16]

    Evaluation: (MUST end with line ”2

    Verdict: [SUCCESS if the fact is verified by screenshots, NOT SUCCESS otherwise] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction (for context only): [task instruction] Fact Group (for context on...

  17. [17]

    Task Instruction: The task the agent was trying to complete

  18. [18]

    Final Response: The agent’s answer/response to the task

  19. [19]

    Agents frequently hallucinate or make up answers that are not verified by what they actually saw

    Result Screenshots: Visual representation of the screens the agent visited CRITICAL: Your job is to check if the agent’s response contains information that is NOT shown in the screenshots. Agents frequently hallucinate or make up answers that are not verified by what they actually saw. Guidelines for evaluation: −− Check whether EVERY claim in the agent’s...

  20. [20]

    Claims in response: [List the specific claims/facts in the agent’s response]

  21. [21]

    Screenshot verification: [For each claim, state whether it appears in the screenshots]

  22. [22]

    Evaluation: (MUST end with line ”3

    Verdict: [SUCCESS if ALL claims are verified by screenshots, NOT SUCCESS if ANY claim is not verified] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction: [task instruction] Agent’s Final Response:...

  23. [23]

    Analysis: [Describe what you see in the screenshots − any EXPLICIT signs of anti−bot blocking measures?]

  24. [24]

    Did the website block the agent? (MUST end with line ”2

    Blocked: [YES or NO] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task: [task] Trajectory: [trajectory] Screenshots: attached. Did the website block the agent? (MUST end with line ”2. Blocked: [YES or NO]”) W...

  25. [25]

    Break down the task into hierarchical fact groups, where each group contains specific facts to verify

  26. [26]

    Each fact represents ONE verifiable piece of information that can be checked in a trajectory

  27. [27]

    Treat the entire computation as a SINGLE fact

    The overall task difficulty equals the TOTAL NUMBER of all facts across all groups Key principles: − FIRST analyze the task to identify logical groupings of related information − For each group, list ALL specific facts that need to be verified − Facts should be AS DETAILED AS POSSIBLE − break down complex requirements into individual checkable facts − Eac...

  28. [28]

    Create a new task description that incorporates ONLY the requirements explicitly listed in the selected fact groups above

  29. [29]

    The new task should read naturally and be self−contained

  30. [30]

    Do NOT include requirements from fact groups that were not selected

  31. [31]

    Maintain the original context and domain of the task

  32. [32]

    **CRITICAL**: Do NOT add any information that is not explicitly present in the selected fact groups above, even if that information appears in the original task Important Guidelines: − If a fact group mentions ”concert” but does NOT mention ”upcoming” or ”in the US or Canada”, do NOT include those constraints in the generated task − ONLY use constraints a...