WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
Pith reviewed 2026-05-16 17:27 UTC · model grok-4.3
The pith
Fine-tuning on WebGym raises an open vision model's web agent success from 26% to 43% on unseen sites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebGym is the largest open environment for realistic visual web agents, containing nearly 300,000 tasks with rubric-based evaluations on diverse real-world sites. A simple RL recipe that trains on the agent's own rollouts, accelerated by a high-throughput asynchronous sampling system, allows scaling task breadth and depth. Fine-tuning Qwen-3-VL-8B-Instruct on this data improves out-of-distribution success from 26.2% to 42.9%, outperforming proprietary-model agents at 27.1% and 29.8%.
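One way to read "trains on the agent's own rollouts" with task rewards as feedback is reward-filtered fine-tuning, where only rollouts judged successful are kept as training data. The sketch below illustrates that reading only; the paper's actual objective may differ. The names `Rollout`, `sample_rollout`, and `select_training_data`, the binary reward, and the 0.3 success probability are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    task_id: str
    actions: list[str]
    reward: float  # assumed binary: 1.0 if the rubric judged the task solved


def sample_rollout(task_id: str, rng: random.Random) -> Rollout:
    """Hypothetical stand-in for running the current policy in a browser."""
    actions = [rng.choice(["click", "type", "scroll"]) for _ in range(3)]
    reward = 1.0 if rng.random() < 0.3 else 0.0  # arbitrary success probability
    return Rollout(task_id, actions, reward)


def select_training_data(rollouts: list[Rollout]) -> list[Rollout]:
    """Keep only the rollouts that received a positive task reward."""
    return [r for r in rollouts if r.reward > 0.0]


rng = random.Random(0)
rollouts = [sample_rollout(f"task-{i}", rng) for i in range(100)]
batch = select_training_data(rollouts)
print(f"kept {len(batch)} of {len(rollouts)} rollouts for the next update")
```

In a full loop, the kept rollouts would feed a fine-tuning step and the updated policy would sample the next round.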
What carries the argument
The WebGym environment with its rubric-evaluated tasks on real websites, combined with the asynchronous rollout system that speeds trajectory sampling.
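The intuition behind asynchronous rollout sampling can be sketched with Python's `asyncio`: browser steps are mostly waiting on I/O, so many episodes can overlap instead of running one after another. This is a minimal illustration, not the paper's system. `run_episode`, `collect_rollouts`, the simulated delay, and the concurrency cap are assumptions.

```python
import asyncio


async def run_episode(task_id: int, step_delay: float = 0.01, steps: int = 3) -> dict:
    """Hypothetical episode: each step awaits a simulated page load."""
    for _ in range(steps):
        await asyncio.sleep(step_delay)  # stands in for browser and network I/O
    return {"task_id": task_id, "steps": steps}


async def collect_rollouts(task_ids: list[int], max_concurrent: int = 8) -> list[dict]:
    """Run episodes concurrently, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(task_id: int) -> dict:
        async with semaphore:
            return await run_episode(task_id)

    # gather returns results in the same order as the submitted tasks
    return await asyncio.gather(*(bounded(t) for t in task_ids))


results = asyncio.run(collect_rollouts(list(range(16))))
print(len(results), "episodes collected")
```

The semaphore caps how many simulated browsers run at once, which is the usual knob for trading throughput against memory in a setup like this.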
Load-bearing premise
Rubric-based evaluations on real websites accurately capture meaningful task success and the out-of-distribution test set of never-seen sites is representative of broader web diversity.
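Why the premise matters can be seen in how per-criterion verdicts are aggregated. The sketch below contrasts a strict all-criteria rule with a partial-credit rule on the same verdicts. Which rule WebGym uses is not stated in this summary, and the function names and example criteria are hypothetical.

```python
def strict_success(verdicts: dict[str, bool]) -> bool:
    """Task counts as solved only if every rubric criterion is verified."""
    return bool(verdicts) and all(verdicts.values())


def partial_credit(verdicts: dict[str, bool]) -> float:
    """Fraction of rubric criteria that were verified."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


# Hypothetical per-criterion verdicts for one trajectory.
verdicts = {
    "found price of product 1": True,
    "found price of product 2": True,
    "compared store locations": False,
}

print("strict:", strict_success(verdicts))
print("partial:", round(partial_credit(verdicts), 2))
```

The same trajectory fails under the strict rule yet earns most of the credit under the partial rule, so a reported success rate is only interpretable once the aggregation rule is known.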
What would settle it
Running the fine-tuned model on a new collection of real websites whose tasks are scored by human judges and finding success rates no higher than the base model or the GPT baselines.
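Such a comparison would typically come with a significance check. A minimal sketch, assuming independent tasks and a pooled two-proportion z-test, follows; the counts in the usage lines are invented placeholders, not results from the paper or from any human study.

```python
import math


def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both success rates are equal."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # all successes or all failures: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Placeholder counts for illustration only.
z, p = two_proportion_z(succ_a=60, n_a=100, succ_b=40, n_b=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```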
Original abstract
We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. First, to enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many prior works on training visual web agents.
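A test set made up only of never-seen websites implies splitting tasks by site rather than by task. The sketch below shows one way to do that by grouping on hostname. The function name, the task dictionary layout, and the example URLs are assumptions, not the paper's construction procedure.

```python
import random
from urllib.parse import urlparse


def split_by_site(tasks: list[dict], test_fraction: float, seed: int = 0):
    """Split tasks so that no hostname appears in both train and test."""
    sites = sorted({urlparse(t["url"]).netloc for t in tasks})
    rng = random.Random(seed)
    rng.shuffle(sites)
    n_test = max(1, round(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])
    train = [t for t in tasks if urlparse(t["url"]).netloc not in test_sites]
    test = [t for t in tasks if urlparse(t["url"]).netloc in test_sites]
    return train, test


# Hypothetical tasks on placeholder domains.
tasks = [
    {"url": "https://shop.example.com/item/1", "task": "find the price"},
    {"url": "https://shop.example.com/item/2", "task": "compare two items"},
    {"url": "https://news.example.org/today", "task": "find the top headline"},
    {"url": "https://maps.example.net/search", "task": "find the nearest store"},
    {"url": "https://wiki.example.edu/page", "task": "look up a definition"},
]
train, test = split_by_site(tasks, test_fraction=0.25)
print(len(train), "train tasks,", len(test), "test tasks")
```

Splitting on hostname guarantees disjoint sites but says nothing about how similar the held-out sites are to the training ones, which is the gap the referee's third major comment points at.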
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebGym, the largest open-source environment for training visual web agents, containing nearly 300,000 rubric-evaluated tasks across diverse real-world websites. It presents a simple RL recipe that trains on the agent's own rollouts with task rewards, accelerated by a custom high-throughput asynchronous rollout system achieving 4-5x speedup. Scaling task breadth and depth yields continued gains; fine-tuning Qwen-3-VL-8B-Instruct on WebGym raises success rate on an explicitly out-of-distribution test set of never-seen sites from 26.2% to 42.9%, outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%).
Significance. If the empirical claims hold, the work supplies the first large-scale, open, real-website training resource for visual web agents together with a reproducible RL scaling recipe and a concrete demonstration that such training can surpass closed proprietary models on generalization to unseen sites. The open release and the reported 4-5x rollout acceleration are concrete assets for the community.
major comments (3)
- [Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.
- [Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.
- [Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.
minor comments (2)
- [Abstract] The abstract states 'nearly 300,000 tasks'; the main text should give the exact count and a breakdown by difficulty level and website category.
- [Rollout System] The asynchronous rollout system is credited with a 4-5x speedup, but the paper would benefit from brief pseudocode or a timing table showing where the gains arise (e.g., parallel browser instances, caching).
Simulated Author's Rebuttal
Thank you for the thorough review. We have carefully considered each comment and revised the manuscript accordingly to enhance the transparency and reproducibility of our results. Below we provide point-by-point responses to the major comments.
- Referee: [Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.
  Authors: We agree that additional details on the evaluation rubrics are necessary to substantiate the central claim. In the revised manuscript, we have expanded the Evaluation section to include multiple example rubrics for representative tasks, a step-by-step description of the rubric construction process involving domain experts, and inter-rater reliability results from a pilot study with three annotators on 150 tasks yielding a Cohen's kappa of 0.87. We believe these additions address the concern regarding potential lenient scoring.
  Revision: yes
- Referee: [Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.
  Authors: We acknowledge the importance of these experimental details for reproducibility. The revised paper now reports standard deviations computed over three independent training runs for the main results. We have also included the precise fine-tuning hyperparameters used: a learning rate of 2e-5 with cosine decay, batch size of 64, and training for 4 epochs. Furthermore, we added an ablation experiment showing performance as a function of rollout count (from 50k to 300k tasks), demonstrating that gains continue to scale with more data.
  Revision: yes
- Referee: [Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.
  Authors: We thank the referee for this suggestion. To better support the out-of-distribution claim, we have incorporated quantitative metrics in the revised Task Set Construction section. This includes the coverage of site categories (with percentages for train and test), average DOM similarity scores using normalized tree-edit distance (0.12 for test vs train), and navigation pattern overlap via cosine similarity of action embedding vectors (average 0.18). These low similarity scores support that the test sites are indeed distinct from the training distribution.
  Revision: yes
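Two of the statistics named in the responses above can be sketched in a few lines. Cohen's kappa is defined for a pair of raters, so a three-annotator figure would presumably be an average over pairs or a multi-rater statistic such as Fleiss' kappa; the rebuttal does not say which. The labels and vectors below are made up for illustration and do not reproduce any reported value.

```python
import math
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Made-up verdicts from two hypothetical annotators.
rater_1 = ["SUCCESS", "SUCCESS", "FAIL", "FAIL"]
rater_2 = ["SUCCESS", "FAIL", "FAIL", "FAIL"]
print("kappa:", cohen_kappa(rater_1, rater_2))
print("cosine:", cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```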
Circularity Check
No significant circularity in empirical results on OOD test set
The paper reports direct empirical measurements: construction of WebGym (~300k tasks with rubric evaluations), RL training on agent rollouts, and success-rate evaluation on an explicitly out-of-distribution test set of never-seen websites (26.2% → 42.9%). No equations, parameter fits, or self-referential definitions reduce the reported success rates to inputs by construction. The pipeline is standard RL scaling plus held-out evaluation and is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: rubric-based task rewards provide sufficient and accurate feedback for policy improvement.
Forward citations
Cited by 5 Pith papers
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
  RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
- Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
  Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
  Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
  Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.