Fine-tuning Qwen-3-VL-8B on WebGym's 300k real-web tasks raises out-of-distribution success from 26.2% to 42.9%, beating GPT-4o (27.1%) and GPT-5 (29.8%).
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks