pith. sign in

arxiv: 2605.25624 · v2 · pith:JCLACOT4new · submitted 2026-05-25 · 💻 cs.AI · cs.LG

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Pith reviewed 2026-06-29 21:38 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords computer-use agentsreinforcement learning with verifiable rewardsCUA-GymOSWorldWebArenaagent training environmentsscalable data synthesisverifiable rewards
0
0 comments X

The pith

CUA-Gym uses generator-discriminator-orchestrator loops and filters to produce 32,112 verified RLVR training tuples across 110 environments for computer-use agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to overcome the lack of scalable, deterministic-reward data for training computer-use agents with reinforcement learning. It introduces a pipeline in which separate agents generate task instructions and environment states, write reward functions, and iterate until consistency is reached, followed by a voting-and-rollout quality filter. The resulting dataset, built on top of a new collection of mock web applications, lets models trained with GSPO reach 62.1 percent and 72.6 percent on OSWorld-Verified while also improving on the held-out WebArena benchmark. Performance improves steadily as the amount of data and the number of environments increase.

Core claim

CUA-Gym co-generates task instructions, initial and golden environment states, and reward functions through an iterative Generator-Discriminator-Orchestrator loop; the resulting tuples are further vetted by LLM majority voting plus agent rollouts. This process yields 32,112 high-fidelity RLVR training examples grounded in 110 environments from the synthesized CUA-Gym-Hub suite. Models trained on the data with GSPO outperform prior open-source computer-use agents at similar scales on OSWorld-Verified and show positive transfer to WebArena.

What carries the argument

The Generator-Discriminator-Orchestrator iterative loop plus final LLM-voting and rollout filter that together produce deterministic, verifiable task-environment-reward tuples.

If this is right

  • Model performance on OSWorld-Verified scales smoothly with both the volume of training tuples and the diversity of environments.
  • Checkpoints trained on CUA-Gym also improve accuracy on the held-out WebArena benchmark.
  • CUA-Gym-Hub expands the pool of high-fidelity mock web environments by an order of magnitude over prior collections.
  • The full pipeline, dataset, environments, and trained models are released for further use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-generation loop could be applied to other agent domains that currently lack verifiable rewards, such as desktop or mobile interfaces.
  • If the LLM-based filter introduces subtle biases, increasing model scale during data synthesis might amplify rather than reduce those biases.
  • Adding environments whose action distributions more closely match real user logs could further strengthen transfer to production settings.

Load-bearing premise

The language models inside the generator, discriminator, and filter produce reward functions that remain deterministic and free of systematic biases across repeated executions.

What would settle it

Running the same generated reward function on identical agent trajectories multiple times and observing inconsistent pass/fail outcomes.

Figures

Figures reproduced from arXiv: 2605.25624 by Bowen Wang, Dayiheng Liu, Dunjie Lu, HaiQuan Wang, Hao Hu, Junli Wang, Junyang Lin, Que Shen, Shixuan Liu, Shuai Bai, Tao Yu, Tianbao Xie, Tianyi Bai, Zhipeng Zhang.

Figure 1
Figure 1. Figure 1: Overview of the CUA-GYM data synthesis pipeline. From a task instruction and its grounded context, the Orchestrator provisions paired virtual machines and spawns two adversarially coupled agents. The Generator writes initial_setup.py and golden_patch.py to construct the initial and golden states. The Discriminator, isolated from the Generator’s scripts by an information barrier, independently drafts reward… view at source ↗
Figure 2
Figure 2. Figure 2: Multi-agent pipeline for mock environment synthesis. From a target application seed, the Plan Agent drafts DESIGN.md and TODO.md specifying features, data schema, API protocol, and UI layout. The Dev Agent implements the single-page application from these specifications. The Web Agent then exercises every interactive element via Playwright, comparing the live DOM against the spec and feeding discrepancies … view at source ↗
Figure 3
Figure 3. Figure 3: State-injected environments in CUA-GYM-HUB. The same mail mock can be instantiated with distinct task-specific initial states: an empty inbox, a project-deadline state with pending coordination threads, and a high-volume backlog. Because all mutations and resets are scoped to a session state, parallel rollout workers can share the same application implementation while observing isolated world states. 3 Exp… view at source ↗
Figure 4
Figure 4. Figure 4: Training scaffold for long-horizon CUA trajectories. Instead of training on a single sliding window that drops older interaction state, trajectory slicing keeps recent multimodal observations active while collapsing earlier screenshots into deterministic placeholders. This preserves late-trajectory supervision under a fixed context budget and exposes stable prefixes that are friendly to KV-cache reuse duri… view at source ↗
Figure 5
Figure 5. Figure 5: Main results. (a) Per-domain success rates on OSWorld-Verified for CUA-GYM-A3B (red) versus its SFT-initialized base (gray, dashed); n denotes task count per domain. Largest gains on multi￾app workflows (+21.5 pp), libreoffice_calc (+14.9 pp), and vs_code (+13.6 pp). (b) Success rates on OSWorld-Verified and WebArena. RLVR on CUA-GYM data improves both Qwen3.5-35B-A3B and Qwen3.5-397B-A17B bases, with CUA-… view at source ↗
Figure 6
Figure 6. Figure 6: CUA-GYM data and environments. (a) CUA-GYM spans 110 environments, including CUA￾GYM-HUB web environments aligned with O*NET occupational categories [20]. (b) The 32,112 verified tuples cover application categories evenly (no category exceeds 21%) and are skewed toward harder (45%) and cross-application (38%) tasks. (c) Compared to existing RLVR datasets, CUA-GYM is the largest open-released collection and… view at source ↗
Figure 7
Figure 7. Figure 7: Data scaling of RL training on CUA-GYM. OSWorld-Verified score (left) and training reward (right) vs. RL training step on Qwen3.5-35B-A3B for three CUA-GYM training subsets (1.4K, 3K, 12K verified tuples), all initialized from the same SFT checkpoint (gray dashed line at 0.53). Faint lines show raw per-step values; bold lines are exponentially smoothed. and the overall shape of the training curve, indicati… view at source ↗
Figure 8
Figure 8. Figure 8: Environment scaling under teacher dis￾tillation. OSWorld-Verified score versus number of training environments at 100-step evaluation budgets. The broad setting (80 envs, 75 trajectories each) outperforms the narrow setting (10 envs, 300 each) despite using 4× fewer trajectories per envi￾ronment. The data-scaling results of §4.1 establish trajectory volume as one axis along which CUA-GYM data drives RL per… view at source ↗
Figure 9
Figure 9. Figure 9: Emergent action batching during RL. Average tool calls per model step on CUA-GYM￾A3B. The SFT-initialized policy emits approxi￾mately one call per step; RL drives this to a sta￾ble 1.4–1.9 band. A behavior we did not optimize for emerges in CUA￾GYM RL training: the policy spontaneously learns to issue multiple tool calls per turn, compressing what begin as one-action-per-turn rollouts into denser multi￾act… view at source ↗
Figure 10
Figure 10. Figure 10: Landing-page screenshots of 32 representative released mock web applications, organized by thematic pair-of-rows: communication and meetings (rows 1–2), productivity and collaboration (rows 3–4), engineering and developer tooling (rows 5–6), and commerce / cloud / social (rows 7–8). Each mock reproduces the surface visual layout, navigation tree, and primary-feature inventory of its named reference applic… view at source ↗
read the original abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications and LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CUA-Gym, a pipeline that uses a Generator-Discriminator-Orchestrator loop to co-generate task instructions, environment states, and reward functions for computer-use agents, followed by an LLM majority-vote plus rollout filter to produce 32,112 verified RLVR training tuples across 110 environments (including the synthesized CUA-Gym-Hub mock web apps). Models trained via GSPO on this data (CUA-Gym-A3B and CUA-Gym-A17B) reach 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at similar scales, with claimed smooth scaling in data volume and environment diversity plus positive transfer to the held-out WebArena benchmark. The full pipeline, dataset, environments, and models are to be open-sourced.

Significance. If the generated rewards are shown to be deterministic and free of systematic LLM-induced biases, the work would meaningfully advance RLVR for CUAs by removing the data bottleneck that has limited prior efforts. The explicit open-sourcing of the synthesis pipeline, the 32k-tuple dataset, the CUA-Gym-Hub environments, and the trained models is a concrete strength that supports reproducibility. The reported smooth scaling trends and cross-benchmark transfer, if robust, would provide useful empirical evidence for the value of environment diversity in this domain.

major comments (2)
  1. [abstract and pipeline description] Abstract and pipeline description: the central performance claims (62.1%/72.6% on OSWorld-Verified, scaling, WebArena transfer) rest on the assertion that the Generator-Discriminator-Orchestrator loop plus LLM-voting/rollout filter produces deterministic, unbiased rewards; however, no quantitative metrics are supplied on filter error rates, inter-voter agreement, or failure modes, nor is there comparison against human-verified ground truth or formal specifications. This directly affects the verifiability premise required for the RLVR results.
  2. [results and evaluation sections] Results and evaluation sections: no ablation studies isolate the contribution of the reward filter (or the number of environments) from other factors such as data volume or model scale; without these, it is difficult to attribute the observed gains specifically to the claimed deterministic rewards rather than potential artifacts in the LLM-generated reward functions.
minor comments (2)
  1. [methods] Clarify the exact prompting templates and temperature settings used for the Discriminator and voting stages to improve reproducibility.
  2. [dataset construction] Add a table summarizing the distribution of the 110 environments and the 32,112 tuples by application category.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of quantitative validation for the reward generation pipeline and the need for ablations. We address each major comment below with honest assessments and commit to revisions that strengthen the manuscript without overstating current results.

read point-by-point responses
  1. Referee: [abstract and pipeline description] Abstract and pipeline description: the central performance claims (62.1%/72.6% on OSWorld-Verified, scaling, WebArena transfer) rest on the assertion that the Generator-Discriminator-Orchestrator loop plus LLM-voting/rollout filter produces deterministic, unbiased rewards; however, no quantitative metrics are supplied on filter error rates, inter-voter agreement, or failure modes, nor is there comparison against human-verified ground truth or formal specifications. This directly affects the verifiability premise required for the RLVR results.

    Authors: We agree that the absence of quantitative metrics on filter error rates, inter-voter agreement, failure modes, and human ground-truth comparisons weakens the verifiability claims. The manuscript describes the multi-agent loop and LLM majority-vote/rollout filter but provides no such statistics. In the revised manuscript we will add reported inter-voter agreement rates and rollout pass rates across the 32,112 tuples. A large-scale human verification study was not performed owing to cost; we will explicitly list this as a limitation while noting that open-sourcing the pipeline and dataset enables external validation. We maintain that the adversarial loop plus filter offers a practical scalable mechanism, but we accept the referee's point that formal specifications and error-rate quantification are currently missing. revision: partial

  2. Referee: [results and evaluation sections] Results and evaluation sections: no ablation studies isolate the contribution of the reward filter (or the number of environments) from other factors such as data volume or model scale; without these, it is difficult to attribute the observed gains specifically to the claimed deterministic rewards rather than potential artifacts in the LLM-generated reward functions.

    Authors: The referee is correct that no ablations isolate the reward filter or environment count from data volume and model scale. The reported scaling curves and WebArena transfer are consistent with the value of verified data, yet they do not directly demonstrate the filter's incremental contribution. In revision we will add limited ablations on a data subset comparing filtered versus unfiltered training runs and performance as a function of environment diversity via subsampling. Full isolation remains challenging because the filter is integral to the generation process; we will present these new experiments while acknowledging that complete separation of all factors was not feasible in the original study. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical data generation and benchmark evaluation are self-contained

full rationale

The paper describes an empirical pipeline that generates training tuples via LLM-based Generator, Discriminator, and Orchestrator agents followed by voting/rollout filters, then applies standard GSPO training and reports results on OSWorld-Verified and WebArena. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claims (62.1%/72.6% scores and scaling behavior) are obtained directly from new data and held-out evaluation rather than reducing to any input by construction. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper on data synthesis and model training. No mathematical free parameters, axioms, or invented entities are introduced or required for the central claim.

pith-pipeline@v0.9.1-grok · 5926 in / 1380 out tokens · 33540 ms · 2026-06-29T21:38:16.619909+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

    cs.CL 2026-06 conditional novelty 7.0

    ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.

  2. OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

    cs.AI 2026-06 unverdicted novelty 6.0

    OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

  3. PhoneBuddy: Training Open Models for Agentic Phone Use

    cs.CL 2026-06 unverdicted novelty 6.0

    PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

Reference graph

Works this paper leans on

80 extracted references · 26 canonical work pages · cited by 3 Pith papers · 8 internal anchors

  1. [1]

    Gym-Anything: Turn any Software into an Agent Environment

    Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment, 2026. URLhttps://arxiv.org/abs/2604.06126

  2. [2]

    Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

    Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use, 2024

  3. [3]

    Anthropic economic index.https://www.anthropic.com/economic-index, 2025

    Anthropic. Anthropic economic index.https://www.anthropic.com/economic-index, 2025

  4. [4]

    Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

    Anthropic. Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

  5. [5]

    Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026

    Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026. URLhttps://arxiv.org/abs/2602.14093

  6. [6]

    Swe-universe: Scale real-world verifiable environments to millions.CoRR, abs/2602.02361, 2026

    Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, Hao Ge, Zongmeng Zhang, Zeyu Cui, Dayiheng Liu, Jingren Zhou, Jianling Sun, Junyang Lin, and Binyuan Hui. Swe-universe: Scale real-world verifiable environments to millions, 2026. URLhttps://arxiv.org/abs/2602.02361

  7. [7]

    Goodman, and Dimitris Papailiopoulos

    Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents, 2026. URLhttps://arxiv.org/abs/2601.16443

  8. [8]

    Training long-context, multi-turn software engineering agents with reinforcement learning, 2025

    Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, and Boris Yangel. Training long-context, multi-turn software engineering agents with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2508.03501

  9. [9]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  10. [10]

    R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025

    Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025. URL https://arxiv.org/abs/2504.07164

  11. [11]

    Swe-next: Scalable real-world software engineering tasks for agents.CoRR, abs/2603.20691, 2026

    Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, and Wenhu Chen. Swe-next: Scalable real-world software engineering tasks for agents, 2026. URL https://arxiv.org/ abs/2603.20691. 11

  12. [12]

    VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025

    Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/ abs/2510.19488

  13. [13]

    Training Software Engineering Agents and Verifiers with SWE-Gym

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https://arxiv.org/ abs/2412.21139

  14. [14]

    On data engineering for scaling llm terminal capabilities, 2026

    Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, and Wei Ping. On data engineering for scaling llm terminal capabilities, 2026. URLhttps://arxiv.org/abs/2602.21193

  15. [15]

    Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

    Qwen Team. Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

  16. [16]

    Scaling synthetic task generation for agents via exploration,

    Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration,

  17. [17]

    URLhttps://arxiv.org/abs/2509.25047

  18. [18]

    SGLang: Efficient execution of structured language model programs

    SGLang Project. SGLang: Efficient execution of structured language model programs. https: //github.com/sgl-project/sglang, 2024

  19. [19]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

  20. [20]

    Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025

    Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025. URL https://arxiv.org/abs/2508.04700

  21. [21]

    Department of Labor, Employment and Training Administration

    U.S. Department of Labor, Employment and Training Administration. O*NET OnLine. https: //www.onetonline.org/, 2026. Accessed: 2026-05-06

  22. [22]

    verl: Volcano engine reinforcement learning for LLMs

    verl Project. verl: Volcano engine reinforcement learning for LLMs. https://github.com/ verl-project/verl, 2024

  23. [23]

    Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents

    Bowen Wang, Xinyuan Wang, Jiaqi Deng, Tianbao Xie, Ryan Li, Yanzhe Zhang, Junli Wang, Dunjie Lu, Zicheng Gong, Gavin Li, Toh Jing Hua, Wei-Lin Chiang, Ion Stoica, Diyi Yang, Yu Su, Yi Zhang, Zhiguo Wang, Victor Zhong, and Tao Yu. Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents. InInternational Conference on Learni...

  24. [24]

    URLhttps://openreview.net/forum?id=3x4SDbXbgl

  25. [25]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu...

  26. [26]

    Charles, Zhilin Yang, and Tao Yu

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu,...

  27. [27]

    WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025

    WebArena Team. WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025. 12

  28. [28]

    Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

    Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

  29. [29]

    Scaling computer-use grounding via user interface decomposition and synthesis, 2025

    Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis, 2025. URL https://arxiv.org/abs/2505.13227

  30. [30]

    AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2024

    Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URL https://arxiv.org/abs/2412.09605

  31. [31]

    Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URL https://arxiv.org/abs/2412.04454

  32. [32]

    arXiv preprint arXiv:2601.15876 , year=

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, and Xipeng Qiu. Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026. URLhttps://arxiv.org/abs/2601.15876

  33. [33]

    Zerogui: Automating online gui learning at zero human cost, 2025

    Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost, 2025. URLhttps://arxiv.org/abs/2505.23762

  34. [34]

    SWE-smith: Scaling Data for Software Engineering Agents

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

  35. [35]

    UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

    Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, and Zhe Gan. UltraCUA: A foundation model for computer use agents with hybrid action, 2025. URL https://arxiv.org/abs/ 2510.17790

  36. [36]

    Infiniteweb: Scalable web environment synthesis for gui agent training, 2026

    Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training, 2026. URL https://arxiv.org/abs/ 2601.04126

  37. [37]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URLhttps://arxiv.org/abs/2507.18071. Appendix Contents 13 A Data Synthesis Pipeline Details A.1 Pipeline Pseudocode Algorithm 1 summarizes the end-to-end co-generation ...

  38. [38]

    Direct Boolean assignmentto a verification flag without computation, e.g., chart_verified = True. 16 Resource Generator Discriminator Task instructiont, contextc✓ ✓ Domain skill fileS dom ✓ ✓ initial_setup.py(Generator output) writedenied golden_patch.py(Generator output) writedenied Generator working directory fulldenied Vinit post-setup, via state-only ...

  39. [39]

    Placeholder verification: a flag is assigned a constant before being conditionally added to the score, with no intervening evaluation

  40. [40]

    Hardcoded success: a function returns a constant in {0.5, 1.0} along the success path with no inspection of the environment

  41. [41]

    5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed

    Bare existence scoring: a positive score is awarded purely on os.path.exists(...) without any property check on the file’s contents. 5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed

  42. [42]

    bitter lessons

    Comment-only verification: a score increment is preceded by a comment asserting a check (# assume X is correct) without code performing the check. The list is enforced by a combination of regex matching and Python AST traversal; full source patterns are released with the code. A.3.4 Loop Termination, Max Rounds, and Feedback Protocol The loop is capped at...

  43. [43]

    =SUM(A1:A10)

    Formula values are NOT computed by openpyxl. cell.value returns the formula string "=SUM(A1:A10)", not the result. Use data_only=True to get the last-cached value (requires the file to have been opened in Calc/Excel at least once)

  44. [44]

    4472C4") silently becomes

    Always use 8-char ARGB for colors. PatternFill(start_color= "4472C4") silently becomes "004472C4" (alpha=00, transparent). Write "FF4472C4". On read, compare against the 8-char form

  45. [45]

    cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want

    fgColor is the visible background, not bgColor. cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want

  46. [46]

    After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None

    Merged cells: only the top-left has data. After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None. Style the top-left cell only

  47. [47]

    Never recreate from scratch if an initial file exists

    Copy-then-modify for golden files. Never recreate from scratch if an initial file exists. shutil.copy(initial, golden) then open and modify. This preserves metadata, print settings, and other invisible properties

  48. [48]

    Use a template file with the pivot already built

    Pivot tables cannot be created by openpyxl; only read/preserved. Use a template file with the pivot already built

  49. [49]

    ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden

    Filters are definition-only. ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden. Filtering only takes effect when the file is opened in LibreOffice

  50. [50]

    In DataValidation, this boolean is inverted

    showDropDown=False means SHOW the dropdown. In DataValidation, this boolean is inverted. Set False to display the dropdown arrow

  51. [51]

    col") is a vertical column chart; BarChart(type=

    Chart type naming. BarChart(type="col") is a vertical column chart; BarChart(type="bar") is horizontal. Does not match LibreOffice's menu names

  52. [52]

    You cannot do cell.font.bold = True

    Styles are immutable after assignment. You cannot do cell.font.bold = True. Create a new Font object: cell.font = Font(bold=True, ...). Same for Fill, Alignment, Border

  53. [53]

    If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index

    Theme colors may return None for .rgb. If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index. Wrap color reads in try/except

  54. [54]

    action": <

    data_only=True loses formulas. Loading with data_only=True replaces formulas with their cached values. If you need both, load the file twice -- once normally, once with data_only=True. The bitter-lessons section is the most distinctive feature of the SKILL.md format. Each entry is born from a specific debugging episode in early development, and each is th...

  55. [55]

    Action: a short imperative describing what to do in the UI

  56. [56]

    •C2 (0.30):B3:B6all contain TEXT formulas with the same format string. •C3 (0.30): every formula references the corresponding A-column cell. 33 calc_afm_082/initial_setup.py

    A single <tool_call>...</tool_call> block. Rules: - For non-terminal UI steps, output exactly: Action then <tool_call>. - Be brief: one sentence for Action. - Do not output anything after a tool call. - Use call_user when you need user information or confirmation. - Use terminate when you want to explicitly end the task with a success or failure status. T...

  57. [57]

    Prepare the working environment

  58. [58]

    Spawn the setup-generator and reward-generator processes in an adversarial loop

  59. [59]

    Monitor agreement conditions by reading REVIEW.md

  60. [60]

    Create

    Collect and output final verified results. # Execution Mode (MANDATORY) Dual-environment only: - For each task, provision two isolated VMs: initial_env, golden_env. - State separation is guaranteed by environment isolation, not by filename suffixes. - Single-environment execution is deprecated. 74 # Role Boundaries YOU DO: - Parse input and prepare workin...

  61. [61]

    Initial state -- what must exist before the agent starts: 76 files and their content/location, application state, data characteristics (rows, columns, value ranges)

  62. [62]

    cell B15 should contain 342.50

    Ground truth -- expected correct outcome: exact values, states, file contents; partial-credit checkpoints; concrete numbers where applicable ("cell B15 should contain 342.50")

  63. [63]

    GIMP canvas should be 1920x1080 with a white background layer

    Implicit prerequisites -- things the instruction does not say but the evaluator needs: e.g., "GIMP canvas should be 1920x1080 with a white background layer"; "Firefox should have 3 tabs open: Google, Wikipedia, GitHub". Context is CRITICAL. Without it, we cannot set up the VM, evaluate success, or assign partial credit. difficulty (string in {easy, medium...

  64. [64]

    initial_setup.py -- creates the pre-task state on initial_env

  65. [65]

    clean up

    golden_patch.py -- builds the expected post-task artifact on golden_env You work within an adversarial loop with the verifier. After each round, the verifier writes REVIEW.md with structured feedback. If the review fails, the pipeline invokes you again. Your goal: produce outputs the verifier agrees are correct. # Execution Mode (MANDATORY) Dual-environme...

  66. [66]

    Generate a reward script (reward.py) that programmatically verifies task completion

  67. [67]

    Test it against the golden_env artifact (must return 1.0) AND the initial_env artifact (must return 0.0)

  68. [68]

    Be thorough but fair

    Write a structured REVIEW.md with your verdict and feedback. Be thorough but fair. Report PASS when the golden artifact genuinely matches the task; do not be adversarial for its own sake. # INFORMATION BARRIER (NON-NEGOTIABLE) - You MUST NOT read initial_setup.py, golden_patch.py, or any setup-generator source code. - You MUST NOT read files from the setu...

  69. [69]

    Return a progressive float in [0, 1]

  70. [70]

    Use ACTUAL verification (read real files, check real data)

  71. [71]

    Award partial credit (0.3, 0.5, 0.7, ...) for partial completion

  72. [72]

    Return exactly 1.0 only when 100% completed

  73. [73]

    Wrap each scoring component in try/except

  74. [74]

    REWARD: X.X

    Print "REWARD: X.X" as the last output line

  75. [75]

    Be self-contained with stdlib + openpyxl/python-pptx/python-docx

  76. [76]

    # Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env

    Comment scoring logic. # Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env. Properties true on both endpoints are preconditions and MUST NOT contribute to score. Litmus test: if a check passes on the initial_env artifact (before any agent action), it is NOT measuring task compl...

  77. [77]

    Is the task suitable and safe for training?

  78. [78]

    Is the reward semantically valid and robust?

  79. [79]

    Is the query self-contained, natural enough, and consistent with setup/golden_setup/reward?

  80. [80]

    Each agent’s role specification is excerpted below; full prompts are released with the code

    Can the query be fixed without changing task semantics? Fatal reasons to reject: - reward checks artifacts or implementation traces more than user-goal completion - reward is highly sensitive to GUI/window/process state - reward relies on fragile internals, heuristic perception, or unstable extraction - hidden assumptions about browser state, plugins, ext...