CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang; Dayiheng Liu; Dunjie Lu; HaiQuan Wang; Hao Hu; Junli Wang; Junyang Lin; Que Shen; Shixuan Liu; Shuai Bai

arxiv: 2605.25624 · v2 · pith:JCLACOT4new · submitted 2026-05-25 · 💻 cs.AI · cs.LG

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang , Dunjie Lu , Junli Wang , Tianyi Bai , Shixuan Liu , Zhipeng Zhang , Haiquan Wang , Hao Hu

show 6 more authors

Tianbao Xie Shuai Bai Dayiheng Liu Que Shen Junyang Lin Tao Yu

This is my paper

Pith reviewed 2026-06-29 21:38 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords computer-use agentsreinforcement learning with verifiable rewardsCUA-GymOSWorldWebArenaagent training environmentsscalable data synthesisverifiable rewards

0 comments

The pith

CUA-Gym uses generator-discriminator-orchestrator loops and filters to produce 32,112 verified RLVR training tuples across 110 environments for computer-use agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to overcome the lack of scalable, deterministic-reward data for training computer-use agents with reinforcement learning. It introduces a pipeline in which separate agents generate task instructions and environment states, write reward functions, and iterate until consistency is reached, followed by a voting-and-rollout quality filter. The resulting dataset, built on top of a new collection of mock web applications, lets models trained with GSPO reach 62.1 percent and 72.6 percent on OSWorld-Verified while also improving on the held-out WebArena benchmark. Performance improves steadily as the amount of data and the number of environments increase.

Core claim

CUA-Gym co-generates task instructions, initial and golden environment states, and reward functions through an iterative Generator-Discriminator-Orchestrator loop; the resulting tuples are further vetted by LLM majority voting plus agent rollouts. This process yields 32,112 high-fidelity RLVR training examples grounded in 110 environments from the synthesized CUA-Gym-Hub suite. Models trained on the data with GSPO outperform prior open-source computer-use agents at similar scales on OSWorld-Verified and show positive transfer to WebArena.

What carries the argument

The Generator-Discriminator-Orchestrator iterative loop plus final LLM-voting and rollout filter that together produce deterministic, verifiable task-environment-reward tuples.

If this is right

Model performance on OSWorld-Verified scales smoothly with both the volume of training tuples and the diversity of environments.
Checkpoints trained on CUA-Gym also improve accuracy on the held-out WebArena benchmark.
CUA-Gym-Hub expands the pool of high-fidelity mock web environments by an order of magnitude over prior collections.
The full pipeline, dataset, environments, and trained models are released for further use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same co-generation loop could be applied to other agent domains that currently lack verifiable rewards, such as desktop or mobile interfaces.
If the LLM-based filter introduces subtle biases, increasing model scale during data synthesis might amplify rather than reduce those biases.
Adding environments whose action distributions more closely match real user logs could further strengthen transfer to production settings.

Load-bearing premise

The language models inside the generator, discriminator, and filter produce reward functions that remain deterministic and free of systematic biases across repeated executions.

What would settle it

Running the same generated reward function on identical agent trajectories multiple times and observing inconsistent pass/fail outcomes.

Figures

Figures reproduced from arXiv: 2605.25624 by Bowen Wang, Dayiheng Liu, Dunjie Lu, HaiQuan Wang, Hao Hu, Junli Wang, Junyang Lin, Que Shen, Shixuan Liu, Shuai Bai, Tao Yu, Tianbao Xie, Tianyi Bai, Zhipeng Zhang.

**Figure 1.** Figure 1: Overview of the CUA-GYM data synthesis pipeline. From a task instruction and its grounded context, the Orchestrator provisions paired virtual machines and spawns two adversarially coupled agents. The Generator writes initial_setup.py and golden_patch.py to construct the initial and golden states. The Discriminator, isolated from the Generator’s scripts by an information barrier, independently drafts reward… view at source ↗

**Figure 2.** Figure 2: Multi-agent pipeline for mock environment synthesis. From a target application seed, the Plan Agent drafts DESIGN.md and TODO.md specifying features, data schema, API protocol, and UI layout. The Dev Agent implements the single-page application from these specifications. The Web Agent then exercises every interactive element via Playwright, comparing the live DOM against the spec and feeding discrepancies … view at source ↗

**Figure 3.** Figure 3: State-injected environments in CUA-GYM-HUB. The same mail mock can be instantiated with distinct task-specific initial states: an empty inbox, a project-deadline state with pending coordination threads, and a high-volume backlog. Because all mutations and resets are scoped to a session state, parallel rollout workers can share the same application implementation while observing isolated world states. 3 Exp… view at source ↗

**Figure 4.** Figure 4: Training scaffold for long-horizon CUA trajectories. Instead of training on a single sliding window that drops older interaction state, trajectory slicing keeps recent multimodal observations active while collapsing earlier screenshots into deterministic placeholders. This preserves late-trajectory supervision under a fixed context budget and exposes stable prefixes that are friendly to KV-cache reuse duri… view at source ↗

**Figure 5.** Figure 5: Main results. (a) Per-domain success rates on OSWorld-Verified for CUA-GYM-A3B (red) versus its SFT-initialized base (gray, dashed); n denotes task count per domain. Largest gains on multiapp workflows (+21.5 pp), libreoffice_calc (+14.9 pp), and vs_code (+13.6 pp). (b) Success rates on OSWorld-Verified and WebArena. RLVR on CUA-GYM data improves both Qwen3.5-35B-A3B and Qwen3.5-397B-A17B bases, with CUA-… view at source ↗

**Figure 6.** Figure 6: CUA-GYM data and environments. (a) CUA-GYM spans 110 environments, including CUAGYM-HUB web environments aligned with O*NET occupational categories [20]. (b) The 32,112 verified tuples cover application categories evenly (no category exceeds 21%) and are skewed toward harder (45%) and cross-application (38%) tasks. (c) Compared to existing RLVR datasets, CUA-GYM is the largest open-released collection and… view at source ↗

**Figure 7.** Figure 7: Data scaling of RL training on CUA-GYM. OSWorld-Verified score (left) and training reward (right) vs. RL training step on Qwen3.5-35B-A3B for three CUA-GYM training subsets (1.4K, 3K, 12K verified tuples), all initialized from the same SFT checkpoint (gray dashed line at 0.53). Faint lines show raw per-step values; bold lines are exponentially smoothed. and the overall shape of the training curve, indicati… view at source ↗

**Figure 8.** Figure 8: Environment scaling under teacher distillation. OSWorld-Verified score versus number of training environments at 100-step evaluation budgets. The broad setting (80 envs, 75 trajectories each) outperforms the narrow setting (10 envs, 300 each) despite using 4× fewer trajectories per environment. The data-scaling results of §4.1 establish trajectory volume as one axis along which CUA-GYM data drives RL per… view at source ↗

**Figure 9.** Figure 9: Emergent action batching during RL. Average tool calls per model step on CUA-GYMA3B. The SFT-initialized policy emits approximately one call per step; RL drives this to a stable 1.4–1.9 band. A behavior we did not optimize for emerges in CUAGYM RL training: the policy spontaneously learns to issue multiple tool calls per turn, compressing what begin as one-action-per-turn rollouts into denser multiact… view at source ↗

**Figure 10.** Figure 10: Landing-page screenshots of 32 representative released mock web applications, organized by thematic pair-of-rows: communication and meetings (rows 1–2), productivity and collaboration (rows 3–4), engineering and developer tooling (rows 5–6), and commerce / cloud / social (rows 7–8). Each mock reproduces the surface visual layout, navigation tree, and primary-feature inventory of its named reference applic… view at source ↗

read the original abstract

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications and LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CUA-Gym delivers a working pipeline for generating 32k RLVR tuples for computer-use agents via agent co-generation and filtering, with reported gains on OSWorld and WebArena transfer.

read the letter

The main takeaway is that this paper gives a concrete way to expand training data for RL with verifiable rewards in the computer-use domain. They use separate Generator and Discriminator agents plus an orchestrator to create task specs, initial/golden states, and reward functions, then apply LLM voting and rollout filters. They also build CUA-Gym-Hub, a collection of mock web apps drawn from real software distributions, and release 32k tuples across 110 environments.

The results section shows their 3B and 17B models reaching 62.1% and 72.6% on OSWorld-Verified, beating prior open-source CUAs at similar sizes, plus gains on held-out WebArena. Performance improves smoothly with more data and more environments. That scaling behavior and the transfer result are the parts that look most useful.

The soft spot is the reward construction itself. The discriminators are LLMs, and the only safeguards described are the iterative loop, majority vote, and rollout filter. No numbers on residual error rates or bias checks appear in the abstract, so it remains possible that some of the measured improvement comes from artifacts in the generated rewards rather than genuine agent capability. The paper would be stronger with an ablation on the filter or an external verification step.

This is aimed at groups working on agent RL and data synthesis for tool-using models. The released pipeline and environments make it worth looking at even if the reward quality needs more validation. I would send it to referees because the scale and the benchmark numbers are specific enough to test.

Referee Report

2 major / 2 minor

Summary. The paper introduces CUA-Gym, a pipeline that uses a Generator-Discriminator-Orchestrator loop to co-generate task instructions, environment states, and reward functions for computer-use agents, followed by an LLM majority-vote plus rollout filter to produce 32,112 verified RLVR training tuples across 110 environments (including the synthesized CUA-Gym-Hub mock web apps). Models trained via GSPO on this data (CUA-Gym-A3B and CUA-Gym-A17B) reach 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at similar scales, with claimed smooth scaling in data volume and environment diversity plus positive transfer to the held-out WebArena benchmark. The full pipeline, dataset, environments, and models are to be open-sourced.

Significance. If the generated rewards are shown to be deterministic and free of systematic LLM-induced biases, the work would meaningfully advance RLVR for CUAs by removing the data bottleneck that has limited prior efforts. The explicit open-sourcing of the synthesis pipeline, the 32k-tuple dataset, the CUA-Gym-Hub environments, and the trained models is a concrete strength that supports reproducibility. The reported smooth scaling trends and cross-benchmark transfer, if robust, would provide useful empirical evidence for the value of environment diversity in this domain.

major comments (2)

[abstract and pipeline description] Abstract and pipeline description: the central performance claims (62.1%/72.6% on OSWorld-Verified, scaling, WebArena transfer) rest on the assertion that the Generator-Discriminator-Orchestrator loop plus LLM-voting/rollout filter produces deterministic, unbiased rewards; however, no quantitative metrics are supplied on filter error rates, inter-voter agreement, or failure modes, nor is there comparison against human-verified ground truth or formal specifications. This directly affects the verifiability premise required for the RLVR results.
[results and evaluation sections] Results and evaluation sections: no ablation studies isolate the contribution of the reward filter (or the number of environments) from other factors such as data volume or model scale; without these, it is difficult to attribute the observed gains specifically to the claimed deterministic rewards rather than potential artifacts in the LLM-generated reward functions.

minor comments (2)

[methods] Clarify the exact prompting templates and temperature settings used for the Discriminator and voting stages to improve reproducibility.
[dataset construction] Add a table summarizing the distribution of the 110 environments and the 32,112 tuples by application category.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of quantitative validation for the reward generation pipeline and the need for ablations. We address each major comment below with honest assessments and commit to revisions that strengthen the manuscript without overstating current results.

read point-by-point responses

Referee: [abstract and pipeline description] Abstract and pipeline description: the central performance claims (62.1%/72.6% on OSWorld-Verified, scaling, WebArena transfer) rest on the assertion that the Generator-Discriminator-Orchestrator loop plus LLM-voting/rollout filter produces deterministic, unbiased rewards; however, no quantitative metrics are supplied on filter error rates, inter-voter agreement, or failure modes, nor is there comparison against human-verified ground truth or formal specifications. This directly affects the verifiability premise required for the RLVR results.

Authors: We agree that the absence of quantitative metrics on filter error rates, inter-voter agreement, failure modes, and human ground-truth comparisons weakens the verifiability claims. The manuscript describes the multi-agent loop and LLM majority-vote/rollout filter but provides no such statistics. In the revised manuscript we will add reported inter-voter agreement rates and rollout pass rates across the 32,112 tuples. A large-scale human verification study was not performed owing to cost; we will explicitly list this as a limitation while noting that open-sourcing the pipeline and dataset enables external validation. We maintain that the adversarial loop plus filter offers a practical scalable mechanism, but we accept the referee's point that formal specifications and error-rate quantification are currently missing. revision: partial
Referee: [results and evaluation sections] Results and evaluation sections: no ablation studies isolate the contribution of the reward filter (or the number of environments) from other factors such as data volume or model scale; without these, it is difficult to attribute the observed gains specifically to the claimed deterministic rewards rather than potential artifacts in the LLM-generated reward functions.

Authors: The referee is correct that no ablations isolate the reward filter or environment count from data volume and model scale. The reported scaling curves and WebArena transfer are consistent with the value of verified data, yet they do not directly demonstrate the filter's incremental contribution. In revision we will add limited ablations on a data subset comparing filtered versus unfiltered training runs and performance as a function of environment diversity via subsampling. Full isolation remains challenging because the filter is integral to the generation process; we will present these new experiments while acknowledging that complete separation of all factors was not feasible in the original study. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical data generation and benchmark evaluation are self-contained

full rationale

The paper describes an empirical pipeline that generates training tuples via LLM-based Generator, Discriminator, and Orchestrator agents followed by voting/rollout filters, then applies standard GSPO training and reports results on OSWorld-Verified and WebArena. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central performance claims (62.1%/72.6% scores and scaling behavior) are obtained directly from new data and held-out evaluation rather than reducing to any input by construction. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper on data synthesis and model training. No mathematical free parameters, axioms, or invented entities are introduced or required for the central claim.

pith-pipeline@v0.9.1-grok · 5926 in / 1380 out tokens · 33540 ms · 2026-06-29T21:38:16.619909+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
cs.CL 2026-06 conditional novelty 7.0

ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
cs.AI 2026-06 unverdicted novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
PhoneBuddy: Training Open Models for Agentic Phone Use
cs.CL 2026-06 unverdicted novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

Reference graph

Works this paper leans on

80 extracted references · 26 canonical work pages · cited by 3 Pith papers · 8 internal anchors

[1]

Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment, 2026. URLhttps://arxiv.org/abs/2604.06126

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use, 2024

2024
[3]

Anthropic economic index.https://www.anthropic.com/economic-index, 2025

Anthropic. Anthropic economic index.https://www.anthropic.com/economic-index, 2025

2025
[4]

Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

Anthropic. Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

2026
[5]

Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026

Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026. URLhttps://arxiv.org/abs/2602.14093

work page arXiv 2026
[6]

Swe-universe: Scale real-world verifiable environments to millions.CoRR, abs/2602.02361, 2026

Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, Hao Ge, Zongmeng Zhang, Zeyu Cui, Dayiheng Liu, Jingren Zhou, Jianling Sun, Junyang Lin, and Binyuan Hui. Swe-universe: Scale real-world verifiable environments to millions, 2026. URLhttps://arxiv.org/abs/2602.02361

work page arXiv 2026
[7]

Goodman, and Dimitris Papailiopoulos

Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents, 2026. URLhttps://arxiv.org/abs/2601.16443

work page arXiv 2026
[8]

Training long-context, multi-turn software engineering agents with reinforcement learning, 2025

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, and Boris Yangel. Training long-context, multi-turn software engineering agents with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2508.03501

work page arXiv 2025
[9]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[10]

R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025. URL https://arxiv.org/abs/2504.07164

work page arXiv 2025
[11]

Swe-next: Scalable real-world software engineering tasks for agents.CoRR, abs/2603.20691, 2026

Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, and Wenhu Chen. Swe-next: Scalable real-world software engineering tasks for agents, 2026. URL https://arxiv.org/ abs/2603.20691. 11

work page arXiv 2026
[12]

VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/ abs/2510.19488

work page arXiv 2025
[13]

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https://arxiv.org/ abs/2412.21139

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

On data engineering for scaling llm terminal capabilities, 2026

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, and Wei Ping. On data engineering for scaling llm terminal capabilities, 2026. URLhttps://arxiv.org/abs/2602.21193

work page arXiv 2026
[15]

Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

Qwen Team. Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

2026
[16]

Scaling synthetic task generation for agents via exploration,

Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration,
[17]

URLhttps://arxiv.org/abs/2509.25047

work page arXiv
[18]

SGLang: Efficient execution of structured language model programs

SGLang Project. SGLang: Efficient execution of structured language model programs. https: //github.com/sgl-project/sglang, 2024

2024
[19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025. URL https://arxiv.org/abs/2508.04700

work page arXiv 2025
[21]

Department of Labor, Employment and Training Administration

U.S. Department of Labor, Employment and Training Administration. O*NET OnLine. https: //www.onetonline.org/, 2026. Accessed: 2026-05-06

2026
[22]

verl: Volcano engine reinforcement learning for LLMs

verl Project. verl: Volcano engine reinforcement learning for LLMs. https://github.com/ verl-project/verl, 2024

2024
[23]

Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents

Bowen Wang, Xinyuan Wang, Jiaqi Deng, Tianbao Xie, Ryan Li, Yanzhe Zhang, Junli Wang, Dunjie Lu, Zicheng Gong, Gavin Li, Toh Jing Hua, Wei-Lin Chiang, Ion Stoica, Diyi Yang, Yu Su, Yi Zhang, Zhiguo Wang, Victor Zhong, and Tao Yu. Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents. InInternational Conference on Learni...
[24]

URLhttps://openreview.net/forum?id=3x4SDbXbgl
[25]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Charles, Zhilin Yang, and Tao Yu

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu,...

work page arXiv 2025
[27]

WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025

WebArena Team. WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025. 12

2025
[28]

Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

work page arXiv 2026
[29]

Scaling computer-use grounding via user interface decomposition and synthesis, 2025

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis, 2025. URL https://arxiv.org/abs/2505.13227

work page arXiv 2025
[30]

AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2024

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URL https://arxiv.org/abs/2412.09605

work page arXiv 2025
[31]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URL https://arxiv.org/abs/2412.04454

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

arXiv preprint arXiv:2601.15876 , year=

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, and Xipeng Qiu. Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026. URLhttps://arxiv.org/abs/2601.15876

work page arXiv 2026
[33]

Zerogui: Automating online gui learning at zero human cost, 2025

Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost, 2025. URLhttps://arxiv.org/abs/2505.23762

work page arXiv 2025
[34]

SWE-smith: Scaling Data for Software Engineering Agents

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, and Zhe Gan. UltraCUA: A foundation model for computer use agents with hybrid action, 2025. URL https://arxiv.org/abs/ 2510.17790

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Infiniteweb: Scalable web environment synthesis for gui agent training, 2026

Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training, 2026. URL https://arxiv.org/abs/ 2601.04126

work page arXiv 2026
[37]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URLhttps://arxiv.org/abs/2507.18071. Appendix Contents 13 A Data Synthesis Pipeline Details A.1 Pipeline Pseudocode Algorithm 1 summarizes the end-to-end co-generation ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Direct Boolean assignmentto a verification flag without computation, e.g., chart_verified = True. 16 Resource Generator Discriminator Task instructiont, contextc✓ ✓ Domain skill fileS dom ✓ ✓ initial_setup.py(Generator output) writedenied golden_patch.py(Generator output) writedenied Generator working directory fulldenied Vinit post-setup, via state-only ...
[39]

Placeholder verification: a flag is assigned a constant before being conditionally added to the score, with no intervening evaluation
[40]

Hardcoded success: a function returns a constant in {0.5, 1.0} along the success path with no inspection of the environment
[41]

5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed

Bare existence scoring: a positive score is awarded purely on os.path.exists(...) without any property check on the file’s contents. 5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed
[42]

bitter lessons

Comment-only verification: a score increment is preceded by a comment asserting a check (# assume X is correct) without code performing the check. The list is enforced by a combination of regex matching and Python AST traversal; full source patterns are released with the code. A.3.4 Loop Termination, Max Rounds, and Feedback Protocol The loop is capped at...
[43]

=SUM(A1:A10)

Formula values are NOT computed by openpyxl. cell.value returns the formula string "=SUM(A1:A10)", not the result. Use data_only=True to get the last-cached value (requires the file to have been opened in Calc/Excel at least once)
[44]

4472C4") silently becomes

Always use 8-char ARGB for colors. PatternFill(start_color= "4472C4") silently becomes "004472C4" (alpha=00, transparent). Write "FF4472C4". On read, compare against the 8-char form
[45]

cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want

fgColor is the visible background, not bgColor. cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want
[46]

After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None

Merged cells: only the top-left has data. After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None. Style the top-left cell only
[47]

Never recreate from scratch if an initial file exists

Copy-then-modify for golden files. Never recreate from scratch if an initial file exists. shutil.copy(initial, golden) then open and modify. This preserves metadata, print settings, and other invisible properties
[48]

Use a template file with the pivot already built

Pivot tables cannot be created by openpyxl; only read/preserved. Use a template file with the pivot already built
[49]

ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden

Filters are definition-only. ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden. Filtering only takes effect when the file is opened in LibreOffice
[50]

In DataValidation, this boolean is inverted

showDropDown=False means SHOW the dropdown. In DataValidation, this boolean is inverted. Set False to display the dropdown arrow
[51]

col") is a vertical column chart; BarChart(type=

Chart type naming. BarChart(type="col") is a vertical column chart; BarChart(type="bar") is horizontal. Does not match LibreOffice's menu names
[52]

You cannot do cell.font.bold = True

Styles are immutable after assignment. You cannot do cell.font.bold = True. Create a new Font object: cell.font = Font(bold=True, ...). Same for Fill, Alignment, Border
[53]

If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index

Theme colors may return None for .rgb. If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index. Wrap color reads in try/except
[54]

action": <

data_only=True loses formulas. Loading with data_only=True replaces formulas with their cached values. If you need both, load the file twice -- once normally, once with data_only=True. The bitter-lessons section is the most distinctive feature of the SKILL.md format. Each entry is born from a specific debugging episode in early development, and each is th...

2048
[55]

Action: a short imperative describing what to do in the UI
[56]

•C2 (0.30):B3:B6all contain TEXT formulas with the same format string. •C3 (0.30): every formula references the corresponding A-column cell. 33 calc_afm_082/initial_setup.py

A single <tool_call>...</tool_call> block. Rules: - For non-terminal UI steps, output exactly: Action then <tool_call>. - Be brief: one sentence for Action. - Do not output anything after a tool call. - Use call_user when you need user information or confirmation. - Use terminate when you want to explicitly end the task with a success or failure status. T...

2048
[57]

Prepare the working environment
[58]

Spawn the setup-generator and reward-generator processes in an adversarial loop
[59]

Monitor agreement conditions by reading REVIEW.md
[60]

Create

Collect and output final verified results. # Execution Mode (MANDATORY) Dual-environment only: - For each task, provision two isolated VMs: initial_env, golden_env. - State separation is guaranteed by environment isolation, not by filename suffixes. - Single-environment execution is deprecated. 74 # Role Boundaries YOU DO: - Parse input and prepare workin...
[61]

Initial state -- what must exist before the agent starts: 76 files and their content/location, application state, data characteristics (rows, columns, value ranges)
[62]

cell B15 should contain 342.50

Ground truth -- expected correct outcome: exact values, states, file contents; partial-credit checkpoints; concrete numbers where applicable ("cell B15 should contain 342.50")
[63]

GIMP canvas should be 1920x1080 with a white background layer

Implicit prerequisites -- things the instruction does not say but the evaluator needs: e.g., "GIMP canvas should be 1920x1080 with a white background layer"; "Firefox should have 3 tabs open: Google, Wikipedia, GitHub". Context is CRITICAL. Without it, we cannot set up the VM, evaluate success, or assign partial credit. difficulty (string in {easy, medium...
[64]

initial_setup.py -- creates the pre-task state on initial_env
[65]

clean up

golden_patch.py -- builds the expected post-task artifact on golden_env You work within an adversarial loop with the verifier. After each round, the verifier writes REVIEW.md with structured feedback. If the review fails, the pipeline invokes you again. Your goal: produce outputs the verifier agrees are correct. # Execution Mode (MANDATORY) Dual-environme...
[66]

Generate a reward script (reward.py) that programmatically verifies task completion
[67]

Test it against the golden_env artifact (must return 1.0) AND the initial_env artifact (must return 0.0)
[68]

Be thorough but fair

Write a structured REVIEW.md with your verdict and feedback. Be thorough but fair. Report PASS when the golden artifact genuinely matches the task; do not be adversarial for its own sake. # INFORMATION BARRIER (NON-NEGOTIABLE) - You MUST NOT read initial_setup.py, golden_patch.py, or any setup-generator source code. - You MUST NOT read files from the setu...
[69]

Return a progressive float in [0, 1]
[70]

Use ACTUAL verification (read real files, check real data)
[71]

Award partial credit (0.3, 0.5, 0.7, ...) for partial completion
[72]

Return exactly 1.0 only when 100% completed
[73]

Wrap each scoring component in try/except
[74]

REWARD: X.X

Print "REWARD: X.X" as the last output line
[75]

Be self-contained with stdlib + openpyxl/python-pptx/python-docx
[76]

# Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env

Comment scoring logic. # Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env. Properties true on both endpoints are preconditions and MUST NOT contribute to score. Litmus test: if a check passes on the initial_env artifact (before any agent action), it is NOT measuring task compl...
[77]

Is the task suitable and safe for training?
[78]

Is the reward semantically valid and robust?
[79]

Is the query self-contained, natural enough, and consistent with setup/golden_setup/reward?
[80]

Each agent’s role specification is excerpted below; full prompts are released with the code

Can the query be fixed without changing task semantics? Fatal reasons to reject: - reward checks artifacts or implementation traces more than user-goal completion - reward is highly sensitive to GUI/window/process state - reward relies on fragile internals, heuristic perception, or unstable extraction - hidden assumptions about browser state, plugins, ext...

[1] [1]

Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment, 2026. URLhttps://arxiv.org/abs/2604.06126

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku

Anthropic. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use, 2024

2024

[3] [3]

Anthropic economic index.https://www.anthropic.com/economic-index, 2025

Anthropic. Anthropic economic index.https://www.anthropic.com/economic-index, 2025

2025

[4] [4]

Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

Anthropic. Claude sonnet 4.6.https://www.anthropic.com/news/claude-sonnet-4-6, 2026

2026

[5] [5]

Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026

Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, Wei Yang, and Tao Xie. Gui-genesis: Automated synthesis of efficient environments with verifiable rewards for gui agent post-training, 2026. URLhttps://arxiv.org/abs/2602.14093

work page arXiv 2026

[6] [6]

Swe-universe: Scale real-world verifiable environments to millions.CoRR, abs/2602.02361, 2026

Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, Hao Ge, Zongmeng Zhang, Zeyu Cui, Dayiheng Liu, Jingren Zhou, Jianling Sun, Junyang Lin, and Binyuan Hui. Swe-universe: Scale real-world verifiable environments to millions, 2026. URLhttps://arxiv.org/abs/2602.02361

work page arXiv 2026

[7] [7]

Goodman, and Dimitris Papailiopoulos

Kanishk Gandhi, Shivam Garg, Noah D. Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents, 2026. URLhttps://arxiv.org/abs/2601.16443

work page arXiv 2026

[8] [8]

Training long-context, multi-turn software engineering agents with reinforcement learning, 2025

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, and Boris Yangel. Training long-context, multi-turn software engineering agents with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2508.03501

work page arXiv 2025

[9] [9]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025

[10] [10]

R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2e-gym: Procedural environments and hybrid verifiers for scaling open-weights swe agents, 2025. URL https://arxiv.org/abs/2504.07164

work page arXiv 2025

[11] [11]

Swe-next: Scalable real-world software engineering tasks for agents.CoRR, abs/2603.20691, 2026

Jiarong Liang, Zhiheng Lyu, Zijie Liu, Xiangchao Chen, Ping Nie, Kai Zou, and Wenhu Chen. Swe-next: Scalable real-world software engineering tasks for agents, 2026. URL https://arxiv.org/ abs/2603.20691. 11

work page arXiv 2026

[12] [12]

VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/ abs/2510.19488

work page arXiv 2025

[13] [13]

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2025. URL https://arxiv.org/ abs/2412.21139

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

On data engineering for scaling llm terminal capabilities, 2026

Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, and Wei Ping. On data engineering for scaling llm terminal capabilities, 2026. URLhttps://arxiv.org/abs/2602.21193

work page arXiv 2026

[15] [15]

Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

Qwen Team. Qwen3.5 collection.https://huggingface.co/collections/Qwen/qwen35, 2026

2026

[16] [16]

Scaling synthetic task generation for agents via exploration,

Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration,

[17] [17]

URLhttps://arxiv.org/abs/2509.25047

work page arXiv

[18] [18]

SGLang: Efficient execution of structured language model programs

SGLang Project. SGLang: Efficient execution of structured language model programs. https: //github.com/sgl-project/sglang, 2024

2024

[19] [19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathemati- cal reasoning in open language models, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience, 2025. URL https://arxiv.org/abs/2508.04700

work page arXiv 2025

[21] [21]

Department of Labor, Employment and Training Administration

U.S. Department of Labor, Employment and Training Administration. O*NET OnLine. https: //www.onetonline.org/, 2026. Accessed: 2026-05-06

2026

[22] [22]

verl: Volcano engine reinforcement learning for LLMs

verl Project. verl: Volcano engine reinforcement learning for LLMs. https://github.com/ verl-project/verl, 2024

2024

[23] [23]

Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents

Bowen Wang, Xinyuan Wang, Jiaqi Deng, Tianbao Xie, Ryan Li, Yanzhe Zhang, Junli Wang, Dunjie Lu, Zicheng Gong, Gavin Li, Toh Jing Hua, Wei-Lin Chiang, Ion Stoica, Diyi Yang, Yu Su, Yi Zhang, Zhiguo Wang, Victor Zhong, and Tao Yu. Computer agent arena: Toward human-centric evaluation and analysis of computer-use agents. InInternational Conference on Learni...

[24] [24]

URLhttps://openreview.net/forum?id=3x4SDbXbgl

[25] [25]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Charles, Zhilin Yang, and Tao Yu

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu,...

work page arXiv 2025

[27] [27]

WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025

WebArena Team. WebArena-infinity.https://webarena.dev/webarena-infinity/, 2025. 12

2025

[28] [28]

Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026

Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Tseng, Zhaoyang Yu, Liang Chen, Yuyao Zhai, Bang Liu, Chenglin Wu, and Yuyu Luo. Autowebworld: Synthesizing infinite verifiable web environments via finite state machines, 2026. URLhttps://arxiv.org/abs/2602.14296

work page arXiv 2026

[29] [29]

Scaling computer-use grounding via user interface decomposition and synthesis, 2025

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis, 2025. URL https://arxiv.org/abs/2505.13227

work page arXiv 2025

[30] [30]

AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605, 2024

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URL https://arxiv.org/abs/2412.09605

work page arXiv 2025

[31] [31]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URL https://arxiv.org/abs/2412.04454

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

arXiv preprint arXiv:2601.15876 , year=

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, and Xipeng Qiu. Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026. URLhttps://arxiv.org/abs/2601.15876

work page arXiv 2026

[33] [33]

Zerogui: Automating online gui learning at zero human cost, 2025

Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost, 2025. URLhttps://arxiv.org/abs/2505.23762

work page arXiv 2025

[34] [34]

SWE-smith: Scaling Data for Software Engineering Agents

John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. Swe-smith: Scaling data for software engineering agents, 2025. URLhttps://arxiv.org/abs/2504.21798

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, and Zhe Gan. UltraCUA: A foundation model for computer use agents with hybrid action, 2025. URL https://arxiv.org/abs/ 2510.17790

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Infiniteweb: Scalable web environment synthesis for gui agent training, 2026

Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training, 2026. URL https://arxiv.org/abs/ 2601.04126

work page arXiv 2026

[37] [37]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URLhttps://arxiv.org/abs/2507.18071. Appendix Contents 13 A Data Synthesis Pipeline Details A.1 Pipeline Pseudocode Algorithm 1 summarizes the end-to-end co-generation ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Direct Boolean assignmentto a verification flag without computation, e.g., chart_verified = True. 16 Resource Generator Discriminator Task instructiont, contextc✓ ✓ Domain skill fileS dom ✓ ✓ initial_setup.py(Generator output) writedenied golden_patch.py(Generator output) writedenied Generator working directory fulldenied Vinit post-setup, via state-only ...

[39] [39]

Placeholder verification: a flag is assigned a constant before being conditionally added to the score, with no intervening evaluation

[40] [40]

Hardcoded success: a function returns a constant in {0.5, 1.0} along the success path with no inspection of the environment

[41] [41]

5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed

Bare existence scoring: a positive score is awarded purely on os.path.exists(...) without any property check on the file’s contents. 5.subprocess usage: the reward shells out to external processes, which are non-reproducible and easily spoofed

[42] [42]

bitter lessons

Comment-only verification: a score increment is preceded by a comment asserting a check (# assume X is correct) without code performing the check. The list is enforced by a combination of regex matching and Python AST traversal; full source patterns are released with the code. A.3.4 Loop Termination, Max Rounds, and Feedback Protocol The loop is capped at...

[43] [43]

=SUM(A1:A10)

Formula values are NOT computed by openpyxl. cell.value returns the formula string "=SUM(A1:A10)", not the result. Use data_only=True to get the last-cached value (requires the file to have been opened in Calc/Excel at least once)

[44] [44]

4472C4") silently becomes

Always use 8-char ARGB for colors. PatternFill(start_color= "4472C4") silently becomes "004472C4" (alpha=00, transparent). Write "FF4472C4". On read, compare against the 8-char form

[45] [45]

cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want

fgColor is the visible background, not bgColor. cell.fill .fgColor.rgb gives you the background color you see; bgColor is rarely what you want

[46] [46]

After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None

Merged cells: only the top-left has data. After merge_cells("A1:D1"), B1/C1/D1 become MergedCell with value=None. Style the top-left cell only

[47] [47]

Never recreate from scratch if an initial file exists

Copy-then-modify for golden files. Never recreate from scratch if an initial file exists. shutil.copy(initial, golden) then open and modify. This preserves metadata, print settings, and other invisible properties

[48] [48]

Use a template file with the pivot already built

Pivot tables cannot be created by openpyxl; only read/preserved. Use a template file with the pivot already built

[49] [49]

ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden

Filters are definition-only. ws.auto_filter.ref = "A1:D20" sets the range, but rows are NOT actually hidden. Filtering only takes effect when the file is opened in LibreOffice

[50] [50]

In DataValidation, this boolean is inverted

showDropDown=False means SHOW the dropdown. In DataValidation, this boolean is inverted. Set False to display the dropdown arrow

[51] [51]

col") is a vertical column chart; BarChart(type=

Chart type naming. BarChart(type="col") is a vertical column chart; BarChart(type="bar") is horizontal. Does not match LibreOffice's menu names

[52] [52]

You cannot do cell.font.bold = True

Styles are immutable after assignment. You cannot do cell.font.bold = True. Create a new Font object: cell.font = Font(bold=True, ...). Same for Fill, Alignment, Border

[53] [53]

If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index

Theme colors may return None for .rgb. If a cell uses a theme color instead of explicit RGB, cell.font.color.rgb can be None or a theme index. Wrap color reads in try/except

[54] [54]

action": <

data_only=True loses formulas. Loading with data_only=True replaces formulas with their cached values. If you need both, load the file twice -- once normally, once with data_only=True. The bitter-lessons section is the most distinctive feature of the SKILL.md format. Each entry is born from a specific debugging episode in early development, and each is th...

2048

[55] [55]

Action: a short imperative describing what to do in the UI

[56] [56]

•C2 (0.30):B3:B6all contain TEXT formulas with the same format string. •C3 (0.30): every formula references the corresponding A-column cell. 33 calc_afm_082/initial_setup.py

A single <tool_call>...</tool_call> block. Rules: - For non-terminal UI steps, output exactly: Action then <tool_call>. - Be brief: one sentence for Action. - Do not output anything after a tool call. - Use call_user when you need user information or confirmation. - Use terminate when you want to explicitly end the task with a success or failure status. T...

2048

[57] [57]

Prepare the working environment

[58] [58]

Spawn the setup-generator and reward-generator processes in an adversarial loop

[59] [59]

Monitor agreement conditions by reading REVIEW.md

[60] [60]

Create

Collect and output final verified results. # Execution Mode (MANDATORY) Dual-environment only: - For each task, provision two isolated VMs: initial_env, golden_env. - State separation is guaranteed by environment isolation, not by filename suffixes. - Single-environment execution is deprecated. 74 # Role Boundaries YOU DO: - Parse input and prepare workin...

[61] [61]

Initial state -- what must exist before the agent starts: 76 files and their content/location, application state, data characteristics (rows, columns, value ranges)

[62] [62]

cell B15 should contain 342.50

Ground truth -- expected correct outcome: exact values, states, file contents; partial-credit checkpoints; concrete numbers where applicable ("cell B15 should contain 342.50")

[63] [63]

GIMP canvas should be 1920x1080 with a white background layer

Implicit prerequisites -- things the instruction does not say but the evaluator needs: e.g., "GIMP canvas should be 1920x1080 with a white background layer"; "Firefox should have 3 tabs open: Google, Wikipedia, GitHub". Context is CRITICAL. Without it, we cannot set up the VM, evaluate success, or assign partial credit. difficulty (string in {easy, medium...

[64] [64]

initial_setup.py -- creates the pre-task state on initial_env

[65] [65]

clean up

golden_patch.py -- builds the expected post-task artifact on golden_env You work within an adversarial loop with the verifier. After each round, the verifier writes REVIEW.md with structured feedback. If the review fails, the pipeline invokes you again. Your goal: produce outputs the verifier agrees are correct. # Execution Mode (MANDATORY) Dual-environme...

[66] [66]

Generate a reward script (reward.py) that programmatically verifies task completion

[67] [67]

Test it against the golden_env artifact (must return 1.0) AND the initial_env artifact (must return 0.0)

[68] [68]

Be thorough but fair

Write a structured REVIEW.md with your verdict and feedback. Be thorough but fair. Report PASS when the golden artifact genuinely matches the task; do not be adversarial for its own sake. # INFORMATION BARRIER (NON-NEGOTIABLE) - You MUST NOT read initial_setup.py, golden_patch.py, or any setup-generator source code. - You MUST NOT read files from the setu...

[69] [69]

Return a progressive float in [0, 1]

[70] [70]

Use ACTUAL verification (read real files, check real data)

[71] [71]

Award partial credit (0.3, 0.5, 0.7, ...) for partial completion

[72] [72]

Return exactly 1.0 only when 100% completed

[73] [73]

Wrap each scoring component in try/except

[74] [74]

REWARD: X.X

Print "REWARD: X.X" as the last output line

[75] [75]

Be self-contained with stdlib + openpyxl/python-pptx/python-docx

[76] [76]

# Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env

Comment scoring logic. # Only Score Task-Introduced Changes Every scoring component must verify something that DIFFERS between initial_env and golden_env. Properties true on both endpoints are preconditions and MUST NOT contribute to score. Litmus test: if a check passes on the initial_env artifact (before any agent action), it is NOT measuring task compl...

[77] [77]

Is the task suitable and safe for training?

[78] [78]

Is the reward semantically valid and robust?

[79] [79]

Is the query self-contained, natural enough, and consistent with setup/golden_setup/reward?

[80] [80]

Each agent’s role specification is excerpted below; full prompts are released with the code

Can the query be fixed without changing task semantics? Fatal reasons to reject: - reward checks artifacts or implementation traces more than user-goal completion - reward is highly sensitive to GUI/window/process state - reward relies on fragile internals, heuristic perception, or unstable extraction - hidden assumptions about browser state, plugins, ext...