ProCUA-SFT Technical Report
Pith reviewed 2026-06-27 03:14 UTC · model grok-4.3
The pith
Fine-tuning UI-TARS 7B on ProCUA-SFT raises OSWorld success from 26.3 percent to 45.0 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProCUA-SFT is produced by a fully automated pipeline that synthesizes grounded tasks on live desktops seeded with real-world content from sources like SpreadsheetBench and Zenodo10K, verifies each task's feasibility through binary precondition checking, and uses a single VLM to act as goal generator, precondition judge, and trajectory executor. Each trajectory is expanded into step-prefix samples that reproduce the context layout at inference time. Fine-tuning the UI-TARS 7B model on this 3.1M sample dataset for one epoch yields 45.0 percent success on OSWorld.
What carries the argument
The single-VLM automated pipeline that generates tasks, judges preconditions, executes trajectories, and expands them into step-prefix SFT samples.
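The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `VLM` callable interface, the role names, the `synthesize` function, and the `stub_vlm` stand-in are all hypothetical, and the real pipeline operates on live desktops and screenshots rather than strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interface: one model, addressed under different roles.
VLM = Callable[[str, str], str]  # (role, prompt) -> text


@dataclass
class Trajectory:
    goal: str
    steps: List[str] = field(default_factory=list)


def synthesize(vlm: VLM, desktop_state: str, max_steps: int = 3) -> Optional[Trajectory]:
    """Generate a goal, gate it on a binary precondition check, then roll out."""
    goal = vlm("goal_generator", desktop_state)
    verdict = vlm("precondition_judge", f"{desktop_state}\nGOAL: {goal}")
    if verdict.strip().lower() != "yes":
        return None  # judged infeasible: dropped before any rollout
    traj = Trajectory(goal)
    for _ in range(max_steps):
        action = vlm("executor", f"{goal}\nHISTORY: {traj.steps}")
        traj.steps.append(action)
        if action == "DONE":
            break
    return traj


def stub_vlm(role: str, prompt: str) -> str:
    """Toy stand-in so the sketch runs without a real model."""
    if role == "goal_generator":
        return "Sum column B in budget.xlsx"
    if role == "precondition_judge":
        state = prompt.split("\nGOAL:")[0]
        return "yes" if "budget.xlsx" in state else "no"
    # executor: act once, then stop
    return "DONE" if "click" in prompt else "click(B10)"


ok = synthesize(stub_vlm, "open files: budget.xlsx")
rejected = synthesize(stub_vlm, "open files: notes.txt")
```

The point of the sketch is that the same callable fills all three roles, and that the precondition verdict is a hard yes/no gate applied before any rollout cost is spent.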
If this is right
- The synthetic data avoids the negative transfer that occurs with AgentNet human trajectories.
- Scaling to 2,484 application combinations with diverse real content seeds enables broad coverage.
- A subset of the data contributed to training Nemotron 3 Nano Omni for computer-use tasks.
- Step-prefix expansion ensures training matches the exact context seen during inference.
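One plausible reading of step-prefix expansion is sketched below, under stated assumptions. The abstract does not give the exact prefixing procedure (the referee report flags this), so the sample layout, the field names, and the `expand_step_prefixes` function are illustrative only. In particular, the real context may truncate history or keep only a fixed number of past screenshots.

```python
from typing import Dict, List, Tuple

# (observation, action); the observation string stands in for a screenshot.
Step = Tuple[str, str]


def expand_step_prefixes(goal: str, steps: List[Step]) -> List[Dict]:
    """Turn one trajectory of T steps into T training samples.

    Sample t carries the goal, the first t (observation, action) pairs as
    history, and the current observation; its target is the action at step t.
    """
    samples = []
    for t, (obs, action) in enumerate(steps):
        samples.append({
            "goal": goal,
            "history": list(steps[:t]),  # everything seen before step t
            "observation": obs,
            "target": action,
        })
    return samples


trajectory = [("shot_0", "click(File)"), ("shot_1", "click(Save)"), ("shot_2", "DONE")]
samples = expand_step_prefixes("Save the document", trajectory)
```

Under this reading, each sample shows the model the same inputs it would see at that step during inference, which is the property the abstract emphasizes.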
Where Pith is reading between the lines
- Larger volumes of synthetic trajectories may compensate for any quality differences from human data in agent training.
- Similar single-model pipelines could be tested on other agent benchmarks beyond OSWorld.
- The success suggests that capability gaps between planner and actor are reduced when one model handles all roles.
Load-bearing premise
A single VLM can generate feasible tasks, judge binary preconditions correctly, and execute trajectories without systematic errors or biases different from human data.
What would settle it
Evaluating the model after training on a version of the dataset where trajectories are filtered or corrected by humans to remove any VLM-specific errors would show if the gains depend on the automated generation process.
Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
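The headline figures in the abstract can be checked with a few lines of arithmetic. The inputs below are taken directly from the abstract; the variable names are ours. Note that the "over 35%" gap to AgentNet-trained models only works out as an absolute difference in percentage points (45.0 minus 8 to 10), which matches the ">35 pp" wording in the referee summary.

```python
base, tuned = 26.3, 45.0                  # OSWorld success (%), from the abstract
agentnet_low, agentnet_high = 8.0, 10.0   # AgentNet-trained range (%), from the abstract

gain_pp = round(tuned - base, 1)                   # absolute gain, percentage points
gain_rel = round(100 * (tuned - base) / base, 1)   # relative gain over the base, percent
gap_pp = (round(tuned - agentnet_high, 1), round(tuned - agentnet_low, 1))

samples, trajectories = 3.1e6, 93e3
samples_per_traj = round(samples / trajectories, 1)

print(gain_pp, gain_rel, gap_pp, samples_per_traj)
```

The ratio of samples to trajectories (about 33) would correspond to the average trajectory length if each step yields exactly one sample. That is an inference from the reported totals, not a number the abstract states.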
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProCUA-SFT, a dataset of 3.1M step-level SFT samples derived from 93K synthetic trajectories generated by a single VLM (Kimi-K2.5) across 2,484 application combinations using real-world content seeds. It reports that one-epoch SFT of UI-TARS 7B on this data yields 45.0% success on OSWorld (18.7 pp above the 26.3% base model and >35 pp above AgentNet-trained models), while human data causes negative transfer; a subset was used in Nemotron 3 Nano Omni.
Significance. If the synthetic data quality holds, the work demonstrates a scalable automated pipeline for CUA training data that outperforms limited human trajectories and addresses data scarcity. The scale, the concrete benchmark lift, and the downstream model incorporation are clear strengths. The result is empirically grounded with no circularity in the reported metrics.
major comments (3)
- [Abstract] The central claim of a 45.0% OSWorld success rate (and the 18.7 pp improvement) is presented without error bars, standard deviations, or the number of evaluation runs, leaving the statistical reliability of the headline number unsupported.
- [Abstract / pipeline description] The superiority claim over AgentNet requires that the single-VLM pipeline (goal generation, binary precondition checks, and full rollouts) introduces no systematic biases differing from human data, yet no human validation, inter-annotator agreement, or error-rate statistics on the generated tasks/trajectories are supplied.
- [Abstract] No ablation results or distribution-shift diagnostics are reported (e.g., comparing synthetic vs. OSWorld task distributions or isolating the effect of precondition checking), which are load-bearing for attributing the performance gain to data quality rather than other factors.
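To make the first major comment concrete, the sketch below computes a Wilson score interval for a single-run success rate. It is a rough illustration only: the task count `n_tasks = 369` is an assumption (the page does not state how many OSWorld tasks were evaluated), and the interval treats tasks as independent pass/fail trials, so it reflects task-sampling uncertainty and says nothing about run-to-run variance.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


n_tasks = 369  # ASSUMPTION: not stated on this page; substitute the real count
low, high = wilson_interval(0.450, n_tasks)
print(f"{100 * low:.1f}% .. {100 * high:.1f}%")
```

With a few hundred tasks, the interval spans on the order of plus or minus five percentage points around 45.0%, which is why the referee asks for the number of runs and a measure of spread.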
minor comments (1)
- [Abstract] The construction of step-prefix samples that 'exactly reproduce the context layout seen at inference time' would benefit from one additional sentence clarifying the exact prefixing procedure.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below, indicating where revisions will be made to improve statistical reporting and discussion of limitations while defending the core empirical claims based on the reported benchmark results.
- Referee: [Abstract] The central claim of a 45.0% OSWorld success rate (and the 18.7 pp improvement) is presented without error bars, standard deviations, or the number of evaluation runs, leaving the statistical reliability of the headline number unsupported.
  Authors: We agree that explicit reporting of evaluation runs and variability would strengthen the abstract. The 45.0% figure reflects a single standard OSWorld evaluation run following the benchmark protocol. In the revised manuscript we will state the number of runs explicitly and add results from additional independent evaluations (with standard deviation) if compute permits, or note the single-run limitation with justification.
  Revision: yes
- Referee: [Abstract / pipeline description] The superiority claim over AgentNet requires that the single-VLM pipeline (goal generation, binary precondition checks, and full rollouts) introduces no systematic biases differing from human data, yet no human validation, inter-annotator agreement, or error-rate statistics on the generated tasks/trajectories are supplied.
  Authors: The primary evidence for pipeline quality remains the large, consistent lift on the independent OSWorld benchmark (real desktop environments, tasks unseen during generation) together with the observed negative transfer from AgentNet human trajectories. The pipeline grounds tasks via real-world content seeds and binary precondition checks performed by the same VLM. We did not collect human annotations or compute inter-annotator agreement, as the process is fully automated. In revision we will add an explicit limitations paragraph discussing possible biases and how precondition checking reduces infeasible trajectories, while retaining the empirical benchmark comparison as the central support.
  Revision: partial
- Referee: [Abstract] No ablation results or distribution-shift diagnostics are reported (e.g., comparing synthetic vs. OSWorld task distributions or isolating the effect of precondition checking), which are load-bearing for attributing the performance gain to data quality rather than other factors.
  Authors: We acknowledge that targeted ablations and distribution diagnostics would allow finer attribution. The technical report prioritizes the end-to-end pipeline and its net effect on OSWorld; no explicit task-distribution comparisons or precondition-check ablations were performed. The negative transfer from AgentNet already indicates that simply adding more data does not explain the gain. In the revision we will expand the discussion section to include qualitative analysis of task diversity and precondition-checking rationale, while noting full ablations as future work rather than adding new experiments at this stage.
  Revision: partial
Circularity Check
No circularity; all results are direct empirical measurements on an external benchmark.
The paper describes an automated data-generation pipeline using a single VLM and reports measured OSWorld success rates after SFT. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citation chains or uniqueness theorems are invoked to justify core claims. The reported 45.0% result is a straightforward benchmark measurement, not a quantity derived from the inputs by construction. This is the normal case of an empirical technical report with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A single VLM can serve as goal generator, precondition judge, and trajectory executor without capability gaps that would invalidate the resulting trajectories.