ProCUA-SFT Technical Report

arxiv: 2606.17321 · v1 · pith:Q445V2DCnew · submitted 2026-06-15 · 💻 cs.LG · cs.CV

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

Pith reviewed 2026-06-27 03:14 UTC · model grok-4.3

keywords: computer use agents · synthetic trajectories · supervised fine tuning · OSWorld · vision language model · desktop environments · agent training data

The pith

Fine-tuning UI-TARS 7B on ProCUA-SFT raises OSWorld success from 26.3 percent to 45 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProCUA-SFT, a dataset of 3.1 million step-level samples created from 93,000 synthetic trajectories in desktop environments. It uses an automated pipeline with one vision-language model to generate tasks on real content, check preconditions, and produce trajectories, then converts them into training samples that match inference conditions. This approach addresses the negative transfer observed when using human-collected data like AgentNet, which drops performance below the base model. Fine-tuning on this dataset achieves 45 percent success on OSWorld, an 18.7 point gain over the base UI-TARS 7B and more than 35 points above AgentNet-trained versions.

Core claim

ProCUA-SFT is produced by a fully automated pipeline that synthesizes grounded tasks on live desktops seeded with real-world content from sources like SpreadsheetBench and Zenodo10K, verifies each task's feasibility through binary precondition checking, and uses a single VLM to act as goal generator, precondition judge, and trajectory executor. Each trajectory is expanded into step-prefix samples that reproduce the context layout at inference time. Fine-tuning the UI-TARS 7B model on this 3.1M sample dataset for one epoch yields 45.0 percent success on OSWorld.
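
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `VLM` protocol, its method names (`propose_goal`, `precondition_holds`, `rollout`), the `StubVLM` class, and the seed strings are all hypothetical, introduced only to show one model object filling the three roles with a binary feasibility gate before any rollout.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Trajectory:
    goal: str
    steps: List[str]  # one action string per step (hypothetical format)


class VLM(Protocol):
    # Hypothetical interface: the report only says one VLM fills all three roles.
    def propose_goal(self, seed: str) -> str: ...
    def precondition_holds(self, seed: str, goal: str) -> bool: ...
    def rollout(self, seed: str, goal: str) -> List[str]: ...


class StubVLM:
    """Deterministic stand-in so the sketch runs without a real model."""

    def propose_goal(self, seed: str) -> str:
        return f"Summarize the contents of {seed}"

    def precondition_holds(self, seed: str, goal: str) -> bool:
        # Binary check: in this toy, a task is feasible only if the seed is non-empty.
        return bool(seed.strip())

    def rollout(self, seed: str, goal: str) -> List[str]:
        return [f"open({seed!r})", "read()", "done()"]


def synthesize(seeds: List[str], model: VLM) -> List[Trajectory]:
    """Propose a goal per seed, keep it only if the precondition check passes,
    then let the same model execute it."""
    kept = []
    for seed in seeds:
        goal = model.propose_goal(seed)
        if not model.precondition_holds(seed, goal):
            continue  # infeasible task: skip before spending a rollout
        kept.append(Trajectory(goal=goal, steps=model.rollout(seed, goal)))
    return kept


trajectories = synthesize(["budget.xlsx", "   ", "talk.pptx"], StubVLM())
```

In this toy run the blank seed is rejected by the precondition gate, so only two of the three seeds produce trajectories. Passing the same `model` object to every stage is the point of the sketch: no separate planner and actor exist whose capabilities could diverge.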

What carries the argument

The single-VLM automated pipeline that generates tasks, judges preconditions, executes trajectories, and expands them into step-prefix SFT samples.

If this is right

The synthetic data avoids the negative transfer that occurs with AgentNet human trajectories.
Scaling to 2,484 application combinations with diverse real content seeds enables broad coverage.
A subset of the data contributed to training Nemotron 3 Nano Omni for computer-use tasks.
Step-prefix expansion ensures training matches the exact context seen during inference.
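
The abstract does not spell out the step-prefix format, so the following is one plausible reading rather than the paper's procedure: a trajectory of T steps yields T samples, where sample t carries the task, the earlier steps as context, the current observation, and the action at step t as the target. The field names and the `max_history` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (observation, action); strings stand in for screenshots


def expand_to_prefix_samples(
    task: str, steps: List[Step], max_history: Optional[int] = None
) -> List[Dict]:
    """Turn one trajectory of T steps into T training samples.

    Sample t holds the earlier steps as context, the current observation,
    and the action taken at step t as the prediction target.
    """
    samples = []
    for t, (observation, action) in enumerate(steps):
        history = steps[:t]
        if max_history is not None:
            # Keep only the most recent steps (hypothetical context budget).
            history = history[len(history) - min(len(history), max_history):]
        samples.append(
            {"task": task, "history": list(history),
             "observation": observation, "target": action}
        )
    return samples


demo = [("screen_0", "click(File)"), ("screen_1", "click(Save)"), ("screen_2", "done()")]
samples = expand_to_prefix_samples("Save the document", demo)
```

Under this reading, each training sample shows the model the same prefix it would see at that step during inference, so the layout of the context at training time matches deployment.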

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger volumes of synthetic trajectories may compensate for any quality differences from human data in agent training.
Similar single-model pipelines could be tested on other agent benchmarks beyond OSWorld.
The success suggests that capability gaps between planner and actor are reduced when one model handles all roles.

Load-bearing premise

A single VLM can generate feasible tasks, judge binary preconditions correctly, and execute trajectories without systematic errors or biases different from human data.

What would settle it

Evaluating the model after training on a version of the dataset where trajectories are filtered or corrected by humans to remove any VLM-specific errors would show if the gains depend on the automated generation process.

Original abstract

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
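
The headline figures in the abstract can be checked with simple arithmetic. The block below is a worked restatement, not a result from the paper. It reads the abstract's "over 35%" as a difference in percentage points, which is an interpretation (a relative gain over an 8-10% baseline would be far larger), and the last line assumes every training sample corresponds to one trajectory step.

```latex
\begin{align*}
\Delta_{\text{base}} &= 45.0 - 26.3 = 18.7 \text{ pp} \\
\Delta_{\text{AgentNet}} &\in [45.0 - 10,\; 45.0 - 8] = [35.0,\; 37.0] \text{ pp} \\
\frac{3.1 \times 10^{6}\ \text{samples}}{93 \times 10^{3}\ \text{trajectories}} &\approx 33.3\ \text{samples per trajectory}
\end{align*}
```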

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper delivers a 3.1M-sample synthetic dataset that lifts UI-TARS from 26% to 45% on OSWorld, but the single-VLM pipeline leaves the quality of the data unverified.

The main result is straightforward: one epoch of SFT on ProCUA-SFT moves UI-TARS 7B from 26.3% to 45% success on OSWorld, while AgentNet training drops it to 8-10%. That gap is the thing a colleague should note first.

What is new is the fully automated pipeline that uses a single VLM (Kimi-K2.5) to generate tasks on live desktops seeded with real spreadsheets and presentations, run binary precondition checks, and produce full trajectories. The 93K trajectories are turned into 3.1M step-prefix samples that match the exact context layout seen at inference. This removes the planner-actor split and scales far beyond the 22.5K human trajectories in AgentNet. The work also shows the negative transfer problem with small human data and reports that a subset went into Nemotron 3 Nano Omni.

The empirical lift is the strongest part. The numbers are stated clearly and the pipeline description is concrete enough that another group could attempt replication.

The soft spot is the lack of any check on whether the synthetic data actually matches human distributions. The same model handles goal generation, precondition judgment, and execution, so any consistent bias in feasibility assessment or UI preference gets repeated across every sample. The abstract gives no human validation, inter-annotator numbers, or error rates on the generated trajectories, and there are no ablations or error bars on the 45% figure. That leaves open the possibility that the gain is partly distribution matching to Kimi-K2.5 rather than a general improvement in agent capability.

This is for labs working on computer-use agents who need large SFT corpora and are willing to test synthetic data themselves. A reader focused on scaling trajectories would find the pipeline details useful. The work is coherent enough and the result large enough that it deserves a serious referee, though the review should press on data validation.

Referee Report

3 major / 1 minor

Summary. The paper introduces ProCUA-SFT, a dataset of 3.1M step-level SFT samples derived from 93K synthetic trajectories generated by a single VLM (Kimi-K2.5) across 2,484 application combinations using real-world content seeds. It reports that one-epoch SFT of UI-TARS 7B on this data yields 45.0% success on OSWorld (18.7 pp above the 26.3% base model and >35 pp above AgentNet-trained models), while human data causes negative transfer; a subset was used in Nemotron 3 Nano Omni.

Significance. If the synthetic data quality holds, the work demonstrates a scalable automated pipeline for CUA training data that outperforms limited human trajectories and addresses data scarcity. The scale, the concrete benchmark lift, and the downstream model incorporation are clear strengths. The result is empirically grounded with no circularity in the reported metrics.

major comments (3)

[Abstract] The central claim of a 45.0% OSWorld success rate (and the 18.7 pp improvement) is presented without error bars, standard deviations, or the number of evaluation runs, leaving the statistical reliability of the headline number unsupported.
[Abstract / pipeline description] The superiority claim over AgentNet requires that the single-VLM pipeline (goal generation, binary precondition checks, and full rollouts) introduces no systematic biases differing from human data, yet no human validation, inter-annotator agreement, or error-rate statistics on the generated tasks/trajectories are supplied.
[Abstract] No ablation results or distribution-shift diagnostics are reported (e.g., comparing synthetic vs. OSWorld task distributions or isolating the effect of precondition checking), which are load-bearing for attributing the performance gain to data quality rather than other factors.
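
To make the first comment concrete, a single-run success rate can at least be given a binomial confidence interval. The sketch below computes a Wilson score interval using only the standard library. The task count `n_tasks = 369` is an assumption about the size of the OSWorld evaluation set, not a number stated in this report; substitute the count actually evaluated. The calculation treats tasks as independent trials and captures only sampling noise over tasks, not run-to-run variance from the environment or from decoding.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_tasks = 369  # ASSUMPTION: evaluation set size; not given in the abstract
successes = round(0.45 * n_tasks)  # 45.0% reported success rate
low, high = wilson_interval(successes, n_tasks)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With these assumed inputs the interval is about ten percentage points wide. Reporting it, alongside the number of evaluation runs, would address part of the comment without extra compute, though it is no substitute for repeated runs.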

minor comments (1)

[Abstract] The construction of step-prefix samples that 'exactly reproduce the context layout seen at inference time' would benefit from one additional sentence clarifying the exact prefixing procedure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below, indicating where revisions will be made to improve statistical reporting and discussion of limitations while defending the core empirical claims based on the reported benchmark results.

Referee: [Abstract] The central claim of a 45.0% OSWorld success rate (and the 18.7 pp improvement) is presented without error bars, standard deviations, or the number of evaluation runs, leaving the statistical reliability of the headline number unsupported.

Authors: We agree that explicit reporting of evaluation runs and variability would strengthen the abstract. The 45.0% figure reflects a single standard OSWorld evaluation run following the benchmark protocol. In the revised manuscript we will state the number of runs explicitly and add results from additional independent evaluations (with standard deviation) if compute permits, or note the single-run limitation with justification.
revision: yes

Referee: [Abstract / pipeline description] The superiority claim over AgentNet requires that the single-VLM pipeline (goal generation, binary precondition checks, and full rollouts) introduces no systematic biases differing from human data, yet no human validation, inter-annotator agreement, or error-rate statistics on the generated tasks/trajectories are supplied.

Authors: The primary evidence for pipeline quality remains the large, consistent lift on the independent OSWorld benchmark (real desktop environments, tasks unseen during generation) together with the observed negative transfer from AgentNet human trajectories. The pipeline grounds tasks via real-world content seeds and binary precondition checks performed by the same VLM. We did not collect human annotations or compute inter-annotator agreement, as the process is fully automated. In revision we will add an explicit limitations paragraph discussing possible biases and how precondition checking reduces infeasible trajectories, while retaining the empirical benchmark comparison as the central support.
revision: partial

Referee: [Abstract] No ablation results or distribution-shift diagnostics are reported (e.g., comparing synthetic vs. OSWorld task distributions or isolating the effect of precondition checking), which are load-bearing for attributing the performance gain to data quality rather than other factors.

Authors: We acknowledge that targeted ablations and distribution diagnostics would allow finer attribution. The technical report prioritizes the end-to-end pipeline and its net effect on OSWorld; no explicit task-distribution comparisons or precondition-check ablations were performed. The negative transfer from AgentNet already indicates that simply adding more data does not explain the gain. In the revision we will expand the discussion section to include qualitative analysis of task diversity and precondition-checking rationale, while noting full ablations as future work rather than adding new experiments at this stage.
revision: partial

Circularity Check

0 steps flagged

No circularity; all results are direct empirical measurements on an external benchmark.

The paper describes an automated data-generation pipeline using a single VLM and reports measured OSWorld success rates after SFT. No equations, fitted parameters, or predictions are defined in terms of themselves. No self-citation chains or uniqueness theorems are invoked to justify core claims. The reported 45.0% result is a straightforward benchmark measurement, not a quantity derived from the inputs by construction. This is the normal case of an empirical technical report with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on the domain assumption that the VLM-based generation and verification process produces trajectories whose distribution supports positive transfer; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: A single VLM can serve as goal generator, precondition judge, and trajectory executor without capability gaps that would invalidate the resulting trajectories.
The pipeline description states that Kimi-K2.5 performs all three roles.

pith-pipeline@v0.9.1-grok · 5875 in / 1356 out tokens · 41440 ms · 2026-06-27T03:14:32.046705+00:00 · methodology

