CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Difei Gao; Haobo Hu; Haotian Liu; Libiao Jin; Qi Mao; Xiangwu Guo; Zhiheng Chen

arXiv: 2605.19484 · v1 · pith:EJLEKK62new · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.GR · cs.HC


Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

Pith reviewed 2026-05-20 07:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR · cs.HC

keywords GUI agents · benchmark · media editing · post-production · video editing · compositional actions · screen recordings


The pith

Existing GUI agents succeed on only 36 percent of complex media post-production tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CutVerse, a benchmark of 186 tasks drawn from expert workflows in seven professional applications including Premiere Pro and Photoshop. A lightweight parser converts screen recordings and interaction logs into structured compositional action trajectories to enable scalable testing. Evaluations of current agents yield a 36 percent overall success rate, exposing shortfalls in managing long sequences of tightly coupled actions and domain-specific planning within dense multimodal interfaces. A sympathetic reader would care because professional creative tools demand precise, extended interactions that general agents have not yet mastered.

Core claim

The paper establishes that autonomous GUI agents face significant challenges in realistic media post-production environments, as shown by their 36 percent success rate across 186 expert-curated tasks that involve dense multimodal interfaces and long-horizon interaction sequences in applications such as Premiere Pro and Photoshop. The benchmark relies on a lightweight parser that transforms raw screen recordings and low-level logs into structured, compositional GUI action trajectories. While models exhibit promising spatial grounding, multimodal alignment, and coordinated execution, they remain limited in long-horizon reliability and domain-specific planning.

What carries the argument

The CutVerse benchmark together with its lightweight parser that converts screen recordings and interaction logs into structured compositional GUI action trajectories for evaluation.
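
To make the idea of a compositional action trajectory concrete, here is a minimal Python sketch. It is not the paper's parser, whose internals are not described in the abstract. The event fields (`t`, `kind`, `x`, `y`), the `RawEvent` and `Action` types, and the 5-pixel drag threshold are all illustrative assumptions. The sketch pairs low-level mouse-down and mouse-up events and labels each pair as a click or a drag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RawEvent:
    t: float   # timestamp in seconds (hypothetical log field)
    kind: str  # "down" or "up" (hypothetical event names)
    x: int
    y: int


@dataclass(frozen=True)
class Action:
    name: str               # "click" or "drag"
    start: Tuple[int, int]  # screen position at mouse-down
    end: Tuple[int, int]    # screen position at mouse-up
    t_start: float
    t_end: float


DRAG_THRESHOLD_PX = 5  # assumed: movement below this counts as a click


def parse_events(events: List[RawEvent]) -> List[Action]:
    """Pair each mouse-down with the next mouse-up and label the result."""
    actions: List[Action] = []
    pending: Optional[RawEvent] = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "down":
            pending = ev
        elif ev.kind == "up" and pending is not None:
            dx, dy = ev.x - pending.x, ev.y - pending.y
            moved = (dx * dx + dy * dy) ** 0.5
            name = "drag" if moved >= DRAG_THRESHOLD_PX else "click"
            actions.append(
                Action(name, (pending.x, pending.y), (ev.x, ev.y), pending.t, ev.t)
            )
            pending = None
    return actions


# Invented log: a near-stationary press, then a long horizontal drag.
log = [
    RawEvent(0.0, "down", 100, 200), RawEvent(0.1, "up", 101, 200),
    RawEvent(1.0, "down", 300, 400), RawEvent(1.8, "up", 520, 400),
]
trajectory = parse_events(log)
print([a.name for a in trajectory])
```

A real parser for dense editing interfaces would also need to ground each action to a UI element in the recording (for example a timeline clip or a parameter slider), which this sketch leaves out.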

If this is right

Agents must improve long-horizon reliability to handle extended editing sequences.
Domain-specific planning capabilities require targeted advances to support professional media workflows.
The structured action trajectories produced by the parser offer a repeatable basis for measuring progress in creative GUI tasks.
Limitations in current models highlight the need for better coordination across multimodal inputs in dense interfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to measure how agents handle collaborative editing sessions involving multiple users.
Insights from the 36 percent baseline might inform training regimes that incorporate more realistic media-specific planning objectives.
If the parser's trajectories prove reusable across new applications, the same evaluation approach could scale to other professional software domains.

Load-bearing premise

The 186 tasks and the lightweight parser that converts screen recordings into compositional action trajectories accurately capture authentic professional editing workflows and enable fair, scalable evaluation.

What would settle it

A new agent or improved version that reaches substantially above 36 percent task success on the full set of 186 tasks without altering the benchmark itself would undermine the claim that complex long-horizon media workflows pose fundamental difficulties for current agents.

Original abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark.While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CutVerse is a new benchmark for GUI agents in media post-production showing 36% agent success, but its evaluation depends on an unvalidated parser.


Hey, the punchline here is that CutVerse is a new benchmark targeting GUI agents in media post-production, with 186 tasks showing existing agents at just 36% success. This highlights the difficulties with long-horizon tasks in complex creative software.

What the paper does well is curate tasks from expert demonstrations in seven professional applications, including Premiere Pro and Photoshop. These are grounded in real editing workflows that involve multimodal elements and sequential interactions. The lightweight parser for converting screen recordings and interaction logs into structured trajectories is a practical tool that could help with scalable assessment of agent performance.

On the soft spots, the evaluation's reliability hinges on that parser being faithful. The abstract and description give no quantitative validation, such as error rates or inter-annotator checks, so we can't be sure the failure modes are accurately measured rather than artifacts of the extraction process. There's also limited information on the baselines used for comparison and the criteria for selecting the 186 tasks, which leaves some questions about how representative and fair the results are.

This paper is for researchers in AI agents and human-computer interaction who are interested in moving benchmarks toward professional domains. A reader focused on multimodal agents or applications in content creation would get value from the new task collection and the emphasis on planning limitations.

It deserves a serious referee because it introduces concrete evaluation material in an area that has been underexplored. The work engages directly with the literature on GUI agents by extending it to this domain. I'd recommend sending it for peer review, with the expectation that reviewers will want more details on the parser's accuracy and the experimental protocol to strengthen the claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces CutVerse, a benchmark for GUI agents in realistic media post-production workflows. It curates 186 long-horizon tasks across seven professional applications (e.g., Premiere Pro, Photoshop) derived from expert demonstrations, develops a lightweight parser to convert screen recordings and interaction logs into structured compositional action trajectories, and reports that existing agents achieve only 36.0% task success, highlighting limitations in long-horizon reliability and domain-specific planning despite strengths in spatial grounding and multimodal alignment.

Significance. If the tasks faithfully represent professional editing workflows and the parser produces reliable ground-truth trajectories, the benchmark would fill a notable gap in GUI-agent evaluation by moving beyond web navigation and basic OS tasks into dense, multimodal creative domains. The 36% success rate, if robust, would provide a concrete, falsifiable signal of current limitations and could serve as a useful testbed for future agent research focused on compositional planning.

major comments (2)

[Parser / Trajectory Extraction] The description of the lightweight parser (which produces the compositional trajectories used as ground truth) provides no quantitative validation such as error rates on a held-out set, inter-annotator agreement, or systematic manual spot-checks. Because the headline 36% success rate is measured against these parsed trajectories, any systematic mis-grounding of steps in dense interfaces (timeline scrubbing + parameter panels) would directly undermine the claim that the failures reflect intrinsic agent limitations rather than evaluation artifacts.
[Experiments / Results] The evaluation section reports a 36.0% task success rate but supplies no details on the specific agents or models tested, the baselines chosen, failure-mode breakdown, statistical significance, or how tasks were selected or stratified by horizon length. Without these elements it is impossible to determine whether the result supports the stated conclusions about long-horizon reliability.
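
The statistical-significance concern can be made concrete with a confidence interval on a task success rate. The Python sketch below computes a Wilson score interval. The count of 67 successes out of 186 tasks is an assumption chosen only because it rounds to 36.0%. The paper does not report this count, and the headline rate may aggregate several agents.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: 67/186 rounds to 36.0%, but the paper does not state them.
low, high = wilson_interval(67, 186)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumed counts the interval spans roughly 29% to 43%, which shows why per-agent counts and a breakdown by horizon length matter when comparing systems on a benchmark of this size.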

minor comments (1)

[Abstract] The abstract contains a typographical error: 'benchmark.While current models' should read 'benchmark. While current models'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our submission. We have carefully considered the major comments and provide point-by-point responses below. We indicate where we plan to make revisions to the manuscript.


Referee: [Parser / Trajectory Extraction] The description of the lightweight parser (which produces the compositional trajectories used as ground truth) provides no quantitative validation such as error rates on a held-out set, inter-annotator agreement, or systematic manual spot-checks. Because the headline 36% success rate is measured against these parsed trajectories, any systematic mis-grounding of steps in dense interfaces (timeline scrubbing + parameter panels) would directly undermine the claim that the failures reflect intrinsic agent limitations rather than evaluation artifacts.

Authors: We agree with the referee that providing quantitative validation for the parser is essential to substantiate the reliability of the ground-truth trajectories. The manuscript currently describes the parser's design and its role in converting screen recordings and interaction logs into structured trajectories but does not include error rates or agreement metrics. In the revised version, we will add a dedicated subsection on parser validation, including results from systematic manual spot-checks on a held-out set of 20 trajectories, inter-annotator agreement scores from two expert annotators, and reported error rates for key actions such as timeline scrubbing and parameter adjustments. This will help confirm that the 36% success rate primarily reflects agent limitations rather than parsing artifacts.

revision: yes

Referee: [Experiments / Results] The evaluation section reports a 36.0% task success rate but supplies no details on the specific agents or models tested, the baselines chosen, failure-mode breakdown, statistical significance, or how tasks were selected or stratified by horizon length. Without these elements it is impossible to determine whether the result supports the stated conclusions about long-horizon reliability.

Authors: We acknowledge that additional details are needed in the experiments section to fully support our conclusions. The current manuscript reports the overall 36.0% task success rate for existing agents but does not elaborate on the specific models, baselines, or breakdowns. We will revise the paper to include: a comprehensive list of the evaluated GUI agents and models with their configurations; a description of the baselines used for comparison; a detailed failure-mode analysis categorized by task horizon length and error types (e.g., planning failures vs. execution errors); statistical significance testing for the reported success rates; and information on how the 186 tasks were selected and stratified by complexity and horizon length. These enhancements will provide a clearer picture of the limitations in long-horizon reliability and domain-specific planning.

revision: yes
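
The inter-annotator agreement promised in the first response is commonly reported as Cohen's kappa. The following Python sketch computes it for two annotators judging the same parsed steps. The label names and the ten example judgments are invented for illustration and are not data from the paper, which does not say which agreement statistic the authors will use.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented judgments on whether each parsed step is grounded correctly.
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "ok", "bad", "ok", "ok", "ok"]
annotator_2 = ["ok", "ok", "bad", "bad", "ok", "ok", "bad", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 9 of 10 steps, but kappa is lower than 0.9 because most steps are labelled "ok" by both, so a good share of that agreement would be expected by chance.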

Circularity Check

0 steps flagged

No circularity: empirical benchmark result on newly curated tasks


The paper introduces a new benchmark (CutVerse) consisting of 186 tasks curated from expert demonstrations across professional media applications, along with a lightweight parser to convert recordings into compositional trajectories. The headline result (36% task success for existing agents) is a direct empirical measurement of agent performance on this explicitly defined task set. No mathematical derivations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. The evaluation pipeline is presented as a novel contribution for scalable assessment rather than a self-referential loop, rendering the performance gap an observation on external agents rather than a forced outcome of the paper's own definitions or prior claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the benchmark construction itself is the contribution.


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 4 internal anchors

[1]

Claude 4.6 model card

Anthropic. Claude 4.6 model card. Technical report, 2026

work page 2026
[2]

Windows agent arena: Evaluating multi-modal OS agents at scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Keunho Jang, and Zheng Hui. Windows agent arena: Evaluating multi-modal OS agents at scale. InForty-second International Conference on Machine Learning, 2025

work page 2025
[3]

Ui-ins: Enhancing gui grounding with multi-perspective instruction-as-reasoning, 2025

Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, and Steven Hoi. Ui-ins: Enhancing gui grounding with multi-perspective instruction-as-reasoning, 2025

work page 2025
[4]

Ivebench: Modern benchmark suite for instruction-guided video editing assessment

Yinan Chen, Jiangning Zhang, Teng Hu, Yuxiang Zeng, Zhucun Xue, Qingdong He, Chengjie Wang, Yong Liu, Xiaobin Hu, and Shuicheng Yan. Ivebench: Modern benchmark suite for instruction-guided video editing assessment. InThe FourteenthInternational Conference on Learning Representations, 2026

work page 2026
[5]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024. URLhttps://arxiv.org/abs/2401.10935

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyu Zheng, Shuyan Zhou, Samuel Stevens, et al. Mind2web: Towards a generalist agent for the web. 2023

work page 2023
[7]

Assistgui: Task-oriented desktop graphical user interface automation, 2024

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. Assistgui: Task-oriented desktop graphical user interface automation, 2024

work page 2024
[8]

Gemini 3 technical report

Gemini Team, Google DeepMind. Gemini 3 technical report. Technical report, 2026

work page 2026
[9]

Ui-venus technical report: Building high-performance ui agents with rft, 2025

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, and Weiqiang Wang. Ui-venus technical report: Building high-performance ...

work page 2025
[10]

Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion

Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. Dreamstory: Open-domain story visualization by llm-guided multi-subject consistent diffusion. PAMI, 2025

work page 2025
[11]

Cogagent: A visual language model for gui agents, 2024

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2024

work page 2024
[12]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qinkai Zheng, Jiawei Liu, and Jianguo Zhu. Cogagent: A visual language model for gui agents. InCVPR, 2024

work page 2024
[13]

The dawn of gui agent: A preliminary case study with claude 3.5 computer use, 2024

Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use, 2024

work page 2024
[14]

Os agents: A survey on mllm-based agents for general computing devices use, 2025

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, a...

work page 2025
[15]

Filmaster: Bridging cinematic principles and generative ai for automated film generation, 2025

Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, and Xihui Liu. Filmaster: Bridging cinematic principles and generative ai for automated film generation, 2025. URLhttps: //arxiv.org/abs/2506.18899

work page arXiv 2025
[16]

Huang et al

Y. Huang et al. Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems. arXiv preprint arXiv:2409.01392, 2024

work page arXiv 2024
[17]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

work page 2024
[18]

VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models.IEEE TransactionsonPattern Analysis and Machine Intellige...

work page doi:10.1109/tpami.2025.3633890 2025
[19]

Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhut- dinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. InEuropean Conference on Computer Vision, pages 161–178. Springer, 2024

work page 2024
[20]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Lun- Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Lin...

work page doi:10.18653/v1/2024.acl-long.50 2024
[21]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive, and mcp-augmented environments, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive, and mcp-augmented environments, 2025

work page 2025
[22]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023. URLhttps://arxiv.org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Screenspot-pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Meng Ziyang, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: GUI grounding for professional high-resolution computer use. InWorkshop on Reasoning and Planning for Large Language Models, 2025

work page 2025
[24]

MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation

Mingcheng Li, Xiaolu Hou, Ziyang Liu, Dingkang Yang, Ziyun Qian, Jiawei Chen, Jinjie Wei, Yue Jiang, Qingyao Xu, and Lihua Zhang. MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation . InCVPR, 2025

work page 2025
[25]

Anim- director: A large multimodal model powered agent for controllable animation video generation

Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, and Min Zhang. Anim- director: A large multimodal model powered agent for controllable animation video generation. InSIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

work page 2024
[26]

Liang et al

G. Liang et al. Editval: Benchmarking diffusion based text-guided image editing methods. InICCV, 2023

work page 2023
[27]

VideoGUI: A benchmark for GUI automation from instructional videos

Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, and Mike Zheng Shou. VideoGUI: A benchmark for GUI automation from instructional videos. InThe Thirty-eight Conference on Neural InformationProcessing Systems Datasets and Benchmarks Track, 2024

work page 2024
[28]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern RecognitionConference, pages 19498–19508, 2025

work page 2025
[29]

Shotbench: Expert-level cinematic understanding in vision-language models

Hongbo Liu, Jingwen He, Yi Jinn, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, and Ziwei Liu. Shotbench: Expert-level cinematic understanding in vision-language models. InThe Thirty-ninth Annual Conference on Neural InformationProcessing Systems, 2025

work page 2025
[30]

ScaleCUA: Scaling open-source computer use agents with cross-platform data

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, and Wenhai Wang. ScaleCUA: Scaling open-source computer use agents with cross-platform data. InThe FourteenthInternationa...

work page 2026
[31]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

work page 2025
[32]

Omniparser for pure vision based gui agent, 2024

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent, 2024

work page 2024
[33]

Rodriguez, Montek Kalsi, Nicolas Chapados, M

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, and Sai Rajeswar. UI-vision: A desktop-centric GUI benchmark for visual perception and interaction. InForty-second International Conference on Machine Learni...

work page 2025
[34]

Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zho...

work page doi:10.18653/v1/2025.findings-acl.1158 2025
[35]

Gpt-5 series models.https://platform.openai.com, 2025

OpenAI. Gpt-5 series models.https://platform.openai.com, 2025. Accessed: 2026

work page 2025
[36]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

YujiaQin, YiningYe, JunjieFang, HaomingWang, ShihaoLiang, ShizuoTian, JundaZhang, JiahaoLi, YunxinLi, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Android in the wild: A large-scale dataset for android device control, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control, 2023

work page 2023
[38]

Androidworld: A dynamic benchmarking environment for autonomous agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William E Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Kenji Toyama, Robert James Berry, Divya Tyamagundlu, Timothy P Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents. In The Thirteenth Interna...

work page 2025
[39]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. Toolformer: Language models can teach themselves to use tools. In ICLR, 2023

work page 2023
[40]

Ui-tars-1.5.https://seed-tars.com/1.5, 2025

ByteDance Seed. Ui-tars-1.5.https://seed-tars.com/1.5, 2025

work page 2025
[41]

From pixels to ui actions: learning to follow instructions via graphical user interfaces

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to ui actions: learning to follow instructions via graphical user interfaces. InProceedings ofthe 37th InternationalConferenceon NeuralInformationProcessing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Cu...

work page 2023
[42]

Animaker: Multi-agent animated storytelling with mcts-driven clip generation

Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, and Min Zhang. Animaker: Multi-agent animated storytelling with mcts-driven clip generation. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11, 2025

work page 2025
[43]

Lave: Llm-powered agent assistance and language augmentation for video editing.Proceedings of the 29th International Conference on Intelligent User Interfaces, 2024

Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, and Raj Sodhi. Lave: Llm-powered agent assistance and language augmentation for video editing.Proceedings of the 29th International Conference on Intelligent User Interfaces, 2024

work page 2024
[44]

OpenCUA: Open foundations for computer-use agents

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Zheng Boyuan, LI PEIHANG, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Hu Jiarui, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Yipu Wang, Heng Wa...

work page 2025
[45]

Genartist: Multimodal llm as an agent for unified image generation and editing

Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing. volume 37, pages 128374–128395, 2024

work page 2024
[46]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022

work page 2022
[47]

Os-atlas: A foundation action model for generalist gui agents, 2024

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. Os-atlas: A foundation action model for generalist gui agents, 2024

work page 2024
[48]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Weng Lam Tam, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. 2024

work page 2024
[49]

Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[50]

Aguvis: Unified pure vision agents for autonomous GUI interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. InForty-second International Conference on Machine Learning, 2025. 29

work page 2025
[51]

Aguvis: Unified pure vision agents for autonomous gui interaction, 2025

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URLhttps://arxiv.org/abs/2412.0 4454

work page 2025
[52] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, Jinrui Ding, Xiandi Ma, Yuchen Xie, Peng Pei, Xunliang Cai, and Xipeng Qiu. Evocua: Evolving computer use agents via learning from scalable synthetic experience, 2026.
[53] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[54] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v, 2023.
[55] Pei Yang, Hai Ci, and Mike Zheng Shou. macOSWorld: A multilingual interactive benchmark for GUI agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[56] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In ICLR, 2023.
[57] Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents, 2025.
[58] Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, and Boxin Shi. Stage: Storyboard-anchored generation for cinematic multi-shot narrative, 2026.
[59] Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point, 2026.
[60] Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, et al. Videogen-of-thought: Step-by-step generating multi-shot video with minimal manual intervention. arXiv preprint arXiv:2412.02259, 2024.
[61] Mingzhe Zheng, Dingjie Song, et al. Cml-bench: A framework for evaluating and enhancing llm-powered movie scripts generation. arXiv preprint arXiv:2510.06231, 2025.
[62] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024.
[63] Cailin Zhuang, Ailin Huang, Yaoqi Hu, Jingwei Wu, Wei Cheng, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, et al. Vistorybench: Comprehensive benchmark suite for story visualization. arXiv preprint arXiv:2505.24862, 2025.
