GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
Pith reviewed 2026-05-15 02:06 UTC · model grok-4.3
The pith
GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUI-R1 is the first reinforcement learning framework that enhances the GUI capabilities of large vision-language models through unified action space rule modeling. By applying Group Relative Policy Optimization on a small curated dataset of 3K examples collected across five operating systems, the method surpasses previous supervised approaches such as OS-Atlas that used 13M examples, delivering higher success rates on eight benchmarks covering mobile, desktop, and web platforms.
What carries the argument
Unified action space rule modeling inside a Group Relative Policy Optimization loop that updates the vision-language model from a few thousand multi-platform demonstrations.
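The group-relative step that gives GRPO its name can be sketched in a few lines. This is a minimal illustration of the advantage normalization described in the DeepSeekMath work the paper cites, not the authors' training code. The function name, the epsilon term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to the other samples in its group.

    Assumption: population standard deviation plus a small epsilon; the
    paper's exact normalization is not given in this review.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rule-based rewards for four responses sampled for one
# screenshot-and-instruction prompt (not data from the paper).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # responses above the group mean get positive advantages
```

Because the baseline is the group mean, no separate value network is needed; the policy update then weights each response's tokens by its advantage.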
If this is right
- GUI agents can be trained for new platforms with orders of magnitude less labeled data.
- Policy optimization can replace or augment supervised fine-tuning for high-level interface tasks.
- A single model can handle mobile, desktop, and web environments after exposure to a modest shared dataset.
- Real-world deployment of GUI agents becomes feasible with smaller, platform-agnostic training corpora.
Where Pith is reading between the lines
- The same reinforcement recipe might extend to other embodied control domains that currently rely on massive supervised datasets.
- If the unified action space proves robust, future work could test whether it supports zero-shot transfer between entirely different interface styles.
- Developers might experiment with mixing the 3K seed set with synthetic trajectories generated by the model itself to further reduce human curation effort.
- The reported gains suggest that reasoning-style reinforcement loops can improve perception-action loops even when the input is a screenshot rather than text.
Load-bearing premise
A small set of carefully chosen high-quality examples plus a shared action vocabulary is enough for the model to handle new interfaces without large-scale supervised training.
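To make the idea of a shared action vocabulary with rule-based scoring concrete, here is a hypothetical sketch. The paper's actual action schema and reward rules are not specified in this review, so the `Action` fields, the `rule_reward` function, and the pass/fail rules below are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Action:
    """Hypothetical platform-agnostic action record (not the paper's schema)."""
    kind: str                                     # e.g. "click", "type", "scroll"
    point: Optional[Tuple[float, float]] = None   # (x, y) in pixels for pointer actions
    text: Optional[str] = None                    # payload for text-entry actions

Box = Tuple[float, float, float, float]           # (x1, y1, x2, y2)

def rule_reward(pred: Action, gold_kind: str,
                gold_box: Optional[Box] = None,
                gold_text: Optional[str] = None) -> float:
    """Return 1.0 only if every applicable rule holds, else 0.0."""
    if pred.kind != gold_kind:
        return 0.0
    if gold_box is not None:
        if pred.point is None:
            return 0.0
        x, y = pred.point
        x1, y1, x2, y2 = gold_box
        if not (x1 <= x <= x2 and y1 <= y <= y2):
            return 0.0
    if gold_text is not None:
        # Assumed rule: case- and whitespace-insensitive text match.
        if (pred.text or "").strip().lower() != gold_text.strip().lower():
            return 0.0
    return 1.0

inside = rule_reward(Action("click", point=(50, 50)), "click", gold_box=(0, 0, 100, 100))
outside = rule_reward(Action("click", point=(150, 50)), "click", gold_box=(0, 0, 100, 100))
print(inside, outside)
```

The same record type covers mobile, desktop, and web steps, which is what lets one verifiable reward function score trajectories from every platform.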
What would settle it
A new benchmark interface where GUI-R1 accuracy falls below the accuracy of a supervised model trained on the same 3K examples or where the performance gap to OS-Atlas disappears.
Original abstract
Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, macOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GUI-R1, the first R1-style reinforcement learning framework for GUI agents. It trains LVLMs via Group Relative Policy Optimization (GRPO) on a small curated dataset of 3K high-quality trajectories spanning Windows, Linux, macOS, Android, and Web platforms, using a unified action space rule model. The central claim is that this yields superior performance over prior SOTA methods such as OS-Atlas (trained on 13M examples) across eight benchmarks on mobile, desktop, and web, using only 0.02% of the data volume.
Significance. If the performance delta is reproducible and the RL contribution is isolated, the result would indicate that unified-action-space GRPO on carefully curated small data can outperform large-scale SFT baselines for GUI agents. This would support a shift toward data-efficient RL paradigms for real-world interface agents and reduce reliance on massive supervised datasets.
major comments (3)
- [Experiments] Experiments section: no ablation applies standard supervised fine-tuning to the identical 3K curated trajectories and unified action-space rules. Without this control, the headline claim that GRPO (rather than curation quality or base-model strength) drives the gains over OS-Atlas cannot be isolated and remains load-bearing for the central argument.
- [Results] Results and evaluation sections: benchmark definitions, exact task splits, statistical significance tests (e.g., confidence intervals or p-values), and precise baseline re-implementation details are not provided. This makes it impossible to verify the reported superiority across the eight benchmarks.
- [Method] Data curation and method sections: the process for selecting the 3K examples and formalizing the unified action-space rules is described only at high level. More concrete specification of selection criteria and rule encoding is required to assess reproducibility and rule out selection effects.
minor comments (1)
- [Abstract] Abstract and introduction: the 0.02% data claim should be accompanied by a precise citation or table entry for the 13M figure used by OS-Atlas.
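The 0.02% figure can be checked directly from the two dataset sizes quoted in the abstract. The snippet below is a quick arithmetic check; the variable names are ours, and the counts are the rounded 3K and 13M values from the abstract rather than exact dataset sizes.

```python
gui_r1_examples = 3_000          # "3K" as quoted in the abstract
os_atlas_examples = 13_000_000   # "13M" as quoted in the abstract

percent = gui_r1_examples / os_atlas_examples * 100
print(f"{percent:.3f}%")         # prints "0.023%"
print(round(percent, 2))         # prints 0.02
```

The ratio is about 0.023%, so the abstract's 0.02% is a rounding of the quoted counts; a table entry with the exact OS-Atlas figure would still be needed to confirm it.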
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements, which we believe will strengthen the clarity and rigor of our claims regarding the effectiveness of GRPO on curated small-scale data for GUI agents.
Point-by-point responses
-
Referee: [Experiments] Experiments section: no ablation applies standard supervised fine-tuning to the identical 3K curated trajectories and unified action-space rules. Without this control, the headline claim that GRPO (rather than curation quality or base-model strength) drives the gains over OS-Atlas cannot be isolated and remains load-bearing for the central argument.
Authors: We agree that directly comparing GRPO to standard SFT on the exact same 3K trajectories is essential to isolate the RL contribution. While our primary comparisons were against large-scale SFT baselines like OS-Atlas, we will add this ablation experiment in the revised manuscript. The new results will show how much of the performance difference is attributable to the policy optimization step under identical data and action-space conditions.
Revision: yes
-
Referee: [Results] Results and evaluation sections: benchmark definitions, exact task splits, statistical significance tests (e.g., confidence intervals or p-values), and precise baseline re-implementation details are not provided. This makes it impossible to verify the reported superiority across the eight benchmarks.
Authors: We will expand the results and evaluation sections in the revision to include detailed benchmark definitions, exact task splits, re-implementation specifics for all baselines, and statistical measures such as confidence intervals. These additions will enable full verification and reproducibility of the reported performance gains across the eight benchmarks.
Revision: yes
-
Referee: [Method] Data curation and method sections: the process for selecting the 3K examples and formalizing the unified action-space rules is described only at high level. More concrete specification of selection criteria and rule encoding is required to assess reproducibility and rule out selection effects.
Authors: We acknowledge that the current description of data curation and unified action-space rule formalization is high-level. In the revised manuscript, we will provide concrete details on the selection criteria for the 3K trajectories (e.g., quality filters and platform coverage) and the precise encoding of the unified action-space rules to support reproducibility and address potential selection effects.
Revision: yes
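One common way to report the confidence intervals discussed in the second exchange is a percentile bootstrap over per-task outcomes. The sketch below is only an illustration of that technique: the function name, resample count, seed, and the 0/1 outcome list are hypothetical and are not results from the paper.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 task outcomes."""
    rng = random.Random(seed)    # fixed seed so the interval is repeatable
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's success rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Hypothetical per-task outcomes (1 = success) for one benchmark.
outcomes = [1] * 60 + [0] * 40
low, high = bootstrap_ci(outcomes)
print(f"success rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting an interval like this for each benchmark and each baseline would let readers judge whether a reported gap exceeds resampling noise.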
Circularity Check
No significant circularity; empirical application of external RL methods
Full rationale
The paper applies GRPO and RFT techniques cited from external DeepSeek-R1 work to a curated 3K-example GUI dataset with unified action space. No mathematical derivation chain exists that reduces by construction to self-defined inputs, fitted parameters renamed as predictions, or load-bearing self-citations. The performance claims rest on benchmark comparisons rather than any self-referential reduction. This is a standard empirical proposal with independent content from the cited external RL algorithms.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement fine-tuning with GRPO can efficiently improve LVLM action prediction on GUI screenshots without large-scale supervised data.
Forward citations
Cited by 24 Pith papers
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
-
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
-
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
How Mobile World Model Guides GUI Agents?
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
-
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
-
BAMI: Training-Free Bias Mitigation in GUI Grounding
BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
-
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
-
AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning
ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
-
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
-
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
-
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.