{"work":{"id":"4637c89b-db94-4e6f-8bf2-030bea2fdd6e","openalex_id":null,"doi":null,"arxiv_id":"2503.21620","raw_key":null,"title":"UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning","authors":null,"authors_text":"Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang","year":2025,"venue":"cs.AI","abstract":"The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e. Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code website: https://github.com/lll6gg/UI-R1.","external_url":"https://arxiv.org/abs/2503.21620","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-22T01:05:52.035531+00:00","pith_arxiv_id":"2503.21620","created_at":"2026-05-09T06:40:40.627703+00:00","updated_at":"2026-05-22T01:05:52.035531+00:00","title_quality_ok":true,"display_title":"UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning","render_title":"UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning"},"hub":{"state":{"work_id":"4637c89b-db94-4e6f-8bf2-030bea2fdd6e","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":26,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2025-03-27T17:59:51+00:00","last_pith_cited_at":"2026-05-19T08:38:44+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-30T14:31:09.238494+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":7},{"context_role":"dataset","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":7},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}