{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:D7L6RRL4ZNI4SWIVTKUBGYSMR3","short_pith_number":"pith:D7L6RRL4","schema_version":"1.0","canonical_sha256":"1fd7e8c57ccb51c959159aa813624c8ecbd2b1d5da4cfbd037db48d7752e4a17","source":{"kind":"arxiv","id":"2504.10458","version":4},"attestation_state":"computed","paper":{"title":"GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents","license":"http://creativecommons.org/licenses/by/4.0/","headline":"GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples.","cross_cats":["cs.CL","cs.HC"],"primary_cat":"cs.CV","authors_text":"Jiaming Li, Longze Chen, Lu Wang, Run Luo, Wanwei He, Xiaobo Xia","submitted_at":"2025-04-14T17:45:54Z","abstract_excerpt":"Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.10458","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-04-14T17:45:54Z","cross_cats_sorted":["cs.CL","cs.HC"],"title_canon_sha256":"9f87fd9acac35043bf980f040ecdea7c52fff6a27a16621dbdde637356fb3443","abstract_canon_sha256":"766fa50426543522f6bbe166c68311d89e93e4581ff4e5aef84c10e3f04d4d16"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.852274Z","signature_b64":"hJE1/ugY4XvpS7aOnf2HEaXICBwr4rxAFYeQp6mTkc7XjGnUuTVk7833qXhVaXW4jvcy6/x0zfJv8n0765JLCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"1fd7e8c57ccb51c959159aa813624c8ecbd2b1d5da4cfbd037db48d7752e4a17","last_reissued_at":"2026-05-17T23:38:53.851729Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.851729Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents","license":"http://creativecommons.org/licenses/by/4.0/","headline":"GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples.","cross_cats":["cs.CL","cs.HC"],"primary_cat":"cs.CV","authors_text":"Jiaming Li, Longze Chen, Lu Wang, Run Luo, Wanwei He, Xiaobo Xia","submitted_at":"2025-04-14T17:45:54Z","abstract_excerpt":"Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web).","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a small set of carefully curated high-quality data across platforms combined with unified action space rule modeling is sufficient for generalization to unseen interfaces without the need for extensive supervised fine-tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"cf01c443ccfd6e1f5335fe0d003c0b2ba0aa97cc37fbbbfd5ff174b1937d1577"},"source":{"id":"2504.10458","kind":"arxiv","version":4},"verdict":{"id":"91e5f0c7-d065-4c3d-8379-df7b8e3ea5d0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:06:38.256306Z","strongest_claim":"GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web).","one_line_summary":"GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a small set of carefully curated high-quality data across platforms combined with unified action space rule modeling is sufficient for generalization to unseen interfaces without the need for extensive supervised fine-tuning.","pith_extraction_headline":"GUI-R1 applies reinforcement learning to vision-language models so they act as GUI agents after training on only 3,000 examples."},"references":{"count":30,"sample":[{"doi":"","year":2024,"title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","work_id":"16e00be2-1641-403c-8835-c50a6628f483","ref_index":1,"cited_arxiv_id":"2410.23218","is_internal_anchor":true},{"doi":"","year":2025,"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","work_id":"0bbcf263-a46d-4525-a438-11fce3316568","ref_index":2,"cited_arxiv_id":"2501.12326","is_internal_anchor":true},{"doi":"","year":2024,"title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","work_id":"8fe50425-9d6d-4080-bd43-51b3d0d0e5f6","ref_index":3,"cited_arxiv_id":"2401.10935","is_internal_anchor":true},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":4,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2025,"title":"Visual-RFT: Visual Reinforcement Fine-Tuning","work_id":"872f09b5-998d-4a66-9a2f-f7ec2407cd62","ref_index":5,"cited_arxiv_id":"2503.01785","is_internal_anchor":true}],"resolved_work":30,"snapshot_sha256":"0fe3a6fd4793d98539a4dceaf15e08a54a4132617be30eb0262e63ca32eb0836","internal_anchors":12},"formal_canon":{"evidence_count":3,"snapshot_sha256":"ab6ec134e358be8cefc55e07c62998a98c88f8da8fcf46ec1a93f7e7aeb8d378"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.10458","created_at":"2026-05-17T23:38:53.851831+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.10458v4","created_at":"2026-05-17T23:38:53.851831+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.10458","created_at":"2026-05-17T23:38:53.851831+00:00"},{"alias_kind":"pith_short_12","alias_value":"D7L6RRL4ZNI4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"D7L6RRL4ZNI4SWIV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"D7L6RRL4","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":42,"internal_anchor_count":42,"sample":[{"citing_arxiv_id":"2605.10347","citing_title":"How Mobile World Model Guides GUI Agents?","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06534","citing_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15963","citing_title":"PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19538","citing_title":"CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16883","citing_title":"SE-GA: Memory-Augmented Self-Evolution for GUI Agents","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27859","citing_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2506.09373","citing_title":"LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2506.20332","citing_title":"Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2508.19679","citing_title":"InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2509.07553","citing_title":"VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2504.14239","citing_title":"InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21982","citing_title":"RISK: A Framework for GUI Agents in E-commerce Risk Management","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2511.12034","citing_title":"Calibrated Multimodal Representation Learning with Missing Modalities","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2507.05791","citing_title":"GTA1: GUI Test-time Scaling Agent","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03438","citing_title":"Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05044","citing_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05295","citing_title":"WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2603.23964","citing_title":"From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments","ref_index":213,"is_internal_anchor":true},{"citing_arxiv_id":"2507.21046","citing_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","ref_index":245,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12549","citing_title":"What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12481","citing_title":"ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27859","citing_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00642","citing_title":"Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding","ref_index":23,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3","json":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3.json","graph_json":"https://pith.science/api/pith-number/D7L6RRL4ZNI4SWIVTKUBGYSMR3/graph.json","events_json":"https://pith.science/api/pith-number/D7L6RRL4ZNI4SWIVTKUBGYSMR3/events.json","paper":"https://pith.science/paper/D7L6RRL4"},"agent_actions":{"view_html":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3","download_json":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3.json","view_paper":"https://pith.science/paper/D7L6RRL4","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.10458&json=true","fetch_graph":"https://pith.science/api/pith-number/D7L6RRL4ZNI4SWIVTKUBGYSMR3/graph.json","fetch_events":"https://pith.science/api/pith-number/D7L6RRL4ZNI4SWIVTKUBGYSMR3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3/action/storage_attestation","attest_author":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3/action/author_attestation","sign_citation":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3/action/citation_signature","submit_replication":"https://pith.science/pith/D7L6RRL4ZNI4SWIVTKUBGYSMR3/action/replication_record"}},"created_at":"2026-05-17T23:38:53.851831+00:00","updated_at":"2026-05-17T23:38:53.851831+00:00"}