{"work":{"id":"0bbcf263-a46d-4525-a438-11fce3316568","openalex_id":null,"doi":null,"arxiv_id":"2501.12326","raw_key":null,"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","authors":null,"authors_text":"Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian","year":2025,"venue":"cs.AI","abstract":"This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.","external_url":"https://arxiv.org/abs/2501.12326","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:00:27.169068+00:00","pith_arxiv_id":"2501.12326","created_at":"2026-05-09T06:10:36.970611+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","render_title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents"},"hub":{"state":{"work_id":"0bbcf263-a46d-4525-a438-11fce3316568","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":86,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2025-03-27T15:39:30+00:00","last_pith_cited_at":"2026-05-20T17:51:05+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-09T11:14:49.500772+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":16},{"context_role":"baseline","n":12},{"context_role":"dataset","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":15},{"context_polarity":"baseline","n":12},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T14:31:28.856526+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":19},{"title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","work_id":"16e00be2-1641-403c-8835-c50a6628f483","shared_citers":17},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":16},{"title":"UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning","work_id":"422846c6-e6e2-47c9-9065-85cc09c07cd6","shared_citers":16},{"title":"java21\" shown on the file path of the file manager. Text 1 between text Click once at the position before","work_id":"5345a78f-68f1-4e8c-b240-30b0bb230de3","shared_citers":13},{"title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","work_id":"c5116d19-d3d3-40fd-9620-f7489812a9ba","shared_citers":12},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":12},{"title":"Gui-r1: A generalist r1-style vision-language action model for gui agents","work_id":"5e82d316-7129-4f55-9c00-0d7fcbcea139","shared_citers":11},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning","work_id":"7f193350-1921-4b8c-83fe-ac769879cb5b","shared_citers":10},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":10},{"title":"Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners","work_id":"a7d6f6c7-8a57-4a83-9c4e-892cdc190280","shared_citers":10},{"title":"Agent S2: A compositional generalist-specialist framework for computer use agents","work_id":"56466266-2822-44dd-a6e9-6036285f881e","shared_citers":9},{"title":"Aguvis: Unified pure vision agents for autonomous gui interaction","work_id":"64030c49-afa6-4cf0-b64c-831d19d6674e","shared_citers":9},{"title":"Mobile-agent-v3: Fundamental agents for gui automation","work_id":"f8bf23c5-08e8-4ede-9e97-9d738493c422","shared_citers":9},{"title":"Opencua: Open foundations for computer-use agents","work_id":"1648be89-5611-4615-8ef6-f051a980641c","shared_citers":9},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":9},{"title":"Seed1.5-VL Technical Report","work_id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","shared_citers":9},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":9},{"title":"arXiv preprint arXiv:2410.05243 , year=","work_id":"9def1724-6fd2-4d5b-8339-4c1ee76e62f8","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"GUI-Actor: Coordinate-free visual grounding for GUI agents","work_id":"89d4aef6-1376-44bf-8327-9115705e719c","shared_citers":8},{"title":"Scaling computer-use grounding via user interface decomposition and synthesis","work_id":"0114b464-4839-4413-bd56-d15e0adf532c","shared_citers":8},{"title":"Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 1(2):3","work_id":"4637c89b-db94-4e6f-8bf2-030bea2fdd6e","shared_citers":8}],"time_series":[{"n":4,"year":2025},{"n":41,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T14:41:36.353855+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T14:31:33.054658+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","claims":[{"claim_text":"This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld bench","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks UI-TARS: Pioneering Automated GUI Interaction with Native Agents because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T14:41:32.070350+00:00"}},"summary":{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","claims":[{"claim_text":"This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld bench","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks UI-TARS: Pioneering Automated GUI Interaction with Native Agents because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":19},{"title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","work_id":"16e00be2-1641-403c-8835-c50a6628f483","shared_citers":17},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":16},{"title":"UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning","work_id":"422846c6-e6e2-47c9-9065-85cc09c07cd6","shared_citers":16},{"title":"java21\" shown on the file path of the file manager. Text 1 between text Click once at the position before","work_id":"5345a78f-68f1-4e8c-b240-30b0bb230de3","shared_citers":13},{"title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","work_id":"c5116d19-d3d3-40fd-9620-f7489812a9ba","shared_citers":12},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":12},{"title":"Gui-r1: A generalist r1-style vision-language action model for gui agents","work_id":"5e82d316-7129-4f55-9c00-0d7fcbcea139","shared_citers":11},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":10},{"title":"Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning","work_id":"7f193350-1921-4b8c-83fe-ac769879cb5b","shared_citers":10},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":10},{"title":"Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners","work_id":"a7d6f6c7-8a57-4a83-9c4e-892cdc190280","shared_citers":10},{"title":"Agent S2: A compositional generalist-specialist framework for computer use agents","work_id":"56466266-2822-44dd-a6e9-6036285f881e","shared_citers":9},{"title":"Aguvis: Unified pure vision agents for autonomous gui interaction","work_id":"64030c49-afa6-4cf0-b64c-831d19d6674e","shared_citers":9},{"title":"Mobile-agent-v3: Fundamental agents for gui automation","work_id":"f8bf23c5-08e8-4ede-9e97-9d738493c422","shared_citers":9},{"title":"Opencua: Open foundations for computer-use agents","work_id":"1648be89-5611-4615-8ef6-f051a980641c","shared_citers":9},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":9},{"title":"Seed1.5-VL Technical Report","work_id":"0e8e025f-ca1e-49cc-aee2-33f3a0201f3c","shared_citers":9},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":9},{"title":"arXiv preprint arXiv:2410.05243 , year=","work_id":"9def1724-6fd2-4d5b-8339-4c1ee76e62f8","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":8},{"title":"GUI-Actor: Coordinate-free visual grounding for GUI agents","work_id":"89d4aef6-1376-44bf-8327-9115705e719c","shared_citers":8},{"title":"Scaling computer-use grounding via user interface decomposition and synthesis","work_id":"0114b464-4839-4413-bd56-d15e0adf532c","shared_citers":8},{"title":"Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 1(2):3","work_id":"4637c89b-db94-4e6f-8bf2-030bea2fdd6e","shared_citers":8}],"time_series":[{"n":4,"year":2025},{"n":41,"year":2026}],"dependency_candidates":[]},"authors":[]}}