Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.
super hub Mixed citations
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Mixed citation behavior. Most common role is background (50%).
abstract
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld bench
authors
co-cited works
representative citing papers
MacArena is a new macOS GUI agent benchmark with 421 tasks showing inverted model rankings and a 26% performance drop on native tasks versus ported ones.
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.
Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.
DPAgent is an agentic framework that detects 90.98% of AI-groomed deceptive samples and repairs 77% of deceptive interfaces while exploring 80% of pattern types with 10% of baseline page visits.
Introduces LivingScreen benchmark for living-screen-native GUI agents on short-video platforms; frontier models fail to match human cost-accuracy due to over- and under-observation.
HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.
CutVerse benchmark evaluates GUI agents on 186 complex media post-production tasks in seven apps and reports 36% success rate for existing models.
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
citing papers explorer
-
It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents
Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.
-
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
MacArena is a new macOS GUI agent benchmark with 421 tasks showing inverted model rankings and a 26% performance drop on native tasks versus ported ones.
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering
SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.
-
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.
-
DPAgent-in-the-Middle: Agentic Defense and Repair Against AI-Groomed Deceptive Patterns
DPAgent is an agentic framework that detects 90.98% of AI-groomed deceptive samples and repairs 77% of deceptive interfaces while exploring 80% of pattern types with 10% of baseline page visits.
-
Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
Introduces LivingScreen benchmark for living-screen-native GUI agents on short-video platforms; frontier models fail to match human cost-accuracy due to over- and under-observation.
-
HLL: Can Agents Cross Humanity's Last Line of Verification?
HLL is a new benchmark that evaluates eight frontier multimodal agents on closed-loop interactive CAPTCHA solving, showing sharp performance drops under realism stressors and trace validation.
-
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
-
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
-
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.
-
CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
CutVerse benchmark evaluates GUI agents on 186 complex media post-production tasks in seven apps and reports 36% success rate for existing models.
-
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
-
Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces
UI traces of actions and timings from LLM browser agents enable identification of the underlying model with up to 96% F1 across 14 models and multiple tasks.
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills packages multimodal procedural knowledge into state-conditioned skills with text, state cards, and multi-view keyframes, generated from public trajectories via an agentic process and used at inference via branch-loaded inspection to improve visual agents on GUI and game benchmarks.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
-
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
-
FlowEval: Reference-based Evaluation of Generated User Interfaces
FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
-
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
-
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
-
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
AI agents on OSWorld take 2.7-4.3 times more steps than human trajectories, with latency rising sharply due to repeated large model calls for planning and reflection.
-
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
-
From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing
PropGen automates property generation for Android app testing via LLM synthesis from guided exploration and feedback refinement, yielding 912 valid properties and 25 previously unknown bugs across 12 apps.
-
Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents
Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.
-
One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
-
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
-
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
-
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials
Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.
-
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
OpenWebRL trains a 4B visual web agent with online RL on live sites using 0.4K init trajectories and 2.2K RL tasks to reach 67% success on Online-Mind2Web and 64% on DeepShop, outperforming prior open agents.
-
STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
STaR-KV is a training-free KV cache compression framework for GUI VLMs that uses subspace-aware scoring, temporal stability discounts, and entropy-based temperature adaptation to outperform prior methods at matched budgets while reducing peak memory by ~40% at 20% cache size.
-
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing
GUITestScape supplies an interactive benchmark for exploratory GUI testing and GUIJudge supplies an open-set process-aware evaluator that outperforms baselines on MLLM agents.
-
PhoneWorld: Scaling Phone-Use Agent Environments
PhoneWorld is a pipeline that converts real mobile trajectories into scalable controllable environments, yielding large gains on four benchmarks when used to supplement training data.
-
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.
-
Test-Time Deep Thinking to Explore Implicit Rules
TTExplore trains a 7B thinker via task-score RL to infer implicit rules at test time, raising agent success by 14-19 points on five embodied tasks.
-
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
-
Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay
VLMs and LAMs outperform RL baselines in voxel-wise brain encoding during gameplay, with LAMs showing prompt-asymmetric organization (27% unique action vs -5% unique reasoning) strongest in frontal-motor cortex.
-
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.
-
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
CA2: Code-Aware Agent for Automated Game Testing
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.