hub Canonical reference

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu · 2025 · cs.AI · arXiv 2509.02544

Canonical reference. 80% of citing Pith papers cite this work as background.

57 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 57 citing papers arXiv PDF

abstract

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite-roughly 60% of human-level performance-and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 baseline 2

citation-polarity summary

background 12 baseline 2 unclear 1

representative citing papers

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

GPTNT benchmark demonstrates that state-of-the-art multimodal models cannot perform real-time collaborative bomb defusal in Keep Talking and Nobody Explodes, unlike human players.

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

AdaPlanBench introduces a multi-turn benchmark where LLM agents must adapt plans under progressively revealed dual constraints, with top models reaching only 67.75% accuracy.

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

cs.AI · 2026-05-25 · conditional · novelty 7.0

CUA-Gym generates 32,112 verified RLVR tuples across 110 mock environments, enabling trained models to reach 62.1% and 72.6% on OSWorld-Verified while transferring to WebArena.

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

Faithful Mobile GUI Agents with Guided Advantage Estimator

cs.AI · 2026-05-02 · unverdicted · novelty 7.0

Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

cs.AI · 2026-04-15 · unverdicted · novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

cs.SE · 2026-03-14 · unverdicted · novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

cs.AI · 2026-06-29 · unverdicted · novelty 6.0

GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

AsyncWebRL reports up to 2.9x training speedup and new SOTA on WebGym OOD split via async overlap plus constant normalizer in GRPO, with largest gains on harder tasks.

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.

citing papers explorer

Showing 50 of 57 citing papers.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 18 · internal anchor
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents cs.CR · 2026-01-26 · unverdicted · none · ref 25 · internal anchor
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Agent-Computer Observation Interfaces Enable Dynamic Computer Use cs.AI · 2026-06-28 · conditional · none · ref 17 · internal anchor
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes cs.AI · 2026-06-26 · unverdicted · none · ref 6 · internal anchor
GPTNT benchmark demonstrates that state-of-the-art multimodal models cannot perform real-time collaborative bomb defusal in Keep Talking and Nobody Explodes, unlike human players.
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields cs.AI · 2026-06-09 · unverdicted · none · ref 18 · internal anchor
Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks cs.AI · 2026-06-08 · unverdicted · none · ref 67 · internal anchor
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions cs.AI · 2026-06-04 · unverdicted · none · ref 10 · internal anchor
DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints cs.CL · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
AdaPlanBench introduces a multi-turn benchmark where LLM agents must adapt plans under progressively revealed dual constraints, with top models reaching only 67.75% accuracy.
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents cs.AI · 2026-05-25 · conditional · none · ref 25 · internal anchor
CUA-Gym generates 32,112 verified RLVR tuples across 110 mock environments, enabling trained models to reach 62.1% and 72.6% on OSWorld-Verified while transferring to WebArena.
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation cs.AI · 2026-05-24 · unverdicted · none · ref 17 · internal anchor
PANDO introduces an online skill-distillation method with a structured library, reflection, demotion, routing, compression, and cache-aware prompting that reaches 58.3% success on 910 VisualWebArena tasks using 58-61% fewer tokens than prior methods.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 110 · 2 links · internal anchor
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments cs.CV · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 34 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 20 · internal anchor
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Faithful Mobile GUI Agents with Guided Advantage Estimator cs.AI · 2026-05-02 · unverdicted · none · ref 15 · internal anchor
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management cs.AI · 2026-04-15 · unverdicted · none · ref 46 · internal anchor
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 51 · internal anchor
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging cs.SE · 2026-03-14 · unverdicted · none · ref 30 · internal anchor
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding cs.CV · 2026-06-29 · unverdicted · none · ref 81 · internal anchor
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots cs.AI · 2026-06-29 · unverdicted · none · ref 3 · internal anchor
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks cs.AI · 2026-06-28 · unverdicted · none · ref 82 · internal anchor
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents cs.LG · 2026-06-04 · unverdicted · none · ref 15 · internal anchor
AsyncWebRL reports up to 2.9x training speedup and new SOTA on WebGym OOD split via async overlap plus constant normalizer in GRPO, with largest gains on harder tasks.
Demo2Tutorial: From Human Experience to Multimodal Software Tutorials cs.CV · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
Demo2Tutorial distills human screen recordings into hierarchical image-text tutorials that outperform human-authored ones on a documentation-derived benchmark and improve downstream human task speed and GUI-agent planning.
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents cs.LG · 2026-06-01 · unverdicted · none · ref 41 · internal anchor
OpenWebRL trains a 4B visual web agent with online RL on live sites using 0.4K init trajectories and 2.2K RL tasks to reach 67% success on Online-Mind2Web and 64% on DeepShop, outperforming prior open agents.
What to Format and How: A Benchmark and Workflow Approach for Document Formatting cs.CL · 2026-06-01 · unverdicted · none · ref 35 · internal anchor
Presents DocFormBench benchmark and DocFormFlow workflow for content-aware LLM document formatting, claiming higher accuracy and lower token use via decoupled localization and modification.
PhoneWorld: Scaling Phone-Use Agent Environments cs.CL · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
PhoneWorld is a pipeline that converts real mobile trajectories into scalable controllable environments, yielding large gains on four benchmarks when used to supplement training data.
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents cs.LG · 2026-05-27 · unverdicted · none · ref 36 · internal anchor
LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems cs.CL · 2026-05-27 · unverdicted · none · ref 10 · internal anchor
MemTrace turns LLM memory operations into executable evolution graphs for error tracing, builds a benchmark across systems like RAG and Mem0, and uses attribution to optimize prompts, improving task performance by up to 7.62%.
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration cs.AI · 2026-05-26 · unverdicted · none · ref 17 · internal anchor
MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.
Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay q-bio.NC · 2026-05-19 · unverdicted · none · ref 10 · internal anchor
VLMs and LAMs outperform RL baselines in voxel-wise brain encoding during gameplay, with LAMs showing prompt-asymmetric organization (27% unique action vs -5% unique reasoning) strongest in frontal-motor cortex.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction cs.CL · 2026-05-11 · unverdicted · none · ref 73 · 2 links · internal anchor
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents cs.CR · 2026-04-28 · unverdicted · none · ref 40 · internal anchor
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation cs.CL · 2026-04-23 · conditional · none · ref 61 · internal anchor
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents cs.HC · 2026-04-22 · unverdicted · none · ref 47 · internal anchor
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection cs.CR · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces cs.AI · 2026-03-05 · unverdicted · none · ref 21 · internal anchor
WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management cs.AI · 2025-12-11 · conditional · none · ref 39 · internal anchor
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning cs.AI · 2026-06-08 · unverdicted · none · ref 24 · internal anchor
AliyunConsoleAgent-32B reaches 63.52% success on a 278-task cloud console benchmark, closing to 1.82pp of frontier models at 92% lower cost via SFT distillation and GRPO RL.
StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents cs.AI · 2026-06-05 · unverdicted · none · ref 28 · internal anchor
StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting 3.2% relative gain in online RL success and 1.8% in judgment accuracy on AndroidWorld and OGRBench.
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision cs.CV · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 226 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents cs.LG · 2026-05-16 · unverdicted · none · ref 36 · internal anchor
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections cs.CR · 2026-05-14 · unverdicted · none · ref 58 · internal anchor
WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse cs.CV · 2026-05-11 · unverdicted · none · ref 176 · 2 links · internal anchor
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization cs.AI · 2026-05-09 · unverdicted · none · ref 18 · 2 links · internal anchor
An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 70 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 71 · internal anchor
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents cs.AI · 2026-04-19 · unverdicted · none · ref 24 · internal anchor
HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer