OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
hub Canonical reference
Gpt-4v in wonderland: Large multi- modal models for zero-shot smartphone gui navigation
Canonical reference. 83% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
Crepe introduces a graph-query technique in a no-code Android app for flexible collection of targeted mobile screen data with privacy and consent controls.
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.
A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
citing papers explorer
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.