hub Canonical reference

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang · 2024 · cs.CL · arXiv 2401.16158

Canonical reference. 91% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 91% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

Mobile device agent based on Multimodal Large Language Models (MLLM) is becoming a popular application. In this paper, we introduce Mobile-Agent, an autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual perception tools to accurately identify and locate both the visual and textual elements within the app's front-end interface. Based on the perceived vision context, it then autonomously plans and decomposes the complex operation task, and navigates the mobile Apps through operations step by step. Different from previous solutions that rely on XML files of Apps or mobile system metadata, Mobile-Agent allows for greater adaptability across diverse mobile operating environments in a vision-centric way, thereby eliminating the necessity for system-specific customizations. To assess the performance of Mobile-Agent, we introduced Mobile-Eval, a benchmark for evaluating mobile device operations. Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent. The experimental results indicate that Mobile-Agent achieved remarkable accuracy and completion rates. Even with challenging instructions, such as multi-app operations, Mobile-Agent can still complete the requirements. Code and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 1

citation-polarity summary

background 10 baseline 1

representative citing papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

PreAct compiles successful agent executions into verifiable state-machine programs for 8.5-13x faster replay on repeated tasks, with an independent evaluator check before storing each program.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Introduces LivingScreen benchmark for living-screen-native GUI agents on short-video platforms; frontier models fail to match human cost-accuracy due to over- and under-observation.

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

WinDeskGround is a parametrically generated benchmark of 1,356 instruction-target pairs that reveals accuracy declines in state-of-the-art MLLMs under partial occlusion in multi-window GUI settings.

Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

cs.HC · 2026-04-28 · unverdicted · novelty 7.0

VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.

AgenTEE: Confidential LLM Agent Execution on Edge Devices

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

cs.AI · 2026-02-24 · unverdicted · novelty 7.0

The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.

What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States

cs.CV · 2026-06-30 · unverdicted · novelty 6.0 · 2 refs

Introduces Active Task Driving Memory (ATMem) and STR-GRPO to move GUI agents from passive record storage to actively maintained task states, tested on a new mobile benchmark with progress and scope-aware metrics.

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

cs.CR · 2026-06-10 · unverdicted · novelty 6.0

CAPED reduces incidental visual privacy leakage in mobile GUI agents from 0.766 to 0.268 on seeded AndroidWorld tasks by selectively exposing only task-relevant screen content.

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

AQuaUI uses adaptive quadtrees to cut visual tokens in GUI-agent LMMs by up to 29.52% at inference time while retaining 99.06% of full-token accuracy on grounding and navigation benchmarks.

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

ICRL uses joint RL training of solver and critic with distribution-calibration re-weighting and role-wise advantage estimation to internalize critique into unassisted LLM performance, yielding 6.4-point gains on agentic tasks and 7.0 on math reasoning with Qwen3 models.

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

CAVI framework uses character-guided token pruning, orthogonal feature modulation, and modality-adaptive role steering to resolve modality-role interference in multimodal RPAs.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

cs.CL · 2026-04-23 · conditional · novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

cs.HC · 2026-04-20 · unverdicted · novelty 6.0

A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.

SkillDroid: Compile Once, Reuse Forever

cs.HC · 2026-04-16 · conditional · novelty 6.0

SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 rounds while the stateless baseline drops to 44%.

EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices

cs.OS · 2026-04-10 · unverdicted · novelty 6.0

EdgeFlow reduces mobile LLM cold-start latency up to 4.07x versus llama.cpp, MNN, and llm.npu by NPU-aware adaptive quantization, SIMD-friendly packing, and synergistic granular CPU-NPU pipelining at comparable accuracy.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

cs.AI · 2025-12-11 · conditional · novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

UMDAM: A Unified Data Layout and DRAM Address Mapping for Heterogenous NPU-PIM

cs.DC · 2025-11-05 · unverdicted · novelty 6.0

UMDAM introduces a column-major tile-based data layout and configurable DRAM mapping to enable efficient NPU-PIM co-execution for LLM inference, reducing TTFT by up to 3.0x and TTLT by 2.18x on OPT models without added memory overhead or bandwidth loss.

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

cs.HC · 2025-05-06 · unverdicted · novelty 6.0

DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

cs.AI · 2026-06-03 · unverdicted · novelty 5.0

MIRAGE compresses explicit chain-of-thought into latent vectors and adds a generative world model to predict future interface states, matching explicit reasoning performance with 3-5x fewer tokens on Android benchmarks.

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

cs.AI · 2026-06-02 · unverdicted · novelty 5.0

PRPF uses a lightweight Multimodal Proactive Perceptor for intervention gating and context compression, activating the Proactive Agent Reasoner only when needed, reducing false trigger rates and improving efficiency on the ProactiveMobile benchmark.

SCOPE: Real-Time Natural Language Camera Agent at the Edge

cs.RO · 2026-06-01 · unverdicted · novelty 5.0

SCOPE introduces an edge-deployable natural-language PTZ camera agent, a simulation benchmark, and evaluations showing that stronger small language models reduce hallucinations while perception remains the main bottleneck.

Beyond Scaling: Agents Are Heading to the Edge

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models cs.CV · 2025-03-18 · unverdicted · none · ref 50 · internal anchor
TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer