hub Canonical reference

Gpt-4v in wonderland: Large multi- modal models for zero-shot smartphone gui navigation

Yan, A · 2023 · arXiv 2311.07562

Canonical reference. 83% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 83% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 1

citation-polarity summary

background 5 use method 1

representative citing papers

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

cs.AI · 2026-06-29 · unverdicted · novelty 6.0

GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

cs.AI · 2025-12-11 · conditional · novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking

cs.HC · 2025-05-06 · unverdicted · novelty 6.0

DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

cs.RO · 2024-12-13 · conditional · novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

Crepe: A Mobile Screen Data Collector Using Graph Query

cs.HC · 2024-06-23 · unverdicted · novelty 6.0

Crepe introduces a graph-query technique in a no-code Android app for flexible collection of targeted mobile screen data with privacy and consent controls.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

cs.AI · 2026-05-15 · unverdicted · novelty 5.0

DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

cs.AI · 2025-01-27 · unverdicted · novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

AppAgent: Multimodal Agents as Smartphone Users

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Rethinking Agentic Reinforcement Learning In Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 3.0 · 3 refs

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

cs.HC · 2024-01-10 · unverdicted · novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

LLM-Powered AI Agent Systems and Their Applications in Industry

cs.AI · 2025-05-22 · unverdicted · novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

citing papers explorer

Showing 2 of 2 citing papers after filters.

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction cs.CV · 2026-04-27 · unverdicted · none · ref 7
GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.
AppAgent: Multimodal Agents as Smartphone Users cs.CV · 2023-12-21 · unverdicted · none · ref 84
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.

Gpt-4v in wonderland: Large multi- modal models for zero-shot smartphone gui navigation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer