hub Canonical reference

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

Canonical reference. 83% of the classified citations from Pith papers cite this work as background.

15 Pith papers citing it
Background 83% of classified citations

hub tools

citation-role summary

background 5 · method 1

citation-polarity summary

representative citing papers

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

cs.AI · 2026-05-15 · unverdicted · novelty 5.0

DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.

AppAgent: Multimodal Agents as Smartphone Users

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
