arxiv: 2401.10935 · v2 · pith:CVVQ5U6Znew · submitted 2024-01-17 · 💻 cs.HC · cs.AI

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

Pith reviewed 2026-05-17 10:04 UTC · model grok-4.3

keywords GUI agents · visual grounding · screenshot-based agents · GUI grounding benchmark · pre-training · ScreenSpot · task automation · visual interfaces

The pith

Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeeClick, a visual GUI agent that completes complex tasks on phones, desktops, and web browsers using only screenshots rather than structured data such as HTML. It identifies accurate localization of on-screen elements from instructions as the central bottleneck and solves it by pre-training on large amounts of automatically generated grounding examples. A new benchmark called ScreenSpot measures grounding across realistic mobile, desktop, and web interfaces, and three standard agent benchmarks show consistent gains once grounding improves. A sympathetic reader would care because this removes the need for accessible structured data and suggests that grounding skill is a transferable foundation for reliable visual automation.

Core claim

After GUI-grounding pre-training on automatically curated screenshot-instruction pairs, SeeClick achieves large gains on the ScreenSpot benchmark. The improvements transfer to higher success rates on downstream GUI-agent tasks in mobile, desktop, and web settings, which the paper reads as a direct correlation between grounding accuracy and overall agent performance.

What carries the argument

GUI grounding—the capacity to locate screen elements from instructions—which is strengthened by pre-training on automatically curated data and then transferred to full task sequences.
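
One common way to score grounding is to count a prediction as correct when the predicted click point lands inside the target element's bounding box. The sketch below assumes that scoring rule; `Box`, `click_accuracy`, and the sample values are hypothetical illustrations, not taken from the SeeClick code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border counts as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def click_accuracy(predictions, targets):
    """Fraction of predicted (x, y) points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(predictions)


# Illustrative values only, not data from the paper.
targets = [Box(10, 10, 50, 30), Box(100, 200, 180, 240), Box(0, 0, 20, 20)]
predictions = [(30, 20), (90, 220), (5, 5)]
accuracy = click_accuracy(predictions, targets)
```

The second prediction misses its box on the x axis, so two of the three samples count as hits.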

If this is right

Visual agents can operate without relying on extractable structured data such as HTML or accessibility trees.
Pre-training focused on element localization produces measurable gains on standard GUI-agent benchmarks.
Performance scales with grounding quality across mobile, desktop, and web platforms.
Automatic curation of grounding data provides a scalable route to better agents without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar grounding pre-training could be applied to other screenshot-based agents outside the GUI domain.
The same curation pipeline might be extended to generate even larger or more diverse grounding datasets for further gains.
If grounding remains the bottleneck, future agent work could prioritize localization objectives over end-to-end policy learning.

Load-bearing premise

The automatically generated GUI grounding examples are high-quality and representative enough to transfer to real agent tasks across different device environments.

What would settle it

Measure grounding accuracy and downstream task success on a new set of environments or tasks; if the correlation between the two disappears or if pre-training yields no transfer gain, the central claim is falsified.
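
The check described above can be sketched as a correlation between per-model grounding accuracy and downstream task success. This is a minimal illustration that assumes one paired measurement per model or checkpoint; the `pearson` helper and all numbers are hypothetical and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measurements: one pair per model or checkpoint.
grounding_accuracy = [0.20, 0.35, 0.50, 0.65]
task_success = [0.10, 0.18, 0.24, 0.33]
r = pearson(grounding_accuracy, task_success)
```

A coefficient near zero on new environments would count against the claim. A high coefficient alone would not show causation, which is why the referee report further down asks for ablations.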

Original abstract

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement in ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks. The model, data and code are available at https://github.com/njucckevin/SeeClick.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SeeClick adds a visual-only GUI agent with auto-curated grounding pre-training and the ScreenSpot benchmark, but the curation step lacks reported quality checks that would confirm the claimed transfer to agent tasks.

SeeClick is a screenshot-only GUI agent that pre-trains for GUI grounding using automatically collected data and introduces ScreenSpot, a benchmark spanning mobile, desktop, and web. The release of model, data, and code is the clearest practical step forward here, since others can now test the approach directly on their own setups. The abstract reports gains on ScreenSpot and a correlation between grounding accuracy and performance on three existing agent benchmarks, which matches the reasonable expectation that locating elements correctly should help downstream control tasks. The soft spot is the automatic curation pipeline itself. No precision or recall numbers against human labels appear, nor any statistics on how well the curated examples match real user instructions or cover edge cases across environments. If the curation adds systematic noise or skips rare UI patterns, the observed correlation could be weaker than it looks. The stress-test note on this point holds up from the abstract alone. This is for researchers building visual automation tools or testing GUI agents without structured access. A reader who wants a ready benchmark and open code to experiment with will get immediate value. It should go to peer review because the idea is concrete, the resources are public, and the experiments can be checked and extended once the data-quality details are filled in.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SeeClick, a visual GUI agent that automates tasks from screenshots alone rather than from structured data such as HTML. It identifies GUI grounding as the central challenge, introduces an automated method to curate grounding data for pre-training, and releases ScreenSpot, a new benchmark spanning mobile, desktop, and web environments. The authors report that pre-training yields gains on ScreenSpot relative to baselines and that these grounding improvements correlate with higher success rates on three downstream GUI agent benchmarks. Model, data, and code are publicly released.

Significance. If the empirical claims hold after validation, the work is significant because it supplies the first realistic multi-environment GUI grounding benchmark, demonstrates a practical link between grounding accuracy and agent task performance, and releases reproducible artifacts that can accelerate research on screenshot-based agents.

major comments (2)

[§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.
[§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.

minor comments (2)

[Abstract] Abstract: The phrase 'significant improvement' is used without accompanying numbers or baseline identifiers; adding the key deltas (e.g., ScreenSpot accuracy lift) would make the summary self-contained.
[§2] Notation: The term 'GUI grounding' is introduced without an explicit formal definition or equation; a short mathematical statement (e.g., mapping instruction to bounding-box coordinates) would aid clarity.
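
One possible statement of the kind requested in the [§2] comment, offered as an editorial sketch rather than the paper's own notation. The symbols $s$ (screenshot), $u$ (instruction), $f_\theta$ (model), and $b$ (target box) are assumptions introduced here for illustration.

```latex
% s: screenshot, u: instruction, b = (x_1, y_1, x_2, y_2): target element's box
\[
  f_\theta : (s, u) \mapsto (\hat{x}, \hat{y}),
  \qquad
  \mathrm{correct}(s, u, b)
  = \mathbb{1}\!\left[\, x_1 \le \hat{x} \le x_2 \ \wedge\ y_1 \le \hat{y} \le y_2 \,\right]
\]
```

Under this reading, grounding accuracy on a benchmark is the mean of $\mathrm{correct}$ over its samples.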

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional quantitative validation for the data curation and stronger controls for the downstream evaluations will help substantiate the central claims. We respond to each major comment below and will incorporate the suggested revisions in the next version of the manuscript.

Referee: [§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.

Authors: We agree that quantitative validation of the curation process is important to rule out label noise or selection bias. In the revised manuscript we will add a dedicated subsection to §3 reporting a manual validation study: a random sample of 500 curated examples will be independently labeled by two human annotators, with precision, recall, and inter-annotator agreement reported. We will also include distribution statistics (e.g., instruction length, element type frequencies) comparing the curated data to ScreenSpot and the three downstream benchmarks to quantify any shift.
revision: yes

Referee: [§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.

Authors: We concur that additional controls are needed to strengthen the causal interpretation. We will revise §5 to include: (i) an ablation that removes the grounding pre-training stage while keeping total training compute comparable by extending the subsequent fine-tuning; (ii) a categorized error analysis of failure modes on the three downstream benchmarks, explicitly linking errors to grounding inaccuracies; and (iii) a compute-matched baseline agent trained with a generic vision-language pre-training objective instead of GUI grounding pre-training. These results will be presented alongside the existing correlations. revision: yes
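
The manual validation study promised in the first response could be scored along the following lines. This is a sketch under stated assumptions: each sampled example gets a boolean "valid" label from each annotator, agreement is measured with Cohen's kappa (the rebuttal does not name a statistic), and the function names and labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at their own rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def precision_recall(kept, gold):
    """Precision and recall of the pipeline's keep decisions against gold labels."""
    tp = sum(k and g for k, g in zip(kept, gold))
    fp = sum(k and not g for k, g in zip(kept, gold))
    fn = sum(g and not k for k, g in zip(kept, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels for eight candidate examples (True = valid grounding pair).
annotator_a = [True, True, False, True, False, True, True, False]
annotator_b = [True, True, False, False, False, True, True, True]
pipeline_kept = [True, True, True, True, False, True, False, False]

kappa = cohens_kappa(annotator_a, annotator_b)
precision, recall = precision_recall(pipeline_kept, annotator_a)
```

Recall is only measurable if the sample includes candidates the pipeline rejected. A sample drawn solely from kept examples supports a precision estimate and nothing more.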

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

The paper's chain consists of proposing a visual GUI agent, identifying GUI grounding as a challenge via preliminary study, automatically curating grounding data, pre-training SeeClick, releasing the ScreenSpot benchmark, and reporting empirical gains on ScreenSpot plus three downstream agent benchmarks. The central claim of correlation between grounding improvements and agent performance rests on these new evaluations and the released benchmark rather than any equation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs by construction. No load-bearing step exhibits self-definitional, fitted-input, or uniqueness-imported circularity; the work is externally falsifiable through the public model, data, and code.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that GUI grounding is the primary bottleneck and on standard machine-learning training procedures whose specific hyperparameters are not detailed in the abstract.

free parameters (1)

pre-training hyperparameters and data curation thresholds
Standard deep-learning choices required to produce the reported improvements but not enumerated in the abstract.

axioms (1)

domain assumption: GUI grounding is the key challenge limiting visual GUI agents
Identified via preliminary study and used to motivate the pre-training approach.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.DimensionForcing dimension_forced unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

SeeClick demonstrates significant improvement in ScreenSpot over various baselines... comprehensive evaluations on three widely used benchmarks consistently support our finding

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
cs.LG 2026-04 conditional novelty 7.0

GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
cs.CV 2025-04 unverdicted novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
cs.AI 2024-05 accept novelty 7.0

AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
BAMI: Training-Free Bias Mitigation in GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?
cs.HC 2026-04 accept novelty 6.0

VLMs achieve moderate alignment with human gaze on UIs that improves with longer viewing durations and varies by UI type, capturing exploratory rather than initial fixation patterns.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
cs.AI 2025-12 conditional novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI 2025-10 unverdicted novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
GTA1: GUI Test-time Scaling Agent
cs.AI 2025-07 unverdicted novelty 6.0

GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
cs.AI 2025-04 unverdicted novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
cs.CV 2026-04 unverdicted novelty 5.0

Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
Less Detail, Better Answers: Degradation-Driven Prompting for VQA
cs.CV 2026-04 unverdicted novelty 5.0

Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
cs.LG 2026-02 unverdicted novelty 5.0

UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
cs.HC 2024-01 unverdicted novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 21 Pith papers · 24 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Journal of the ACM (JACM)28(1), 114–133 (1981) https://doi.org/10.1145/322234.322243 24

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

arXiv preprint arXiv:2311.11797 , year=

Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents , author=. arXiv preprint arXiv:2311.11797 , year=

work page arXiv
[9]

International Conference on Machine Learning , pages=

World of bits: An open-domain platform for web-based agents , author=. International Conference on Machine Learning , pages=. 2017 , url=

work page 2017
[10]

International Conference on Learning Representations , year=

Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration , author=. International Conference on Learning Representations , year=

work page
[11]

International Conference on Learning Representations , year=

Learning to Navigate the Web , author=. International Conference on Learning Representations , year=

work page
[13]

NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , author=. NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

work page 2023
[16]

ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models , year=

Understanding HTML with Large Language Models , author=. ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models , year=

work page 2023
[20]

AppAgent: Multimodal Agents as Smartphone Users

AppAgent: Multimodal Agents as Smartphone Users , author=. arXiv preprint arXiv:2312.13771 , year=

work page internal anchor Pith review arXiv
[21]

Advances in Neural Information Processing Systems , year=

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces , author=. Advances in Neural Information Processing Systems , year=

work page
[23]

Neural Information Processing Systems , year =

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , title =. Neural Information Processing Systems , year =

work page
[26]

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition , author=. arXiv preprint arXiv:2309.15112 , year=

work page internal anchor Pith review arXiv
[28]

International Conference on Learning Representations , year=

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

work page
[34]

CogVLM: Visual Expert for Pretrained Language Models

Cogvlm: Visual expert for pretrained language models , author=. arXiv preprint arXiv:2311.03079 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

The 34th Annual ACM Symposium on User Interface Software and Technology , pages=

Screen2words: Automatic mobile UI summarization with multimodal learning , author=. The 34th Annual ACM Symposium on User Interface Software and Technology , pages=. 2021 , url=

work page 2021
[37]

Proceedings of the AAAI Conference on Artificial Intelligence , pages=

Actionbert: Leveraging user actions for semantic understanding of user interfaces , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=. 2021 , url=

work page 2021
[38]

arXiv preprint arXiv:2107.13731 , year=

Uibert: Learning generic multimodal representations for ui understanding , author=. arXiv preprint arXiv:2107.13731 , year=

work page arXiv
[40]

Proceedings of the 29th International Conference on Computational Linguistics , pages=

Towards Better Semantic Understanding of Mobile Interfaces , author=. Proceedings of the 29th International Conference on Computational Linguistics , pages=. 2022 , url=

work page 2022
[41]

proceedings of the 28th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=

Object detection for graphical user interface: Old fashioned or deep learning or a combination? , author=. proceedings of the 28th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=. 2020 , url=

work page 2020
[43]

European Conference on Computer Vision , pages=

A dataset for interactive vision-language navigation with unknown command feasibility , author=. European Conference on Computer Vision , pages=. 2022 , organization=

work page 2022
[44]

The Eleventh International Conference on Learning Representations , year=

Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus , author=. The Eleventh International Conference on Learning Representations , year=

work page
[46]

Proceedings of the 30th annual ACM symposium on user interface software and technology , pages=

Rico: A mobile app dataset for building data-driven design applications , author=. Proceedings of the 30th annual ACM symposium on user interface software and technology , pages=. 2017 , url=

work page 2017
[49]

International Conference on Learning Representations , year=

Pix2seq: A Language Modeling Framework for Object Detection , author=. International Conference on Learning Representations , year=

work page
[50]

International Conference on Machine Learning , pages=

Pix2struct: Screenshot parsing as pretraining for visual language understanding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[51]

The 2023 Conference on Empirical Methods in Natural Language Processing , year=

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

work page 2023
[52]

International Conference on Learning Representations , year=

LoRA: Low-Rank Adaptation of Large Language Models , author=. International Conference on Learning Representations , year=

work page
[53]

Introducing our Multimodal Models , url =

Bavishi, Rohan and Elsen, Erich and Hawthorne, Curtis and Nye, Maxwell and Odena, Augustus and Somani, Arushi and Ta. Introducing our Multimodal Models , url =

work page
[58]

International Journal of Computer Vision , volume=

Top-down neural attention by excitation backprop , author=. International Journal of Computer Vision , volume=. 2018 , publisher=

work page 2018
[59]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Grounded language-image pre-training , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=. 2022 , url=

work page 2022
[60]

Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

Referitgame: Referring to objects in photographs of natural scenes , author=. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=. 2014 , url=

work page 2014
[66]

2024 , eprint=

OS-Copilot: Towards Generalist Computer Agents with Self-Improvement , author=. 2024 , eprint=

work page 2024
[67]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. https://arxiv.org/pdf/2308.12966 Qwen-vl: A frontier large vision-language model with versatile abilities . arXiv preprint arXiv:2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa g nak Ta s rlar. 2023. https://www.adept.ai/blog/fuyu-8b Introducing our multimodal models

work page 2023
[69]

Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2022. https://arxiv.org/pdf/2202.02312 A dataset for interactive vision-language navigation with unknown command feasibility . In European Conference on Computer Vision, pages 312--328. Springer

work page arXiv 2022
[70]

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023 a . https://arxiv.org/pdf/2310.09478 Minigpt-v2: large language model as a unified interface for vision-language multi-task learning . arXiv preprint arXiv:2310.09478

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023 b . https://arxiv.org/pdf/2306.15195 Shikra: Unleashing multimodal llm's referential dialogue magic . arXiv preprint arXiv:2306.15195

work page internal anchor Pith review Pith/arXiv arXiv 2023
[72]

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2021. https://arxiv.org/pdf/2109.10852 Pix2seq: A language modeling framework for object detection . In International Conference on Learning Representations

work page arXiv 2021
[73]

Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. https://dl.acm.org/doi/pdf/10.1145/3126594.3126651 Rico: A mobile app dataset for building data-driven design applications . In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845--854

work page doi:10.1145/3126594.3126651 2017
[74]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. https://arxiv.org/pdf/2306.06070 Mind2web: Towards a generalist agent for the web . arXiv preprint arXiv:2306.06070

work page internal anchor Pith review Pith/arXiv arXiv 2023
[75]

Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 2023. https://arxiv.org/pdf/2305.11854 Multimodal web navigation with instruction-finetuned foundation models . arXiv preprint arXiv:2305.11854

[77]

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. https://arxiv.org/pdf/2307.12856 A real-world webagent with planning, long context understanding, and program synthesis . arXiv preprint arXiv:2307.12856

[78]

Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. 2018. https://arxiv.org/pdf/1812.09195 Learning to navigate the web . In International Conference on Learning Representations

[79]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2023. https://arxiv.org/pdf/2312.08914 Cogagent: A visual language model for gui agents . arXiv preprint arXiv:2312.08914

[80]

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. https://arxiv.org/pdf/2106.09685.pdf Lora: Low-rank adaptation of large language models . In International Conference on Learning Representations

[81]

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. https://arxiv.org/pdf/2303.17491 Language models can solve computer tasks . arXiv preprint arXiv:2303.17491

[82]

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. https://arxiv.org/pdf/2305.03726 Otter: A multi-modal model with in-context instruction tuning . arXiv preprint arXiv:2305.03726

[83]

Gang Li and Yang Li. 2022. https://arxiv.org/pdf/2209.14927 Spotlight: Mobile ui understanding using vision-language models with a focus . In The Eleventh International Conference on Learning Representations

[84]

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. http://openaccess.thecvf.com/content/CVPR2022/papers/Li_Grounded_Language-Image_Pre-Training_CVPR_2022_paper.pdf Grounded language-image pre-training . In Proceedings of the IEEE/CVF Conference on Compute...

[85]

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. https://arxiv.org/pdf/2005.03776 Mapping natural language instructions to mobile ui action sequences . arXiv preprint arXiv:2005.03776

[86]

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. https://arxiv.org/pdf/2010.04295 Widget captioning: Generating natural language description for mobile user interface elements . arXiv preprint arXiv:2010.04295

[87]

Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021. https://arxiv.org/pdf/2112.05692 Vut: Versatile ui transformer for multi-modal multi-task user interface modeling . arXiv preprint arXiv:2112.05692

[88]

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. https://arxiv.org/pdf/1802.08802 Reinforcement learning on web interfaces using workflow-guided exploration . In International Conference on Learning Representations

[89]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. https://arxiv.org/pdf/2304.08485 Visual instruction tuning . In Neural Information Processing Systems

[90]

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023b. https://arxiv.org/pdf/2307.06281 Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281

[91]

OpenAI. 2023. http://arxiv.org/abs/2303.08774 GPT-4 technical report

[92]

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. https://arxiv.org/pdf/2306.14824 Kosmos-2: Grounding multimodal large language models to the world . arXiv preprint arXiv:2306.14824

[93]

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. https://arxiv.org/pdf/2307.10088 Android in the wild: A large-scale dataset for android device control . arXiv preprint arXiv:2307.10088

[94]

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. https://arxiv.org/abs/2306.00245 From pixels to ui actions: Learning to follow instructions via graphical user interfaces . In Advances in Neural Information Processing Systems

[95]

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. https://proceedings.mlr.press/v70/shi17a.html World of bits: An open-domain platform for web-based agents . In International Conference on Machine Learning, pages 3135--3144. PMLR

[96]

Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. https://arxiv.org/pdf/2310.00280 Corex: Pushing the boundaries of complex reasoning through multi-model collaboration . arXiv preprint arXiv:2310.00280

[97]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. https://arxiv.org/pdf/2302.13971 Llama: Open and efficient foundation language models . arXiv preprint arXiv:2302.13971

[98]

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. https://dl.acm.org/doi/pdf/10.1145/3472749.3474765 Screen2words: Automatic mobile ui summarization with multimodal learning . In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498--510

[99]

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. https://arxiv.org/pdf/2305.11175 Visionllm: Large language model is also an open-ended decoder for vision-centric tasks . arXiv preprint arXiv:2305.11175

[100]

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. http://arxiv.org/abs/2402.07456 Os-copilot: Towards generalist computer agents with self-improvement

[101]

Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. 2023. https://arxiv.org/pdf/2311.09278 Symbol-llm: Towards foundational symbol-centric interface for large language models . arXiv preprint arXiv:2311.09278

[102]

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. https://arxiv.org/pdf/2311.07562 Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation . arXiv preprint arXiv:2311.07562

[103]

Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. https://arxiv.org/pdf/2312.13108 Appagent: Multimodal agents as smartphone users . arXiv preprint arXiv:2312.13771

[104]

Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. https://www.stableaiprompts.com/wp-content/uploads/2023/10/Chatgpt-Updates.pdf The dawn of lmms: Preliminary explorations with gpt-4v (ision) . arXiv preprint arXiv:2309.17421, 9(1):1

[105]

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. https://arxiv.org/pdf/2304.14178.pdf?trk=public_post_comment-text mplug-owl: Modularization empowers large language models with multimodality . arXiv preprint arXiv:2304.14178

[106]

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. https://arxiv.org/pdf/2308.02490 Mm-vet: Evaluating large multimodal models for integrated capabilities . arXiv preprint arXiv:2308.02490

[107]

Zhuosheng Zhang and Aston Zhang. 2023. https://arxiv.org/pdf/2309.11436 You only look at screens: Multimodal chain-of-action agents . arXiv preprint arXiv:2309.11436

[108]

Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. 2023. https://arxiv.org/pdf/2310.04716 Reinforced ui instruction grounding: Towards a generic ui task automation api . arXiv preprint arXiv:2310.04716

[109]

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. https://arxiv.org/abs/2401.01614 Gpt-4v (ision) is a generalist web agent, if grounded . arXiv preprint arXiv:2401.01614

[110]

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2023. https://ltzheng.github.io/Synapse/static/Synapse.pdf Synapse: Trajectory-as-exemplar prompting with memory for computer control . In NeurIPS 2023 Foundation Models for Decision Making Workshop

[111]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. https://arxiv.org/pdf/2307.13854 Webarena: A realistic web environment for building autonomous agents . arXiv preprint arXiv:2307.13854

