SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Pith reviewed 2026-05-17 10:04 UTC · model grok-4.3
The pith
Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After GUI-grounding pre-training on automatically curated screenshot-instruction pairs, SeeClick achieves large gains on the ScreenSpot benchmark and the improvements transfer to higher success rates on downstream GUI-agent tasks in mobile, desktop, and web settings, establishing a direct correlation between grounding accuracy and overall agent performance.
What carries the argument
GUI grounding, the capacity to locate screen elements from instructions, is strengthened by pre-training on automatically curated data and then carried over to full task sequences.
If this is right
- Visual agents can operate without relying on extractable structured data such as HTML or accessibility trees.
- Pre-training focused on element localization produces measurable gains on standard GUI-agent benchmarks.
- Performance scales with grounding quality across mobile, desktop, and web platforms.
- Automatic curation of grounding data provides a scalable route to better agents without manual annotation.
Where Pith is reading between the lines
- Similar grounding pre-training could be applied to other screenshot-based agents outside the GUI domain.
- The same curation pipeline might be extended to generate even larger or more diverse grounding datasets for further gains.
- If grounding remains the bottleneck, future agent work could prioritize localization objectives over end-to-end policy learning.
Load-bearing premise
The automatically generated GUI grounding examples are high-quality and representative enough to transfer to real agent tasks across different device environments.
What would settle it
Measure grounding accuracy and downstream task success on a new set of environments or tasks; if the correlation between the two disappears or if pre-training yields no transfer gain, the central claim is falsified.
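As a rough illustration of that check, the sketch below computes a Pearson correlation between per-model grounding accuracy and downstream task success. The function name `pearson` and every number in it are hypothetical placeholders, not results from the paper; a real test would use measured scores on the new environments and more than a handful of models.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores, one entry per model checkpoint.
grounding_acc = [0.10, 0.25, 0.40, 0.55]
task_success = [0.20, 0.30, 0.45, 0.50]
r = pearson(grounding_acc, task_success)
print(f"correlation between grounding accuracy and task success: {r:.3f}")
```

A coefficient near zero on held-out environments would count against the claim. A high coefficient alone would still not show causation, which is the point the referee report raises about controls.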
Original abstract
Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement in ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks. The model, data and code are available at https://github.com/njucckevin/SeeClick.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SeeClick, a visual GUI agent that automates tasks from screenshots alone rather than from structured data such as HTML. It identifies GUI grounding as the central challenge, introduces an automated method to curate grounding data for pre-training, and releases ScreenSpot, a new benchmark spanning mobile, desktop, and web environments. The authors report that pre-training yields gains on ScreenSpot relative to baselines and that these grounding improvements correlate with higher success rates on three downstream GUI agent benchmarks. Model, data, and code are publicly released.
Significance. If the empirical claims hold after validation, the work is significant because it supplies the first realistic multi-environment GUI grounding benchmark, demonstrates a practical link between grounding accuracy and agent task performance, and releases reproducible artifacts that can accelerate research on screenshot-based agents.
major comments (2)
- [§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.
- [§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.
minor comments (2)
- [Abstract] Abstract: The phrase 'significant improvement' is used without accompanying numbers or baseline identifiers; adding the key deltas (e.g., ScreenSpot accuracy lift) would make the summary self-contained.
- [§2] Notation: The term 'GUI grounding' is introduced without an explicit formal definition or equation; a short mathematical statement (e.g., mapping instruction to bounding-box coordinates) would aid clarity.
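To make the notation comment concrete, one possible formalization (an assumption for illustration, not the paper's own definition) treats grounding as a function from a screenshot and an instruction to a click point (x, y), scored as correct when the point falls inside the target element's bounding box. The sketch below implements that scoring rule; the names `point_in_box` and `click_accuracy` and the normalized coordinate convention are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]              # (x, y), assumed normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the predicted point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one example is required")
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


# Two hypothetical examples: the first prediction hits, the second misses.
preds = [(0.5, 0.5), (0.9, 0.9)]
boxes = [(0.4, 0.4, 0.6, 0.6), (0.0, 0.0, 0.1, 0.1)]
print(click_accuracy(preds, boxes))
```

Scoring a point against a box, rather than comparing two boxes by overlap, is a design choice that matches how an agent acts: a click only needs to land on the element.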
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that additional quantitative validation for the data curation and stronger controls for the downstream evaluations will help substantiate the central claims. We respond to each major comment below and will incorporate the suggested revisions in the next version of the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.
Authors: We agree that quantitative validation of the curation process is important to rule out label noise or selection bias. In the revised manuscript we will add a dedicated subsection to §3 reporting a manual validation study: a random sample of 500 curated examples will be independently labeled by two human annotators, with precision, recall, and inter-annotator agreement reported. We will also include distribution statistics (e.g., instruction length, element type frequencies) comparing the curated data to ScreenSpot and the three downstream benchmarks to quantify any shift.
Revision: yes
-
Referee: [§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.
Authors: We concur that additional controls are needed to strengthen the causal interpretation. We will revise §5 to include: (i) an ablation that removes the grounding pre-training stage while keeping total training compute comparable by extending the subsequent fine-tuning; (ii) a categorized error analysis of failure modes on the three downstream benchmarks, explicitly linking errors to grounding inaccuracies; and (iii) a compute-matched baseline agent trained with a generic vision-language pre-training objective instead of GUI grounding pre-training. These results will be presented alongside the existing correlations.
Revision: yes
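The manual validation study proposed in the response to the §3 comment could report numbers along the following lines. This is a minimal sketch under stated assumptions: Cohen's kappa is used as the agreement statistic (the rebuttal does not name one), labels are binary "correct / incorrect" judgments per curated example, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def precision_recall(predicted, reference):
    """Precision and recall of boolean predictions against boolean references."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(r and not p for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical judgments on four curated examples (True = label is correct).
annotator_1 = [True, True, False, False]
annotator_2 = [True, False, True, False]
print(cohens_kappa(annotator_1, annotator_2))
print(precision_recall(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, which matters when most curated examples are expected to be correct and two annotators would agree often even by guessing.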
Circularity Check
No significant circularity; empirical results are self-contained
Full rationale
The paper's chain consists of proposing a visual GUI agent, identifying GUI grounding as a challenge via preliminary study, automatically curating grounding data, pre-training SeeClick, releasing the ScreenSpot benchmark, and reporting empirical gains on ScreenSpot plus three downstream agent benchmarks. The central claim of correlation between grounding improvements and agent performance rests on these new evaluations and the released benchmark rather than any equation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs by construction. No load-bearing step exhibits self-definitional, fitted-input, or uniqueness-imported circularity; the work is externally falsifiable through the public model, data, and code.
Axiom & Free-Parameter Ledger
free parameters (1)
- pre-training hyperparameters and data curation thresholds
axioms (1)
- domain assumption: GUI grounding is the key challenge limiting visual GUI agents
Lean theorems connected to this paper
-
Foundation.DimensionForcing dimension_forced (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
SeeClick demonstrates significant improvement in ScreenSpot over various baselines... comprehensive evaluations on three widely used benchmarks consistently support our finding
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
-
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
-
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.
-
BAMI: Training-Free Bias Mitigation in GUI Grounding
BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
-
UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?
VLMs achieve moderate alignment with human gaze on UIs that improves with longer viewing durations and varies by UI type, capturing exploratory rather than initial fixation patterns.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
GTA1: GUI Test-time Scaling Agent
GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
-
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
-
Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
-
UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics
UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
-
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.