You only look at screens: Multimodal chain-of-action agents

Zhuosheng Zhan, Aston Zhang · 2023 · arXiv 2309.11436

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

cs.HC · 2024-01-10 · unverdicted · novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

citing papers explorer

Showing 4 of 4 citing papers.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 47
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents cs.HC · 2024-01-17 · unverdicted · none · ref 107
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 103
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey cs.CV · 2025-03-16 · unverdicted · none · ref 143
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

You only look at screens: Multimodal chain-of-action agents

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer