SeeClick (Cheng et al., 2024) focused on finetuning an LMM to solely leverage screenshots as inputs to interact … Imagine you are a robot browsing the web, just like humans

Extends WebArena with additional websites and tasks that focus on visual reasoning, to facilitate research on vision-based web agents · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

cs.CL · 2024-01-25 · unverdicted · novelty 6.0

WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgments 85 percent of the time.
