VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, Daniel Fried · 2024

1 Pith paper cites this work. Polarity classification is still being indexed.

Representative citing papers

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

cs.AI · 2026-05-16 · unverdicted · novelty 5.0

MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
