{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:JD2ZQ3EIWO2MYOVWEQJMBIJN7O","short_pith_number":"pith:JD2ZQ3EI","schema_version":"1.0","canonical_sha256":"48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c","source":{"kind":"arxiv","id":"2401.13649","version":2},"attestation_state":"computed","paper":{"title":"VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.LG","authors_text":"Daniel Fried, Graham Neubig, Jing Yu Koh, Lawrence Jang, Ming Chong Lim, Po-Yu Huang, Robert Lo, Ruslan Salakhutdinov, Shuyan Zhou, Vikram Duvvur","submitted_at":"2024-01-24T18:35:21Z","abstract_excerpt":"Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2401.13649","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2024-01-24T18:35:21Z","cross_cats_sorted":["cs.CL","cs.CV"],"title_canon_sha256":"e72169dc7b8a326afcf8786f234787d837b6ecd811d6f82b47c1099b40105905","abstract_canon_sha256":"09a974b7f3b516863a9fc0ccfb802d41251178a15031c78910b671d935ac6d7f"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.707465Z","signature_b64":"dsIQk03hyfHBS6Aocmq1iBX1SCnCkgUFYGPva3va5FWg1PTc3276Be3URFE7HSW3LQqp0zMuFxF45VOZmM4YBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"48f5986c88b3b4cc3ab62412c0a12dfb879cad22d6d6ea688bd1aba900c7a54c","last_reissued_at":"2026-05-17T23:38:13.706760Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.706760Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.LG","authors_text":"Daniel Fried, Graham Neubig, Jing Yu Koh, Lawrence Jang, Ming Chong Lim, Po-Yu Huang, Robert Lo, Ruslan Salakhutdinov, Shuyan Zhou, Vikram Duvvur","submitted_at":"2024-01-24T18:35:21Z","abstract_excerpt":"Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen websites and task templates are sufficiently representative of the visual and interaction challenges encountered in real-world web use.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b89ccf53e765443181b358931ae80485701319525bc8efbde018b682bd39cbc1"},"source":{"id":"2401.13649","kind":"arxiv","version":2},"verdict":{"id":"2c45d65f-5efe-472d-bece-749dde7a857e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T15:15:32.796807Z","strongest_claim":"Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents.","one_line_summary":"VisualWebArena benchmark demonstrates that state-of-the-art multimodal agents still exhibit significant limitations on visually grounded web tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen websites and task templates are sufficiently representative of the visual and interaction challenges encountered in real-world web use.","pith_extraction_headline":"VisualWebArena shows that multimodal agents still struggle with visually grounded web tasks."},"references":{"count":26,"sample":[{"doi":"","year":null,"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","ref_index":1,"cited_arxiv_id":"2210.11416","is_internal_anchor":true},{"doi":"","year":1996,"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","ref_index":2,"cited_arxiv_id":"2312.11805","is_internal_anchor":true},{"doi":"","year":null,"title":"Language models can solve computer tasks. NeurIPS. Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi","work_id":"de63ec70-7c06-4692-b530-717ead70ef4c","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"Improved Baselines with Visual Instruction Tuning","work_id":"5baeaa33-5986-44a3-85a4-fcabd6fc1e8d","ref_index":4,"cited_arxiv_id":"2310.03744","is_internal_anchor":true},{"doi":"","year":2023,"title":"GAIA: a benchmark for General AI Assistants","work_id":"cf222b33-f7a3-4044-a570-ecfe25edb3f8","ref_index":5,"cited_arxiv_id":"2311.12983","is_internal_anchor":true}],"resolved_work":26,"snapshot_sha256":"f126cd73e6c90be0867058db5995fd496c4fcc8b669627362d7451f1717707df","internal_anchors":5},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d12cf10fc30b389f09f9e2bec561364ef1826a3adee0e8bc033a3971919238af"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2401.13649","created_at":"2026-05-17T23:38:13.706888+00:00"},{"alias_kind":"arxiv_version","alias_value":"2401.13649v2","created_at":"2026-05-17T23:38:13.706888+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2401.13649","created_at":"2026-05-17T23:38:13.706888+00:00"},{"alias_kind":"pith_short_12","alias_value":"JD2ZQ3EIWO2M","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"JD2ZQ3EIWO2MYOVW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"JD2ZQ3EI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2505.16120","citing_title":"LLM-Powered AI Agent Systems and Their Applications in Industry","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23678","citing_title":"Grounded Reinforcement Learning for Visual Reasoning","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13727","citing_title":"From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16116","citing_title":"ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2406.12373","citing_title":"WebCanvas: Benchmarking Web Agents in Online Environments","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18048","citing_title":"DocOS: Towards Proactive Document-Guided Actions in GUI Agents","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19538","citing_title":"CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2505.19662","citing_title":"FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2506.02387","citing_title":"VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10073","citing_title":"SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2503.09572","citing_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.12538","citing_title":"Agentic Reasoning for Large Language Models","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2408.10188","citing_title":"LongVILA: Scaling Long-Context Visual Language Models for Long Videos","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09571","citing_title":"Tuning Qwen2.5-VL to Improve Its Web Interaction Skills","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05044","citing_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04601","citing_title":"Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.21362","citing_title":"AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11212","citing_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14573","citing_title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11212","citing_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07972","citing_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10966","citing_title":"MMTB: Evaluating Terminal Agents on Multimedia-File Tasks","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06365","citing_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04777","citing_title":"Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18543","citing_title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O","json":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O.json","graph_json":"https://pith.science/api/pith-number/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/graph.json","events_json":"https://pith.science/api/pith-number/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/events.json","paper":"https://pith.science/paper/JD2ZQ3EI"},"agent_actions":{"view_html":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O","download_json":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O.json","view_paper":"https://pith.science/paper/JD2ZQ3EI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2401.13649&json=true","fetch_graph":"https://pith.science/api/pith-number/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/graph.json","fetch_events":"https://pith.science/api/pith-number/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/action/timestamp_anchor","attest_storage":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/action/storage_attestation","attest_author":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/action/author_attestation","sign_citation":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/action/citation_signature","submit_replication":"https://pith.science/pith/JD2ZQ3EIWO2MYOVWEQJMBIJN7O/action/replication_record"}},"created_at":"2026-05-17T23:38:13.706888+00:00","updated_at":"2026-05-17T23:38:13.706888+00:00"}