{"paper":{"title":"Mind2Web: Towards a Generalist Agent for the Web","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mind2Web supplies over 2000 real-world tasks on 137 live websites so language models can act as generalist agents that follow instructions across unseen sites and domains.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Boshi Wang, Boyuan Zheng, Huan Sun, Samuel Stevens, Shijie Chen, Xiang Deng, Yu Gu, Yu Su","submitted_at":"2023-06-09T17:44:31Z","abstract_excerpt":"We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Mind2Web provides three necessary ingredients for building generalist web agents: diverse domains, websites, and tasks; use of real-world websites instead of simulated ones; and a broad spectrum of user interaction patterns. 
LLMs with HTML filtering by a small LM achieve decent performance even on unseen websites or domains.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the crowdsourced action sequences collected from workers accurately capture the steps a typical user would take to complete each open-ended task on live websites, and that the 137 sites sufficiently represent the diversity needed for generalization to arbitrary new sites.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mind2Web supplies over 2000 real-world tasks on 137 live websites so language models can act as generalist agents that follow instructions across unseen sites and domains.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"37cc697a3a3c8bbddda49f537f6cacbf003710aca6f25fc168157319c752d256"},"source":{"id":"2306.06070","kind":"arxiv","version":3},"verdict":{"id":"f7056531-aa8f-45b7-8a78-128e1deaa68d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:02:09.262992Z","strongest_claim":"Mind2Web provides three necessary ingredients for building generalist web agents: diverse domains, websites, and tasks; use of real-world websites instead of simulated ones; and a broad spectrum of user interaction patterns. 
LLMs with HTML filtering by a small LM achieve decent performance even on unseen websites or domains.","one_line_summary":"Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the crowdsourced action sequences collected from workers accurately capture the steps a typical user would take to complete each open-ended task on live websites, and that the 137 sites sufficiently represent the diversity needed for generalization to arbitrary new sites.","pith_extraction_headline":"Mind2Web supplies over 2000 real-world tasks on 137 live websites so language models can act as generalist agents that follow instructions across unseen sites and domains."},"references":{"count":45,"sample":[{"doi":"","year":2021,"title":"Puppeteer headless chrome node.js api. https://github.com/puppeteer/puppeteer, 2021","work_id":"a5824c85-53a2-4ade-b04e-371de7ad9c44","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2204.01691","year":2022,"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","ref_index":2,"cited_arxiv_id":"2204.01691","is_internal_anchor":true},{"doi":"","year":2021,"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","ref_index":3,"cited_arxiv_id":"2108.07258","is_internal_anchor":true},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"6b6d3f79-d100-4af7-8cb8-c2670a73c7f5","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. A dataset for interactive vision-language navigation with unknown command feasibility. 
In European Confe","work_id":"2738624f-90fc-4a0a-a90b-3dee81d465bd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":45,"snapshot_sha256":"a14791859106fb749491bda0ae494f97a48bea18ab70fca3459475b5baf44e28","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0fbfd475296b6e8dc07c9ce6b4886871895ad000fb5c614bdb58a4ee123b13df"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}