{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:B6N5RPC67O33FJY2I5FPMYLZZO","short_pith_number":"pith:B6N5RPC6","schema_version":"1.0","canonical_sha256":"0f9bd8bc5efbb7b2a71a474af66179cb8b53111d2184deeaedb5e532799e08ad","source":{"kind":"arxiv","id":"2403.07718","version":5},"attestation_state":"computed","paper":{"title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Alexandre Drouin, Alexandre Lacoste, David Vazquez, Issam H. Laradji, L\\'eo Boisvert, Manuel Del Verme, Massimo Caccia, Maxime Gasse, Megh Thakkar, Nicolas Chapados, Quentin Cappart, Tom Marty","submitted_at":"2024-03-12T14:58:45Z","abstract_excerpt":"We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while curren"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2403.07718","kind":"arxiv","version":5},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2024-03-12T14:58:45Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"6ac53eeabc9ba4a7957514da4595c3bd216575a61e7de3fd99f2fd3b9d5a0af2","abstract_canon_sha256":"241cc0cc95b853603ea2fb29976c470fc5f752468f33e0ea0bfdf7a31e2cb398"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.769963Z","signature_b64":"+ixWyl9YBe7Eg3koRFjX8ZiY7UGxG+rfZV8eogzqAjAEt5PORpnnA+to4h2IUGCnqggNIn9Mr4gnPh3y09zQCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0f9bd8bc5efbb7b2a71a474af66179cb8b53111d2184deeaedb5e532799e08ad","last_reissued_at":"2026-05-17T23:38:53.769379Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.769379Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Alexandre Drouin, Alexandre Lacoste, David Vazquez, Issam H. Laradji, L\\'eo Boisvert, Manuel Del Verme, Massimo Caccia, Maxime Gasse, Megh Thakkar, Nicolas Chapados, Quentin Cappart, Tom Marty","submitted_at":"2024-03-12T14:58:45Z","abstract_excerpt":"We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while curren"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 33 tasks chosen for WorkArena are representative of the typical daily work of knowledge workers utilizing enterprise software systems.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"99eae5455327ca52df7e426517b3caea16eaa209b0b3b877ef246bfc8ca5c8bc"},"source":{"id":"2403.07718","kind":"arxiv","version":5},"verdict":{"id":"59c8eea2-8330-41f8-b393-4a1b9036dca7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:44:17.853901Z","strongest_claim":"while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs","one_line_summary":"WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 33 tasks chosen for WorkArena are representative of the typical daily work of knowledge workers utilizing enterprise software systems.","pith_extraction_headline":"Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation"},"references":{"count":36,"sample":[{"doi":"","year":2023,"title":"The unsolved challenges of LLM s in open-ended web tasks: A case study","work_id":"5ff9f205-dfd5-47b6-bf24-94b6b13e38d2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2016,"title":"Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym, 2016","work_id":"83339e82-95f2-40c8-8327-942e29e38e35","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Mind2Web: Towards a Generalist Agent for the Web","work_id":"e26f5a00-c007-439d-83f6-7900f5687b6b","ref_index":3,"cited_arxiv_id":"2306.06070","is_internal_anchor":true},{"doi":"","year":2023,"title":"Multimodal web navigation with instruction-finetuned foundation models","work_id":"0f8b8630-9215-4cb8-9b7d-e58e6b1f7bbb","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Chrome devtools protocol, 2023","work_id":"2054d990-9253-491b-b01d-1f23d15eec34","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":36,"snapshot_sha256":"236e2f21fc6936dfb7b0b7ac3095c0be380ac0a3595b37d93338e84eda289677","internal_anchors":12},"formal_canon":{"evidence_count":1,"snapshot_sha256":"3f85100ae17c2c1ab72b396de0adc1a89c6521682565e5b41862715d03fd9db0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2403.07718","created_at":"2026-05-17T23:38:53.769470+00:00"},{"alias_kind":"arxiv_version","alias_value":"2403.07718v5","created_at":"2026-05-17T23:38:53.769470+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2403.07718","created_at":"2026-05-17T23:38:53.769470+00:00"},{"alias_kind":"pith_short_12","alias_value":"B6N5RPC67O33","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"B6N5RPC67O33FJY2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"B6N5RPC6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2605.17637","citing_title":"WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2603.14987","citing_title":"Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14678","citing_title":"$\\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14102","citing_title":"ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15777","citing_title":"SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16679","citing_title":"CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17637","citing_title":"WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2406.12373","citing_title":"WebCanvas: Benchmarking Web Agents in Online Environments","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17829","citing_title":"Interactive Evaluation Requires a Design Science","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19769","citing_title":"OpenComputer: Verifiable Software Worlds for Computer-Use Agents","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14678","citing_title":"$\\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.14528","citing_title":"Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2412.04454","citing_title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01847","citing_title":"NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14102","citing_title":"ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2409.07429","citing_title":"Agent Workflow Memory","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14573","citing_title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11882","citing_title":"On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07972","citing_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28139","citing_title":"Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10912","citing_title":"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08761","citing_title":"Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10448","citing_title":"Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23781","citing_title":"ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06365","citing_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","ref_index":41,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO","json":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO.json","graph_json":"https://pith.science/api/pith-number/B6N5RPC67O33FJY2I5FPMYLZZO/graph.json","events_json":"https://pith.science/api/pith-number/B6N5RPC67O33FJY2I5FPMYLZZO/events.json","paper":"https://pith.science/paper/B6N5RPC6"},"agent_actions":{"view_html":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO","download_json":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO.json","view_paper":"https://pith.science/paper/B6N5RPC6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2403.07718&json=true","fetch_graph":"https://pith.science/api/pith-number/B6N5RPC67O33FJY2I5FPMYLZZO/graph.json","fetch_events":"https://pith.science/api/pith-number/B6N5RPC67O33FJY2I5FPMYLZZO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO/action/storage_attestation","attest_author":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO/action/author_attestation","sign_citation":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO/action/citation_signature","submit_replication":"https://pith.science/pith/B6N5RPC67O33FJY2I5FPMYLZZO/action/replication_record"}},"created_at":"2026-05-17T23:38:53.769470+00:00","updated_at":"2026-05-17T23:38:53.769470+00:00"}