{"total":17,"items":[{"citing_arxiv_id":"2606.29537","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks","primary_cat":"cs.AI","submitted_at":"2026-06-28T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20724","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration","primary_cat":"cs.AI","submitted_at":"2026-06-16T23:00:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Parallel WebBench reveals GRPO training raises web agent completion to 96% but leaves a large correctness gap from context-bound loops, premature termination, and synthesis collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":127,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"motion quality, caption recaptioning. Audio & Speech Generation LibriTTS [ 116], VCTK [ 117], Gi- gaSpeech [118], Emilia [119], Audio- Caps [ 120], WavCaps [ 121], Music- Caps [122] T, A (Sp) Text-to-speech, voice cloning, music and environmental sound generation. Interact Web Interaction WebShop [123], Mind2Web [124], We- bArena [125], VisualWebArena [126], WebLINX [127], WebV oyager [128] T, I (GUI) Goal-driven web navigation: searching, clicking, form filling on real/simulated websites. Mobile & Desktop GUI AITW [ 129], RICO [ 130], ScreenAI [ 131], SeeClick [ 132], OSWorld [ 133], Windows Agent Arena [134] T, I (GUI) Screenshot/UI-tree to action (tap, type, drag); covers mobile and OS environments. Embodied Interaction ALFWorld [ 135], BridgeData"},{"citing_arxiv_id":"2605.20291","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection","primary_cat":"cs.LG","submitted_at":"2026-05-19T09:19:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weasel is a trajectory selection method that improves out-of-domain generalization for web agents while achieving 9.7-12.5x training speedups via importance-diversity optimization, AXTree pruning, and rationale style matching.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16565","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skim: Speculative Execution for Fast and Efficient Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-15T19:12:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13292","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages","primary_cat":"cs.CL","submitted_at":"2026-05-13T10:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A parallel multi-turn medical dialogue dataset spanning English and nine Indic languages is created from synthetic consultations to enable personalized AI healthcare interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21375","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tablished a complementary Windows-only suite showing a similar gap. More re- cent benchmarks target specific domains or platforms, such as Spider2-V [13] for 4 Q. Han, H. Tu et al. enterprise data-science workflows, ScreenSpot [18] for visual grounding, and ma- cOSWorld [72] for macOS-specific tasks. Parallel efforts extend evaluation to mo- bile [15,21,50,51] and web settings [19,22-24,34,43,54,75,78,83], building upon classic web-interaction benchmarks [41,46,52]. Beyond task-completion bench- marks, recent work evaluates multimodal model robustness and reliability more broadly, including safety and attribute evaluations under out-of-distribution vi- sual inputs [17,37,58], vision-language reward and reinforce learning [16,59]. Initial results across these benchmarks consistently fall far behind human ex-"},{"citing_arxiv_id":"2604.19905","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViBR: Automated Bug Replay from Video-based Reports using Vision-Language Models","primary_cat":"cs.SE","submitted_at":"2026-04-21T18:28:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ViBR reproduces 72% of bugs from video reports by segmenting actions with CLIP similarity and using VLMs for region-aware GUI state comparison, outperforming prior heuristics-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18543","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:36:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08516","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"web environments [7, 59-62], desktop environments [63], and multi-turn dialogue navigation datasets [64] where the answer is known or verifiable using oracle knowledge of environment state. Recently, several 13 benchmarks have proposed evaluating on live websites. While some use automatic verifiers [65, 66] or simple text answers that are unlikely to change over time [67], other use a VLM-as-a-judge to verify task completion correctness [20, 23, 24, 68]. A VLM-judge (typically a frontier model such as GPT-4o [69]) takes the instruction, screenshots, and the final answer produced by the agent, along with a prompt specifying the success criteria, and outputs a success or failure decision, along with a rationale for that decision."},{"citing_arxiv_id":"2603.05295","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces","primary_cat":"cs.AI","submitted_at":"2026-03-05T15:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.08136","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments","primary_cat":"cs.CL","submitted_at":"2025-06-09T18:39:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EconWebArena is a new benchmark with 360 curated economic tasks across 82 authoritative websites for evaluating multimodal web agents on navigation, grounding, and data extraction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16120","ref_index":109,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Powered AI Agent Systems and Their Applications in Industry","primary_cat":"cs.AI","submitted_at":"2025-05-22T01:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a benchmark for general ai assistants,\" inThe Twelfth International Conference on Learning Representations, 2023. [108] J. Y . Koh, R. Lo, L. Jang, V . Duvvur, M. C. Lim, P.-Y . Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, \"Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,\"arXiv preprint arXiv:2401.13649, 2024. [109] X. H. L `u, Z. Kasner, and S. Reddy, \"Weblinx: Real-world website navigation with multi-turn dialogue,\"arXiv preprint arXiv:2402.05930, 2024. [110] J. Xie, K. Zhang, J. Chen, T. Zhu, R. Lou, Y . Tian, Y . Xiao, and Y . Su, \"Travelplanner: A benchmark for real-world planning with language agents,\"arXiv preprint arXiv:2402.01622, 2024. [111] A. Yan, Z."},{"citing_arxiv_id":"2412.04454","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","primary_cat":"cs.CL","submitted_at":"2024-12-05T18:58:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.12373","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebCanvas: Benchmarking Web Agents in Online Environments","primary_cat":"cs.CL","submitted_at":"2024-06-18T07:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"tasks from an intermediate initial state (Intermediate Init. State), and the number of execution-based evaluation functions (# Exec.-based Eval. Func.). # Instances(# Templates) Control.Exec. Env.?EnvironmentScalability?MultimodalSupport? Cross-App? IntermediateInit. State? # Exec.-basedEval. Func. GAIA [36] 466 ✗ - ✗ ✗ ✗ 0 MIND2WEB[9] 2350 ✗ - ✓ ✗ ✓ 0 WEBLINX [33] 2337 ✗ - ✓ ✗ ✓ 0 PIXELHELP[27] 187 ✗ - ✓ ✗ ✗ 0 METAGUI [47] 1125 ✗ - ✓ ✗ ✗ 0 AITW [40] 30 k ✗ - ✓ ✗ ✓ 0 OMNIACT[21] 9802 ✗ - ✓ ✗ ✓ 0 AGENTBENCH[32] 1091 Multi-isolated ✗ ✗ ✗ ✗ 7 INTERCODE[57] 1350 (3) Code ✗ ✗ ✗ ✗ 3 MINIWOB++ [30] 125 Web ✗ ✓ ✗ ✗ 125 WEBSHOP[58] 12 k(1) Web ✗ ✓ ✗ ✗ 1 WEBARENA[66] 812 (241) Web ✗ ✓ ✗ ✗ 5 VWEBARENA[22] 910 (314) Web ✗ ✓ ✗ ✗ 6"},{"citing_arxiv_id":"2403.07718","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","primary_cat":"cs.LG","submitted_at":"2024-03-12T14:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}