{"total":30,"items":[{"citing_arxiv_id":"2606.06708","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Signal-Driven Observation for Long-Horizon Web Agents","primary_cat":"cs.CL","submitted_at":"2026-06-04T20:48:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25343","ref_index":126,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Toward Native Multimodal Modeling: A Roadmap","primary_cat":"cs.CV","submitted_at":"2026-05-25T01:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19538","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18048","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DocOS: Towards Proactive Document-Guided Actions in GUI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-18T08:36:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16116","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-15T16:00:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11212","ref_index":6,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","primary_cat":"cs.CL","submitted_at":"2026-05-11T20:27:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10966","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMTB: Evaluating Terminal Agents on Multimedia-File Tasks","primary_cat":"cs.MM","submitted_at":"2026-05-08T10:57:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"increase the required time and discard important information during conversion. Quantitatively, we observe that the standalone terminal agent baseline, Codex CLI with GPT-5.2, solves only 16.2% of MMTB tasks, revealing the limitations of conventional terminal agents on multimedia-file tasks. To address these limitations, we introduce TERMINUS-MM, a multimedia terminal-agent harness that extends Terminus-KIRA [19] with audio and video perception. In addition, TERMINUS-MM adapts its perception interface to each workspace by exposing tools matched to the available multimedia files. Using this workspace-aware design, we compare audio-only, video-only, and combined audio-video access to analyze how different forms of multimedia perception affect task outcomes and which"},{"citing_arxiv_id":"2605.06365","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work","primary_cat":"cs.AI","submitted_at":"2026-05-07T14:39:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"prove how a single pipeline is programmed, but they generally do not make dependency materialization, replay iden- tity, or partial downstream invalidation the organizing abstraction of the whole system. 2.7 Benchmarks and Reproducible Agent Environments Another relevant literature studies how to evaluate agents in realistic yet reproducible environments. Benchmarks such as AgentBench [38], WebArena [39], VisualWebArena [40], WorkArena [41], AndroidWorld [42], OSWorld [43], AppWorld [44], GAIA [45], and LifelongAgentBench [46] increasingly evaluate agents in stateful, tool-using settings rather than static prompts. Recent 2025-2026 benchmark work sharpens the memory and persistence angle in particular. MemoryAgentBench [47], Evo-Memory [48], and Mem2ActBench [49] all argue that static one-shot evaluation misses the harder problem"},{"citing_arxiv_id":"2605.06707","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking","primary_cat":"cs.SE","submitted_at":"2026-05-06T18:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Claude outperformed other LLM families in generating functional single-file HTML under fixed public conditions, but neither technical variables nor prompt details reliably predicted 24-hour social media impressions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04777","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents","primary_cat":"cs.MA","submitted_at":"2026-05-06T11:30:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21375","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tablished a complementary Windows-only suite showing a similar gap. More re- cent benchmarks target specific domains or platforms, such as Spider2-V [13] for 4 Q. Han, H. Tu et al. enterprise data-science workflows, ScreenSpot [18] for visual grounding, and ma- cOSWorld [72] for macOS-specific tasks. Parallel efforts extend evaluation to mo- bile [15,21,50,51] and web settings [19,22-24,34,43,54,75,78,83], building upon classic web-interaction benchmarks [41,46,52]. Beyond task-completion bench- marks, recent work evaluates multimodal model robustness and reliability more broadly, including safety and attribute evaluations under out-of-distribution vi- sual inputs [17,37,58], vision-language reward and reinforce learning [16,59]. Initial results across these benchmarks consistently fall far behind human ex-"},{"citing_arxiv_id":"2604.18543","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:36:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14262","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models","primary_cat":"cs.LG","submitted_at":"2026-04-15T16:39:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"GUI-Perturbed† is web-only; cross-platform extension is future work. Data Paradigm Dataset Annotation Source Platform Size (Base Tasks) Scene Variability Variations per Task Fixed-scene OSWorld Human Desktop, Web, Mobile 369 Fixed -ScreenSpot-v2 Human Desktop, Web, Mobile 1,272 Fixed -ScreenSpot-Pro Human Desktop 1,585 Fixed -Mind2Web-2 [12] Human Web 130 Fixed -VisualWebArena [13] Human Web 910 Fixed -MiniWoB++ [14] Programmatic Web (simulated) 100+ Fixed -OSWorld-G [15] Human + LLM Desktop 564 Fixed - Live-scene Online-Mind2Web Human Web 300 Live - Perturbation-basedGUI-Robust Human + MLLM Desktop, Web 5,318 Perturbed Single-axis (Anomalies)WorldGUI Human Desktop 315 Perturbed Single-axis (Initial state)GUI-Perturbed† Programmatic Web 3,120 (390×8) Systematic Multi-axis (visual & inst."},{"citing_arxiv_id":"2604.08516","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Evaluating web agents is challenging. Early evaluation work focused on sandboxed web environments [7, 59-62], desktop environments [63], and multi-turn dialogue navigation datasets [64] where the answer is known or verifiable using oracle knowledge of environment state. Recently, several 13 benchmarks have proposed evaluating on live websites. While some use automatic verifiers [65, 66] or simple text answers that are unlikely to change over time [67], other use a VLM-as-a-judge to verify task completion correctness [20, 23, 24, 68]. A VLM-judge (typically a frontier model such as GPT-4o [69]) takes the instruction, screenshots, and the final answer produced by the agent, along with a prompt specifying the success criteria, and outputs a success or failure decision, along with a rationale for that decision."},{"citing_arxiv_id":"2603.21362","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning","primary_cat":"cs.AI","submitted_at":"2026-03-22T18:47:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downstream task success by 6.8-8.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05044","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","primary_cat":"cs.AI","submitted_at":"2026-03-05T10:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.04601","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development","primary_cat":"cs.SE","submitted_at":"2026-03-04T21:00:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09571","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tuning Qwen2.5-VL to Improve Its Web Interaction Skills","primary_cat":"cs.HC","submitted_at":"2026-02-20T13:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Two-stage fine-tuning of Qwen2.5-VL-32B improves success rates on single-click web tasks from 86% to 94%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12538","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Reasoning for Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-01-18T18:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"constraints, and feedback loops, motivating diverse system designs [43, 44] that integrate planning, tool use, search, reflection, memory mechanisms, and multi-agent coordination. On the other hand, the benchmark landscape has emerged to evaluate agentic reasoning, ranging from targeted tests that isolate individual agentic capabilities to application-specific benchmarks that assess end-to-end behavior in domain-specific environments and scenarios [45, 46, 47, 48, 20, 21, 49, 50]. Together, this survey synthesizes agentic reasoning methods into a unified roadmap that bridges reasoning and acting. We systematically characterize these methods across the complementary scopes of foundational, self-evolving, and collective reasoning, while distinguishing between in-context and post-training optimiza- tion modes. We further contextualize this roadmap through representative applications and evaluation"},{"citing_arxiv_id":"2510.13727","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails","primary_cat":"cs.AI","submitted_at":"2025-10-15T16:30:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10073","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents","primary_cat":"cs.CR","submitted_at":"2025-10-11T07:18:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02387","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments","primary_cat":"cs.AI","submitted_at":"2025-06-03T02:57:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vision Language Models (VLMs) have recently unlocked impressive capabilities in open-world per- ception, multimodal reasoning, and interactive problem-solving [5, 39, 89]. Driven by these advance- ments, evaluations of VLMs have progressed beyond static tasks such as image captioning [15] and visual reasoning [3, 85] toward dynamic agent benchmarks including software engineering [13, 82], computer use [30, 80], game environments [75, 87], and embodied control [25, 68, 83]. However, existing VLM benchmarks mainly focus on single-agent settings, where one agent reasons and acts in isolation. The real world, by contrast, is inherently a multi-agent environment that involves cooperation, competition, and mixed-motive interactions between agents [ 20, 77]."},{"citing_arxiv_id":"2505.23678","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounded Reinforcement Learning for Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:20:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19662","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks","primary_cat":"cs.AI","submitted_at":"2025-05-26T08:21:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16120","ref_index":108,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM-Powered AI Agent Systems and Their Applications in Industry","primary_cat":"cs.AI","submitted_at":"2025-05-22T01:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"agents that matter,\"arXiv preprint arXiv:2407.01502, 2024. [106] Y . Wu, X. Tang, T. M. Mitchell, and Y . Li, \"Smartplay: A benchmark for llms as intelligent agents,\"arXiv preprint arXiv:2310.01557, 2023. [107] G. Mialon, C. Fourrier, T. Wolf, Y . LeCun, and T. Scialom, \"Gaia: a benchmark for general ai assistants,\" inThe Twelfth International Conference on Learning Representations, 2023. [108] J. Y . Koh, R. Lo, L. Jang, V . Duvvur, M. C. Lim, P.-Y . Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, \"Visualwebarena: Evaluating multimodal agents on realistic visual web tasks,\"arXiv preprint arXiv:2401.13649, 2024. [109] X. H. L `u, Z. Kasner, and S. Reddy, \"Weblinx: Real-world website navigation with multi-turn dialogue,\"arXiv preprint arXiv:2402."},{"citing_arxiv_id":"2503.09572","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","primary_cat":"cs.CL","submitted_at":"2025-03-12T17:40:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.10188","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LongVILA: Scaling Long-Context Visual Language Models for Long Videos","primary_cat":"cs.CV","submitted_at":"2024-08-19T17:48:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.12373","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebCanvas: Benchmarking Web Agents in Online Environments","primary_cat":"cs.CL","submitted_at":"2024-06-18T07:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14573","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2024-05-23T13:48:54+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"MIND2WEB[9] 2350 ✗ - ✓ ✗ ✓ 0 WEBLINX [33] 2337 ✗ - ✓ ✗ ✓ 0 PIXELHELP[27] 187 ✗ - ✓ ✗ ✗ 0 METAGUI [47] 1125 ✗ - ✓ ✗ ✗ 0 AITW [40] 30 k ✗ - ✓ ✗ ✓ 0 OMNIACT[21] 9802 ✗ - ✓ ✗ ✓ 0 AGENTBENCH[32] 1091 Multi-isolated ✗ ✗ ✗ ✗ 7 INTERCODE[57] 1350 (3) Code ✗ ✗ ✗ ✗ 3 MINIWOB++ [30] 125 Web ✗ ✓ ✗ ✗ 125 WEBSHOP[58] 12 k(1) Web ✗ ✓ ✗ ✗ 1 WEBARENA[66] 812 (241) Web ✗ ✓ ✗ ✗ 5 VWEBARENA[22] 910 (314) Web ✗ ✓ ✗ ✗ 6 WORKARENA[10] 23 k(29) Web ✗ ✓ ✗ ✓ 7 WIKIHOW[61] 150 (16) Mobile ✗ ✓ ✗ ✗ 16 ASSISTGUI [13] 100 ✗ ✗ ✓ ✗ ✓ 2 OSWORLD 369 Computer ✓ ✓ ✓ ✓ 134 0 100 200 300 400 500 600 700 800 900Human Operation Time (s) Ours median: 111.94s WebArena median: 35.38s WebArena Ours 30 40 50 60 70 80 90Accuracy (%) Figure 4: Human operation time and accuracy on"}],"limit":50,"offset":0}