{"total":32,"items":[{"citing_arxiv_id":"2605.28607","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution","primary_cat":"cs.AI","submitted_at":"2026-05-27T15:23:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multimodal multi-agent system constructs a fixed topological knowledge base offline from logs and applies adaptive RAG with collaborative verification for automatic workflow execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18652","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:57:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16565","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Skim: Speculative Execution for Fast and Efficient Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-15T19:12:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14311","ref_index":130,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-14T03:23:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14290","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Web Agents Should Adopt the Plan-Then-Execute Paradigm","primary_cat":"cs.CR","submitted_at":"2026-05-14T02:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13527","ref_index":37,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMSkills: Towards Multimodal Skills for General Visual 
Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:40:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12755","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"State-Centric Decision Process","primary_cat":"cs.AI","submitted_at":"2026-05-12T21:09:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 3: Results on AssistantBench [56]. We report overall Accuracy, Precision, Exact Match (EM), and Accuracy stratified by difficulty (Easy/Medium/Hard). Accuracy by Difficulty Method LLM Accuracy Precision EM Easy Medium Hard Infogent [32] GPT-4o 14.5 20.4 5.5 63.9 19.3 8.4 RALM-1S→CB [42] GPT-4T 19.5 21.0 6.1 81.3 35.0 7.3 CB-1S [30] GPT-4T 22.2 24.8 8.3 67.8 49.7 4.2 SeeAct→CB [61] GPT-4T 23.4 26.1 9.4 82.0 47.7 7.1 SPA-CB [56] GPT-4T 25.2 27.5 9.9 80.7 42.7 12.4 Magentic-One [12] GPT-4o 25.3 25.3 11.0 69.9 35.616.9 ACP-Domain Agents [5] GPT-4o 28.3 30.0 11.0 67.8 48.5 15.5 SDP (Ours)GPT-4o31.8 32.0 14.9 92.8 54.916.0 ScienceWorld.Table 4 reports results on ScienceWorld [46]. 
Among training-free methods on the standard 30-task GPT-4 protocol, SDP achieves the highest overall score (59."},{"citing_arxiv_id":"2605.12501","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025. [26] Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026. [27] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024. 11 [28] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 
Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v.arXiv preprint arXiv:2310.11441, 2023."},{"citing_arxiv_id":"2605.11212","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","primary_cat":"cs.CL","submitted_at":"2026-05-11T20:27:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06761","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Weblica: Scalable and Reproducible Training Environments for Visual Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Since early VLMs had limited grounding capabilities, initial approaches augmented screenshots with set-of-marks [43] overlays. These overlay numbered bounding boxes on interactive elements to simplify action prediction. However, this introduces dependencies on accu- rate element detection and adds visual clutter that does not reflect natural web perception [48]. More recent work removes these aids entirely, building agents that operate on raw screenshots and predict actions as pixel coordinates [28, 34, 1, 5, 9, 24, 8]. We follow this direction and train visual web agents with screenshot input and coordinate-based actions. 
Data and Environments for Web Agents. Several efforts collect supervised fine-tuning (SFT) trajectories through"},{"citing_arxiv_id":"2605.05509","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WAAA! Web Adversaries Against Agentic Browsers","primary_cat":"cs.CR","submitted_at":"2026-05-06T23:19:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26148","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations","primary_cat":"cs.HC","submitted_at":"2026-04-28T22:15:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23772","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PageGuide: Browser extension to assist users in navigating a webpage and locating information","primary_cat":"cs.HC","submitted_at":"2026-04-26T15:49:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user 
study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21375","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"perts, revealing failure patterns that motivate the design of VLAA-GUI. 2.2 GUI Agents: Models and Frameworks. End-to-end models trained for GUI interaction-UI-TARS [49], AGUVIS [68], ShowUI [40], CogAgent [33], OS-Atlas [66], among others [65,74]-achieve strong grounding without HTML or accessibility trees. Screen-based agents [9,45,56, 62] further explore pixel-space control, and web agents [30,32,81] investigate long-horizon decision making in browser environments. Frontier providers have followed with commercial APIs: Claude Computer Use [4], OpenAI CUA [47], and Seed [12], the latter serving as our dedicated grounding model. 
Complementary modular frameworks compose MLLMs with planning, memory, and tools. The Agent S family [1,2] couples hierarchical planning with experi-"},{"citing_arxiv_id":"2604.09781","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents","primary_cat":"cs.CV","submitted_at":"2026-04-10T18:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09442","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UIPress: Bringing Optical Token Compression to UI-to-Code Generation","primary_cat":"cs.CL","submitted_at":"2026-04-10T15:58:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08523","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawBench: Can AI Agents Complete Everyday Online Tasks?","primary_cat":"cs.CL","submitted_at":"2026-04-09T17:57:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 
33.3% for Claude Sonnet 4.6.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04399","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis","primary_cat":"cs.AI","submitted_at":"2026-04-06T03:58:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18758","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments","primary_cat":"cs.HC","submitted_at":"2026-04-03T08:57:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OmniGUI is the first step-level benchmark supplying interleaved image, audio, and video inputs across 709 expert episodes in 29 smartphone apps to evaluate multimodal GUI agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05295","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces","primary_cat":"cs.AI","submitted_at":"2026-03-05T15:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web 
agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.22942","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClawMobile: Rethinking Smartphone-Native Agentic Systems","primary_cat":"cs.MA","submitted_at":"2026-02-26T12:34:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.15832","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains","primary_cat":"cs.CL","submitted_at":"2025-08-18T21:58:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper proposes Amazon-Bench, a functionality-grounded benchmark for web agents in e-commerce that generates diverse task queries from webpage elements and evaluates both task performance and safety risks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04227","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mobile GUI Agents under Real-world Threats: Are We There Yet?","primary_cat":"cs.CR","submitted_at":"2025-07-06T03:31:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in 
dynamic and static tests respectively.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12382","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Exploring the Secondary Risks of Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-06-14T07:31:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02387","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments","primary_cat":"cs.AI","submitted_at":"2025-06-03T02:57:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Proagent: building proactive cooperative agents with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17591-17599, 2024. [87] Chi Zhang, Penglin Cai, Yuhui Fu, Haoqi Yuan, and Zongqing Lu. Creative agents: Empowering agents with imagination for creative tasks. arXiv preprint arXiv:2312.02519, 2023. [88] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 
Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024. [89] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models."},{"citing_arxiv_id":"2505.10978","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Group-in-Group Policy Optimization for LLM Agent Training","primary_cat":"cs.LG","submitted_at":"2025-05-16T08:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLMs for embodied decision making.Advances in Neural Information Processing Systems, 37:100428-100534, 2024. [7] Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixi- ang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. InThe Twelfth International Conference on Learning Representations, 2024. [8] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V (ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024. [9] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. 
InThe Thirteenth International Conference on Learning Representations, 2025."},{"citing_arxiv_id":"2411.18279","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 3 This new paradigm enables users to control general software systems with conversational commands [16]. By reducing the cognitive load of multi-step GUI operations, LLM-powered agents make complex systems accessible to non-technical users and streamline workflows across diverse domains. Notable examples include SeeAct [17] for web navi- gation, AppAgent [18] for mobile interactions, and UFO [19] for Windows OS applications. These agents resemble a \"virtual assistant\" [20] akin to J.A.R.V.I.S. from Iron Man-an intuitive, adaptive system capable of understanding user goals and autonomously performing actions across applications. 
The futuristic concept of an AI-powered operating system that"},{"citing_arxiv_id":"2406.12373","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebCanvas: Benchmarking Web Agents in Online Environments","primary_cat":"cs.CL","submitted_at":"2024-06-18T07:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"reproducible assessment within the OSW ORLD environment, 9 authors with computer science backgrounds carefully annotate each example with an initial state setup configuration to simulate human work in progress and a custom execution-based evaluation script to verify task completion. Our benchmark has a total of 134 unique evaluation functions, which are orders of magnitude larger than prior work [66], showcasing the complexity, diversity, and evaluation challenges of tasks in our benchmark. 
The human performance study indicates that task examples from OSW ORLD are more time-consuming and challenging compared to those in prior work. We extensively evaluate state-of-the-art LLM and VLM-based agent baselines, including the GPT-4V series [39], the Gemini series [ 49, 41], the Claude-3 Opus [ 3] and the Qwen-Max [ 5], as well as"},{"citing_arxiv_id":"2401.16158","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception","primary_cat":"cs.CL","submitted_at":"2024-01-29T13:46:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on the introduced Mobile-Eval benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10935","ref_index":109,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05459","ref_index":108,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and 
Security","primary_cat":"cs.HC","submitted_at":"2024-01-10T09:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a growing number of projects employed visual language agents for UI action grounding and navigation [102, 103]. One trend involves leveraging powerful LMMs such as GPT-4V to comprehend GUIs and select UI elements [104, 105, 106, 107]. Another line of research is to customize open-sourced LMMs by fine-tuning on large-scale datasets for GUI-related tasks [108, 109, 110]. 16 While UI-based task automation has the potential to achieve a more flexible personal agent framework compared to API-based automation, its research is still in the early stages. It remains challenging to accomplish more complex user commands. Besides, the privacy and security issues have not been fully addressed [94, 99]. It also remains controversial"}],"limit":50,"offset":0}