{"total":13,"items":[{"citing_arxiv_id":"2607.00007","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BaRA: Budget-constrained and Reliable Web Data Collection Agent","primary_cat":"cs.IR","submitted_at":"2026-05-02T08:09:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"BaRA improves valid link discovery and multimodal artifact extraction in budget-constrained web data collection via BFS liveness checks, rule-based validation, and self-reflection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26020","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Training Computer Use Agents to Assess the Usability of Graphical User Interfaces","primary_cat":"cs.CL","submitted_at":"2026-04-28T18:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We also describe how we use this approach to train uxCUA and our agent formulation. 4.1 Problem Formulation uxCUA is a CUA designed to perform GUI usability assessment in the same way humans do, through visual perception and clicking. While research has shown that having access to hidden semantic information like the HTML or DOM allows web agents [16, 24, 71] 2 We removed outliers based on z-score with 𝑧 > 3 in cases where the participant may have left the study tab open before coming back to complete a preference rating. 
Conduct a usability test of this website...you have a budget of 50 steps... CUA State Policy"},{"citing_arxiv_id":"2603.05044","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","primary_cat":"cs.AI","submitted_at":"2026-03-05T10:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.23883","ref_index":231,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","primary_cat":"cs.AI","submitted_at":"2025-10-27T21:48:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":", risks arising from autonomous action, tool use, and long-horizon interaction, rather than from static chat completion. We discuss these next. Web agent safety in enterprise contexts. ST-WebAgentBench is an online, enterprise-focused benchmark for testing whether web agents avoid unsafe actions (e.g., destructive operations in business systems) while pursuing goals [231]. 
Unlike legacy suites that only score end-task success, ST-WebAgentBench emphasizes trustworthiness under realistic web front ends (such as DevOps workflows, e-commerce, and enterprise CRM). The authors also propose novel evaluation metrics such as (1) Completion Under Policy (CuP) (task completions that adhere to acceptable policies) and (2) Risk Ratio (quantifies security breaches)."},{"citing_arxiv_id":"2510.22102","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mitigating Coordinate Prediction Bias from Positional Encoding Failures","primary_cat":"cs.CV","submitted_at":"2025-10-25T00:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VPSG corrects predictable directional coordinate biases in MLLMs by shuffling visual positional encodings to isolate unconditioned tendencies and steering digit decoding with a lightweight finite-state machine, yielding accuracy gains on ScreenSpot-Pro without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09572","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","primary_cat":"cs.CL","submitted_at":"2025-03-12T17:40:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":156,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model-Brained GUI Agents: A 
Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A key milestone was WebAgent [152] (2023), which, alongside WebGUM [153] (2023), pioneered real-world web navigation using LLMs. These advancements paved the way for further developments [17], [154], [155], utilizing more specialized LLMs to enhance web-based interactions. 4.3.2 Mobile Devices The integration of LLMs into mobile devices began with AutoDroid [156] (2023), which combined LLMs with domain-specific knowledge for smartphone automation. Additional contributions like MM-Navigator [157] (2023), AppAgent [18] (2023), and Mobile-Agent [158] (2023) enabled refined control over smartphone applications. 
Research has continued to improve accuracy for mobile GUI automation through model fine-tuning [159], [160] (2024)."},{"citing_arxiv_id":"2406.12373","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebCanvas: Benchmarking Web Agents in Online Environments","primary_cat":"cs.CL","submitted_at":"2024-06-18T07:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[10] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024. [11] D. Dupont. GPT-4V-Act: GPT-4 Variant for Active Learning. GitHub repository, 2023. URL https://github.com/ddupont808/GPT-4V-Act. [12] Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 
Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023. [13] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented desktop graphical"},{"citing_arxiv_id":"2403.07718","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","primary_cat":"cs.LG","submitted_at":"2024-03-12T14:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10935","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05459","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and 
Security","primary_cat":"cs.HC","submitted_at":"2024-01-10T09:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ChatGPT itself can be viewed as an intelligent personal assistant that assist users by returning information in text responses. Inspired by the capabilities of LLMs, researchers have attempted to let LLMs use tools [46] autonomously to accomplish complex tasks. For instance, such as controlling browsers [47, 48] for information retrieval and summarization, invoking robot programming interfaces for robot behavior control [49, 50, 51], and calling code interpreters for complex data processing [52, 53, 54, 55], among others. It is a natural idea to integrate these capabilities into intelligent personal assistants, enabling more intelligent ways to manipulate personal data, personal devices and personalized services. There are already some commercial products that have attempted to integrate LLM with IPA."},{"citing_arxiv_id":"2312.13771","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AppAgent: Multimodal Agents as Smartphone Users","primary_cat":"cs.CV","submitted_at":"2023-12-21T11:52:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}