{"total":15,"items":[{"citing_arxiv_id":"2606.29705","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots","primary_cat":"cs.AI","submitted_at":"2026-06-29T02:16:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15542","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding","primary_cat":"cs.AI","submitted_at":"2026-05-15T02:27:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without any training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27859","ref_index":108,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-30T13:43:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement 
Learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2504.08066 (2025). [107] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian J. McAuley, Zicheng Gao, Lijuan Liu, and Lijuan Wang. 2023. GPT-4V in wonderland: Large multimodal models for zero-shot smartphone gui navigation. CoRR abs/2311.07562 (2023). https://doi.org/10.48550/arXiv.2311.07562 [108] Chuanhao Yan, Fengdi Che, Xu Huang, Xu Xu, Xin Li, Yizhi Li, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, and et al. 2025a. Re: Form-reducing human priors in scalable formal software verification with rl in llms: A preliminary study on dafny. arXiv preprint arXiv:2507.16331 (2025a). [109] Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zonggen Li, Xiaowen Ma, Hinrich Schütze, Volker Tresp, and Yunpu Ma."},{"citing_arxiv_id":"2604.23941","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction","primary_cat":"cs.CV","submitted_at":"2026-04-27T01:29:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GoClick is a compact 230M-parameter encoder-decoder VLM for GUI element grounding that matches larger models' accuracy via a Progressive Data Refinement pipeline yielding a 3.8M-sample core set.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10371","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management","primary_cat":"cs.AI","submitted_at":"2025-12-11T07:37:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentProg reframes interaction history 
as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16120","ref_index":111,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Powered AI Agent Systems and Their Applications in Industry","primary_cat":"cs.AI","submitted_at":"2025-05-22T01:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"One promising approach is to create task-oriented metrics that assess performance based on goal achievement and interaction quality, such as task success rate, dialogue coherence, and response accuracy [106]-[110]. Additional, employing human-centric evaluation methods - such as user satisfaction surveys and feedback - can capture qualitative aspects that automated metrics might miss [111]. Another strategy is to develop simulation environments that mimic real-world tasks, allowing for controlled and reproducible testing of agent performance. Benchmarking competitions and shared datasets curated for multi-agent interactions [8] and real-world scenarios [112] can also help establish community standards. 
By combining quantitative metrics with qualitative assessments, the evaluation"},{"citing_arxiv_id":"2505.03364","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking","primary_cat":"cs.HC","submitted_at":"2025-05-06T09:37:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16150","ref_index":175,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[173] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. 2024. UFO: A UI-Focused Agent for Windows OS Interaction. https://doi.org/10.48550/arXiv.2402.07939 [174] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal Agents as Smartphone Users. 
https://doi.org/10.48550/arXiv.2312.13771 [175] Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, and Kai Yu. 2024. MobileEnv: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction. https://doi.org/10.48550/arXiv.2305.08144 [176] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang."},{"citing_arxiv_id":"2412.10345","ref_index":99,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies","primary_cat":"cs.RO","submitted_at":"2024-12-13T18:40:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AutoDroid [156] (2023), which combined LLMs with domain-specific knowledge for smartphone automation. Additional contributions like MM-Navigator [157] (2023), AppAgent [18] (2023), and Mobile-Agent [158] (2023) enabled refined control over smartphone applications. 
Research has continued to improve accuracy for mobile GUI automation through model fine-tuning [159], [160] (2024). 4.3.3 Computer Systems For desktop applications, UFO [19] (2024) was one of the first systems to leverage GPT-4 with visual capabilities to fulfill user commands in Windows environments. Cradle [161] (2024) extended these capabilities to software applications and games, while Wu et al., [162] (2024) provided interaction across diverse desktop applications, including web browsers,"},{"citing_arxiv_id":"2406.16173","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Crepe: A Mobile Screen Data Collector Using Graph Query","primary_cat":"cs.HC","submitted_at":"2024-06-23T17:53:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Crepe introduces a graph-query technique in a no-code Android app for flexible collection of targeted mobile screen data with privacy and consent controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"the-art LLM and VLM agent baselines on OSWorld benchmark, as well as their performance. 
4.1 LLM and VLM Agent Baselines We adopt state-of-the-art LLM and VLM from open-source representatives such as Mixtral [19], CogAgent [17] and Llama-3 [35], and closed-source ones from GPT, Gemini, Claude and Qwen families on OSWorld, to serve as the foundation of agent. We also explore methods such as the Set-of-Marks aided approach [56, 11], which has been demonstrated to improve spatial capabilities for visual reasoning. Our prior experiments following VisualWebArena [22] adopt few-shot prompting, 9 Table 5: Success rates of baseline LLM and VLM agents on OSWorld, grouped by task categories: OS, Office (LibreOffice Calc, Impress, Writer), Daily (Chrome, VLC Player, Thunderbird), Profes-"},{"citing_arxiv_id":"2401.10935","ref_index":102,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05459","ref_index":106,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","primary_cat":"cs.HC","submitted_at":"2024-01-10T09:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and 
security.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[104] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023. [105] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023. [106] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024. [107] Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, and Mike Zheng Shou. Assistgui: Task-oriented desktop"},{"citing_arxiv_id":"2312.13771","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AppAgent: Multimodal Agents as Smartphone Users","primary_cat":"cs.CV","submitted_at":"2023-12-21T11:52:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}