{"total":24,"items":[{"citing_arxiv_id":"2605.19538","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:38:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents CaptchaBench benchmark and CaptchaMind RL solver achieving 82.9% success on benchmark tasks and 71% on real-world CAPTCHAs via explicit reasoning process supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16883","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SE-GA: Memory-Augmented Self-Evolution for GUI Agents","primary_cat":"cs.LG","submitted_at":"2026-05-16T08:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15963","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control","primary_cat":"cs.AI","submitted_at":"2026-05-15T13:55:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 
problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14311","ref_index":93,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-14T03:23:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12501","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Model Date SS-pro UI-V∆ CUActSpot GUI Text Table Canvas Image Overall Phi-Ground-4B-16C†[12] 2025-07 38.0 24.5 13.5 5.3 6.2 6.2 4.7 2.4 5.0 Uground-V1-2B∗[10] 2024-10 27.1 12.8 14.3 10.5 0.0 9.4 6.2 0.0 5.2 Uground-V1-7B∗[10] 2024-10 31.1 12.9 18.2 18.4 0.0 3.1 9.4 2.4 6.7 OS-Atlas-Base-7B∗[11] 2024-10 18.9 9.0 9.9 15.8 0.0 12.5 10.9 0.0 7.8 InfiGUI-R1-3B [18] 2025-04 45.2 22.0 23.2 23.7 3.1 9.4 7.8 0.0 8.8 UI-Venus-Ground-7B [19] 2025-08 50.8 26.5 24.3 23.7 3.1 18.8 9.4 0.0 11.0 GUI-G2-7B [20] 2025-07 47.5 26.4 21.1 23.7 6.2 15.6 7.8 4.8 11.6 MAI-UI-2B†[22] 2025-12 
57.4 30.3 27.1 18.4 3.1 18.8 12.5 9.5 12.5 GUI-Owl-1.5-8B-Think [23] 2026-02 57.6 33.2 24.4 23.7 9.4 18.8 10.9 7.1 14.0 MAI-UI-8B†[22] 2025-12 65."},{"citing_arxiv_id":"2605.10347","ref_index":35,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Mobile World Model Guides GUI Agents?","primary_cat":"cs.AI","submitted_at":"2026-05-11T10:49:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"World models trained on delta text, full text, diffusion images, and renderable code achieve SoTA on two benchmarks and improve downstream GUI agent performance on three mobile datasets with modality-specific strengths.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[33] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025. [34] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025. [35] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025. 
[36] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al."},{"citing_arxiv_id":"2605.07505","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-08T09:38:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Ground-truth Guidance Variants Let A∗ t denote the human-verified valid action set for the current GUI state. The teacher-side guidance gt is constructed fromA ∗ t . For Single-GT Guided OPD, one valid action is randomly sampled: gt =a ∗(k) t , k∼Uniform{1, . . . , K}.(20) For Multi-GT Guided OPD, the full valid action set is provided: gt =A ∗ t .(21) For Guided OPD, the guidance is selected according to: gt =a † t = arg max a∗∈A∗ t S(ˆat, a∗).(22) All three variants use the same student inputx t and differ only in the privileged teacher context. A.2 Action Matching and Multi-solution Reward Details We define the base GUI action matcher ϕgui(ˆyt, a∗) as a normalized score in [0,1] . 
If the model"},{"citing_arxiv_id":"2605.06664","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BAMI: Training-Free Bias Mitigation in GUI Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-07T17:59:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Given the fine-grained nature of GUI localization, instruction fine-tuning alone is often insufficient for achieving high precision. DeepSeek-R1 [8] introduced the GRPO method, demonstrating the potential of reinforcement learning in enhancing spatial reasoning for GUI grounding tasks. Following this, UI-R1 [18] and GUI-R1 [20] were among the first to apply GRPO in GUI tasks. InfiGUI-R1 [16] focused on reward function design, emphasizing IoU-based metrics to improve localization accuracy. GUI-G1 [37] introduced box-attribute constraints to regulate bounding-box geometry, while GUI-G2 [26] modeled spatial distributions using Gaussian functions. TianXi-Action [27] focused on generating high-quality reinforcement learning data. 
Collectively,"},{"citing_arxiv_id":"2605.06534","ref_index":39,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","primary_cat":"cs.DC","submitted_at":"2026-05-07T16:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"side of the weight transfer engine are implemented inside ROLL. ROSE uses Megatron-LM [57] for training, vLLM for rollout/serving with request migration, and ROLL's native environment runtime to manage environments. Serving engine. The pull side of the transfer engine and the co-serving executor are built atop vLLM 0.10.0 [28]. We also implement a Ray-based [39] load-balancing scheduling policy to route serving requests. Relay worker. We use Mooncake v0.3.8 in the relay worker, allowing ROLL to publish updated weights and vLLM to pull them. We extend Mooncake with shard awareness and sparsity awareness to reduce communication overhead. 
Environment runtime.Environments run in CPU-only containers on a separate Kubernetes cluster, communicating"},{"citing_arxiv_id":"2605.02630","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-04T14:18:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"615.3 7.5 13.510.3 2.2 6.610.8 2.67.7CogAgent-18B [11]14.9 0.7 8.09.6 0.0 5.67.1 3.1 6.122.2 1.8 13.413.0 0.0 6.55.6 0.0 3.112.0 0.87.7Aria-UI [38] 16.2 0.0 8.423.7 2.1 14.77.6 1.6 6.127.1 6.4 18.120.3 1.9 16.14.7 0.0 2.617.1 2.011.3OS-Atlas-7B [35]33.1 1.4 17.728.8 2.8 17.912.2 4.7 10.337.5 7.3 24.433.9 5.7 27.427.1 4.5 16.828.1 4.018.9InfiGUI-R1-3B [20]51.3 12.4 32.444.9 7.0 29.033.0 14.1 28.458.3 20.0 41.765.5 28.3 57.043.9 12.4 29.649.1 14.135.7InfiGUI-G1-3B [21]64.9 20.0 -51.5 16.8 -50.8 25.0 -68.8 32.7 -70.6 32.1 -49.5 15.7 -- -45.2SE-GUI-3B [44]55.8 7.6 35.147.0 4.9 29.038.1 12.5 31.861.8 16.4 43.359.9 24.5 50.940.2 12.4 25.550.4 11.835.9Jedi-3B [36] 61.0 13.8 38.153.5 8.4 34.627.4 9.4 23.054."},{"citing_arxiv_id":"2605.00642","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding","primary_cat":"cs.AI","submitted_at":"2026-05-01T13:23:26+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive 
OPSD on six GUI grounding benchmarks while improving training efficiency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Autonomous GUI agents have emerged as a promising direction for human-computer interaction, where GUI grounding serves as the fundamental capability of mapping natural language instructions to visual coordinates of target elements [4, 7]. To this end, a growing body of work [2, 6, 39, 51] has adopted reinforcement learning for GUI grounding, among which GRPO-based methods [ 19, 55] have become the dominant paradigm as shown in Figure 1(a). Specifically, given a user instruction, GRPO [8, 27] encourages the policy model to explore diverse solutions by sampling multiple rollouts, and evaluates each with a designed verifiable reward, such as binary [19], distance-constrained [46], or gaussian-based feedback [ 32]. The advantage of each rollout is then computed relative to the"},{"citing_arxiv_id":"2604.27859","ref_index":51,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Agentic Reinforcement Learning In Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-04-30T13:43:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24348","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS 
Agents","primary_cat":"cs.CL","submitted_at":"2026-04-27T11:44:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"systems (computers, smartphones, and tablets) by performing actions such as clicks, swipes, and text input based on graphical user interfaces, in response to user instructions. Existing works have approached the construction of OS agents through various methods, including pre-training [19], [20], mid-training [21], [22], supervised fine-tuning [23], [24], reinforcement learning [25]-[28], prompt engineering [29], and multi-agent systems [30], [31]. These approaches have enhanced the OS agents' capabilities in grounding, reasoning, and task completion from different perspectives. 
However, in order to evolve OS agents from mere tools to trustworthy partners, it is essential to consider not only their task completion performance but also their safety [32]-"},{"citing_arxiv_id":"2604.22558","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-24T13:53:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13019","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PrecisionCUA: Iterative Visual Refinement for Pixel-Precise Cursor Grounding in Code Editors","primary_cat":"cs.CV","submitted_at":"2026-04-14T17:55:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07831","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Are GUI Agents Focused Enough? 
Automated Distraction via Semantic-level UI Element Injection","primary_cat":"cs.CR","submitted_at":"2026-04-09T05:32:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.24168","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MGA: Memory-Driven GUI Agent for Observation-Centric Interaction","primary_cat":"cs.AI","submitted_at":"2025-10-28T08:19:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21982","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RISK: A Framework for GUI Agents in E-commerce Risk Management","primary_cat":"cs.AI","submitted_at":"2025-09-26T07:05:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21816","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Task to 
Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation","primary_cat":"cs.SE","submitted_at":"2025-09-26T03:21:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.07553","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents","primary_cat":"cs.CL","submitted_at":"2025-09-09T09:46:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.19679","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning","primary_cat":"cs.AI","submitted_at":"2025-08-27T08:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.21046","ref_index":246,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence","primary_cat":"cs.AI","submitted_at":"2025-07-28T17:59:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.05791","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GTA1: GUI Test-time Scaling Agent","primary_cat":"cs.AI","submitted_at":"2025-07-08T08:52:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.20332","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training","primary_cat":"cs.AI","submitted_at":"2025-06-25T11:34:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile agents, plus a new Chinese GUI dataset 
and benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}