{"total":13,"items":[{"citing_arxiv_id":"2606.12191","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application","primary_cat":"cs.CL","submitted_at":"2026-06-10T15:15:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26546","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration","primary_cat":"cs.AI","submitted_at":"2026-05-26T04:53:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16883","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SE-GA: Memory-Augmented Self-Evolution for GUI Agents","primary_cat":"cs.LG","submitted_at":"2026-05-16T08:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 
75.8% on AndroidControl-High.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13527","ref_index":25,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:40:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08526","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:17:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(26) For any predictive distribution π_tsk(Y | [g_ω(z); c*; B]), the conditional cross-entropy upper-bounds the conditional entropy: H(Y | z, c*) ≤ E_{(M,Y)∼p(·,·|c*), z∼q_θ(·|M,c*)}[−log π_tsk(Y | [g_ω(z); c*; B])]. (27) Substituting Equation (27) into Equation (26) gives I(z; Y | c*) ≥ H(Y | c*) + E_{(M,Y)∼p(·,·|c*), z∼q_θ(·|M,c*)}[log π_tsk(Y | [g_ω(z); c*; B])]. (28) Finally, combining Equations (25) and (28) with the definition of L_z in Equation (9), we obtain L_z = I((X,M); z | c*) − β_z I(z; Y | c*) ≤ E_{(M,Y)∼p(·,·|c*), z∼q_θ(·|M,c*)}[log(q_θ(z|M,c*) / r_ϕ(z|c*)) − β_z log π_tsk(Y | [g_ω(z); c*; B])] − β_z H(Y | c*) = J_z(θ,ϕ; c*) − β_z H(Y | c*), (29) which proves the claim.
"},{"citing_arxiv_id":"2605.05765","ref_index":12,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-07T06:58:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Describes X-OmniClaw, a multimodal mobile agent architecture using Omni Perception, Memory, and Action modules with behavior cloning for Android task execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26148","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations","primary_cat":"cs.HC","submitted_at":"2026-04-28T22:15:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21375","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","primary_cat":"cs.CL","submitted_at":"2026-04-23T07:42:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human 
performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tablished a complementary Windows-only suite showing a similar gap. More recent benchmarks target specific domains or platforms, such as Spider2-V [13] for enterprise data-science workflows, ScreenSpot [18] for visual grounding, and macOSWorld [72] for macOS-specific tasks. Parallel efforts extend evaluation to mobile [15,21,50,51] and web settings [19,22-24,34,43,54,75,78,83], building upon classic web-interaction benchmarks [41,46,52]. Beyond task-completion benchmarks, recent work evaluates multimodal model robustness and reliability more broadly, including safety and attribute evaluations under out-of-distribution visual inputs [17,37,58], vision-language reward and reinforce learning [16,59]."},{"citing_arxiv_id":"2507.04227","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mobile GUI Agents under Real-world Threats: Are We There Yet?","primary_cat":"cs.CR","submitted_at":"2025-07-06T03:31:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16150","ref_index":130,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets 
that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"S denotes the state and O the observation space, respectively. For example, 𝑜𝑡 could be a screenshot of the current screen, only showing the foreground application, whereas 𝑠𝑡 would encompass all running computer processes. Based on 𝑜𝑡 and instruction 𝑖, the ACU selects an action 𝑎𝑡 ∈ A (action space), such as a mouse click, keypress, or a higher-level command [130, 155]. In practice, ACUs often simplify observations, denoted 𝑜𝑡 → 𝑜∗𝑡, to reduce complexity by, for example, downscaling or cropping UI screenshots [16]. Besides using simplified observation, ACUs can also predict abstract actions. Such actions must be converted in a grounding process 𝑎∗𝑡 → 𝑎𝑡 from abstract actions 𝑎∗𝑡 into executable actions 𝑎𝑡 ∈ A."},{"citing_arxiv_id":"2405.14573","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2024-05-23T13:48:54+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.07972","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer 
Environments","primary_cat":"cs.AI","submitted_at":"2024-04-11T17:56:05+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Screenagent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945, 2024. [39] R OpenAI. Gpt-4 technical report. arxiv 2303.08774. View in Article, 2:13, 2023. [40] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023. [41] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. [42] Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, and"},{"citing_arxiv_id":"2401.10935","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}