{"total":13,"items":[{"citing_arxiv_id":"2606.29472","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent-Computer Observation Interfaces Enable Dynamic Computer Use","primary_cat":"cs.AI","submitted_at":"2026-06-28T15:59:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11078","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A History-Aware Visually Grounded Critic for Computer Use Agents","primary_cat":"cs.AI","submitted_at":"2026-06-09T16:39:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02031","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents","primary_cat":"cs.LG","submitted_at":"2026-06-01T10:20:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenWebRL trains a 4B visual web agent with online RL on live sites using 0.4K init trajectories and 2.2K RL tasks to reach 67% success on Online-Mind2Web and 64% on DeepShop, outperforming prior open 
agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01533","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Agent Computer Use","primary_cat":"cs.MA","submitted_at":"2026-06-01T01:29:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28775","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents","primary_cat":"cs.LG","submitted_at":"2026-05-27T17:37:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19219","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-19T00:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer 
traffic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18181","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scalable Environments Drive Generalizable Agents","primary_cat":"cs.AI","submitted_at":"2026-05-18T10:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generalizable agents require environment scaling via diverse executable rule-sets, distinguished from trajectory and task scaling in a new taxonomy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11212","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","primary_cat":"cs.CL","submitted_at":"2026-05-11T20:27:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08965","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can MLLMs Reason About Visual Persuasion? 
Evaluating the Efficacy and Faithfulness of Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-09T14:18:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure faithful reasoning and that decision sensitivity best matches human preferences.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Rationale Extraction. For each image-message-label triple in the training split, we apply all seven prompt templates from Appendix A, obtaining rationales that differ in both evidence polarity and visual granularity. We construct two teacher-generated rationale sets: Qwen, generated by Qwen2.5-VL-72B-Instruct [27], and Phi, generated by Phi-4-reasoning-vision-15B [28]. After 4 Table 1: Persuasiveness classification performance of student models on the PVP test split. Our method (last row) fine-tunes the student using supervision on teacher-generated rationales from diverse reasoning perspectives. Bold indicates the best result within each student column. Qwen2.5-VL-7B-Instruct Phi-3.5-Vision-Instruct Setup Bal. 
Acc."},{"citing_arxiv_id":"2605.06761","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weblica: Scalable and Reproducible Training Environments for Visual Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"work removes these aids entirely, building agents that operate on raw screenshots and predict actions as pixel coordinates [28, 34, 1, 5, 9, 24, 8]. We follow this direction and train visual web agents with screenshot input and coordinate-based actions. Data and Environments for Web Agents. Several efforts collect supervised fine-tuning (SFT) trajectories through human annotation or model-generated rollouts. Fara [5] develops a multi-agent data generation system that produces 145K trajectories across 70K domains. MolmoWeb [9] combines over 100K synthetic task trajectories with 30K+ human demonstrations and GUI perception data. OpenCUA [36] and AgentTrek [40] similarly collect demonstration data for web tasks. 
While valuable, SFT data alone provides limited support"},{"citing_arxiv_id":"2604.28181","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Synthetic Computers at Scale for Long-Horizon Productivity Simulation","primary_cat":"cs.AI","submitted_at":"2026-04-30T17:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The authors built 1,000 synthetic computers and ran long-horizon agent simulations on them, generating experiential data that improves AI performance on both similar and new productivity tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08516","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoWeb: Open Visual Web Agent and Open Data for the Open Web","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:54:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"perception data consisting of referring expression grounding and screenshot question-answering examples. Despite their relatively compact size at 4B or 8B scale, MolmoWeb agents are highly performant. Without distilling from other visual web agents, MolmoWeb agents are competitive with or outperform open-weights agents of comparable size, including Fara-7B [17] and Holo1-7B [18]. 
More notably, they also outperform set-of-marks (SoM) prompting based web agents [19] built on much larger closed frontier models like GPT-4o that take both AxTree and SoM-annotated screenshots as inputs. This result is particularly striking because these closed-model baselines enjoy substantially richer input representations and orders-of-magnitude more"},{"citing_arxiv_id":"2601.18842","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents","primary_cat":"cs.CR","submitted_at":"2026-01-26T11:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"replacement. Category Model Semantic Consistency Score (0-4) Black Mask Mosaic Mask Random Blocks Text Box Replace Android Avg. PC Avg. Overall Avg. GUI Agent UI-Tars-1.5-7B [52] 1.16 1.44 0.88 1.01 1.27 0.95 1.11 Fara-7B [104] 2.02 1.94 2.11 1.95 2.34 1.69 2.04 GUI-Owl-7B [103] 2.56 2.50 2.85 2.32 2.63 2.47 2.55 Open-source VLM Qwen3-VL-235B-A22B [105] 2.71 2.92 2.63 2.63 2.92 2.52 2.73 Closed-source VLM Claude-Sonnet-4.5 [22] 2.87 2.74 3.00 3.03 3.44 2.42 2.96 GPT-5.2 [102] 3.33 3.30 3.44 3.17 3.33 3.28 3.31 Gemini-3-Pro [33] 3.32 3.42 3.40 3.22 3.45 3.24 3.35 Fidelity of Protection Methods 2.57 2.61 2.62 2.48 / / / 6.2 Evaluation Result As shown in Table 1, LLM-as-Judge compares the planning results of four general-purpose models"}],"limit":50,"offset":0}