GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
hub
Fara-7b: An efficient agentic model for computer use
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 13representative citing papers
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
OpenWebRL trains a 4B visual web agent with online RL on live sites using 0.4K init trajectories and 2.2K RL tasks to reach 67% success on Online-Mind2Web and 64% on DeepShop, outperforming prior open agents.
A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.
LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.
SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure faithful reasoning and that decision sensitivity best matches human preferences.
The authors built 1,000 synthetic computers and ran long-horizon agent simulations on them, generating experiential data that improves AI performance on both similar and new productivity tasks.
Generalizable agents require environment scaling via diverse executable rule-sets, distinguished from trajectory and task scaling in a new taxonomy.
citing papers explorer
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
Agent-Computer Observation Interfaces Enable Dynamic Computer Use
AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.
-
A History-Aware Visually Grounded Critic for Computer Use Agents
HiViG is a test-time critic that combines macro-action history summarization with visual grounding of execution coordinates to reduce short-sighted and visually erroneous actions in long-horizon GUI agents.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
-
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
OpenWebRL trains a 4B visual web agent with online RL on live sites using 0.4K init trajectories and 2.2K RL tasks to reach 67% success on Online-Mind2Web and 64% on DeepShop, outperforming prior open agents.
-
Multi-Agent Computer Use
A manager-driven DAG decomposition with parallel subagents improves computer use agent success rates by 3.4-25.5% and reduces wall-clock time on long-horizon benchmarks.
-
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
LearnWeak specializes small CUAs via weakness detection by a reference agent, targeted task synthesis, and error-aware training, delivering 11+ point gains on OSWorld.
-
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents
SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.
-
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
ReVision reduces token usage by 46% and improves success rate by 3% on OSWorld, WebTailBench, and AgentNetBench by removing redundant visual patches from 5-history trajectories with Qwen2.5-VL-7B.
-
Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning
Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure faithful reasoning and that decision sensitivity best matches human preferences.
-
Synthetic Computers at Scale for Long-Horizon Productivity Simulation
The authors built 1,000 synthetic computers and ran long-horizon agent simulations on them, generating experiential data that improves AI performance on both similar and new productivity tasks.
-
Scalable Environments Drive Generalizable Agents
Generalizable agents require environment scaling via diverse executable rule-sets, distinguished from trajectory and task scaling in a new taxonomy.