GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
hub
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 11roles
background 4polarities
background 4representative citing papers
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
PersonalHomeBench is a new benchmark showing that AI agents suffer systematic performance drops in personalized smart homes as task complexity rises, especially in counterfactual reasoning and partial observability.
DeltaBox achieves millisecond-level checkpoint (14ms) and rollback (5ms) for AI agent sandboxes by layering file states and using incremental process dumps to exploit similarity between consecutive checkpoints.
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
ContractSkill converts draft web agent skills into explicit executable contracts that enable deterministic verification, fault localization, and minimal local repair, improving stability on benchmarks like VisualWebArena.
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.
citing papers explorer
-
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
PersonalHomeBench: Evaluating Agents in Personalized Smart Homes
PersonalHomeBench is a new benchmark showing that AI agents suffer systematic performance drops in personalized smart homes as task complexity rises, especially in counterfactual reasoning and partial observability.
-
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
DeltaBox achieves millisecond-level checkpoint (14ms) and rollback (5ms) for AI agent sandboxes by layering file states and using incremental process dumps to exploit similarity between consecutive checkpoints.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
-
ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
ContractSkill converts draft web agent skills into explicit executable contracts that enable deterministic verification, fault localization, and minimal local repair, improving stability on benchmarks like VisualWebArena.
-
Sema: Semantic Transport for Real-Time Multimodal Agents
Sema reduces uplink bandwidth by 64x for audio and 130-210x for screenshots while keeping multimodal agent task accuracy within 0.7 percentage points of raw baselines in WAN simulations.
-
Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.
- AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation