Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.
Step-gui technical report
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 11roles
background 3polarities
background 3representative citing papers
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
citing papers explorer
-
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Workflow-GYM is a new benchmark for long-horizon professional GUI agent tasks where state-of-the-art models reach only slightly above 30% success.
-
AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.
-
ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
-
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
MobileExplorer reduces on-device GUI agent reasoning steps and latency by 23% via parallel UI exploration, structured memory, and a two-level rollback while maintaining or improving task success rates.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Xiaomi-GUI-0 Technical Report
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.