ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.
Os-genesis: Automating gui agent trajectory construction via reverse task synthesis
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
background 1representative citing papers
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.
UI-Copilot adds a selective copilot for memory and math to GUI agents and trains tool use with separate single-turn and multi-turn optimization, yielding SOTA results on MemGUI-Bench and a 17.1% gain on AndroidWorld.
RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile agents, plus a new Chinese GUI dataset and benchmark.
LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on benchmarks and online tests.
AliyunConsoleAgent-32B reaches 63.52% success on a 278-task cloud console benchmark, closing to 1.82pp of frontier models at 92% lower cost via SFT distillation and GRPO RL.
SynthAgent uses dual refinement of synthetic tasks and trajectories to produce higher-quality training data that improves web agent adaptation to target environments.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
citing papers explorer
-
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.
-
FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems
FedGUI is the first comprehensive benchmark for federated GUI agents that studies cross-platform, cross-device, cross-OS, and cross-source heterogeneity, with experiments showing performance gains from cross-platform collaboration and identifying platform and OS as the most influential factors.
-
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.
-
UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
UI-Copilot adds a selective copilot for memory and math to GUI agents and trains tool use with separate single-turn and multi-turn optimization, yielding SOTA results on MemGUI-Bench and a 17.1% gain on AndroidWorld.
-
AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning
AliyunConsoleAgent-32B reaches 63.52% success on a 278-task cloud console benchmark, closing to 1.82pp of frontier models at 92% lower cost via SFT distillation and GRPO RL.