Cradle: Empowering Foundation Agents Towards General Computer Control

Bo An; Bohan Zhou; Börje F. Karlsson; Boyu Li; Ceyao Zhang; Chaojie Wang; Chuqiao Zong; Haochong Xia; Jiechuan Jiang; Junpeng Yue

arxiv: 2403.03186 · v3 · pith:DT2MBQDJnew · submitted 2024-03-05 · 💻 cs.AI

Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu

classification 💻 cs.AI

keywords cradle, agents, foundation, software, complex, control, across, complete

Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate daily software, like Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks to evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.
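The GCC interface described in the abstract (screenshots in, keyboard and mouse actions out) can be illustrated with a minimal sketch. Every name below (`Screenshot`, `KeyPress`, `MouseClick`, `FakeDesktop`, `run`) is hypothetical and chosen for illustration only; this is not Cradle's actual API, and the environment is a stand-in that records actions rather than driving real software.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Screenshot:
    """Observation: raw pixels only, with no access to internal application state."""
    width: int
    height: int
    pixels: bytes  # RGB, row-major (an assumed layout)


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseClick:
    x: int
    y: int
    button: str = "left"


# The whole action space: keyboard and mouse events, nothing software-specific.
Action = Union[KeyPress, MouseClick]
Policy = Callable[[Screenshot], List[Action]]


class FakeDesktop:
    """Hypothetical stand-in environment that logs actions instead of executing them."""

    def __init__(self, width: int = 4, height: int = 2) -> None:
        self.width, self.height = width, height
        self.log: List[Action] = []

    def capture(self) -> Screenshot:
        # A blank frame; a real setup would grab the screen here.
        return Screenshot(self.width, self.height, bytes(self.width * self.height * 3))

    def execute(self, action: Action) -> None:
        self.log.append(action)


def click_center_then_confirm(obs: Screenshot) -> List[Action]:
    """Toy policy: click the middle of the screen, then press Enter."""
    return [MouseClick(obs.width // 2, obs.height // 2), KeyPress("enter")]


def run(env: FakeDesktop, policy: Policy, steps: int) -> List[Action]:
    """Observe-act loop: one screenshot per step, then the actions the policy returns."""
    for _ in range(steps):
        obs = env.capture()
        for action in policy(obs):
            env.execute(action)
    return env.log
```

Because the policy sees only a `Screenshot` and returns only `Action` values, the same loop would apply unchanged to a game or a desktop application; in the framework the abstract describes, an LMM-driven planner would take the place of the toy policy.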

This paper has not been read by Pith yet.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
cs.SE 2026-06 unverdicted novelty 8.0

JAMER releases the first large-scale dataset and benchmark for project-level code frameworks on the Godot game engine, derived from game jam repositories and evaluated on frontier models showing sharp performance drop...

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use
cs.AI 2026-06 conditional novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications
cs.CV 2026-05 unverdicted novelty 7.0

AndroidDaily supplies 350 verifiable tasks on 94 closed-source Android apps evaluated by GRADE (87.37% human agreement), with the strongest model achieving 62% success.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
cs.CV 2026-06 unverdicted novelty 6.0

OmniGameArena is a unified UE5 benchmark with 12 games and the IDC harness for cold-start scores and improvement dynamics of VLM agents.

AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents
cs.CL 2026-05 unverdicted novelty 6.0

Introduces AgentOdyssey, a procedural generator of open-ended long-horizon text games, to evaluate test-time continual learning agents and diagnose limits in exploration, memory, and planning.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
cs.CV 2026-05 conditional novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents
cs.CV 2026-05 unverdicted novelty 6.0

SPIKE dual-controller framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

Long-Term Memory for VLA-based Agents in Open-World Task Execution
cs.RO 2026-04 unverdicted novelty 6.0

ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

SkillDroid: Compile Once, Reuse Forever
cs.HC 2026-04 conditional novelty 6.0

SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 r...

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
cs.CV 2026-04 unverdicted novelty 6.0

GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
cs.CR 2026-02 unverdicted novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
cs.AI 2025-12 conditional novelty 6.0

AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 5.0

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
cs.AI 2025-01 unverdicted novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 3.0

A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.