CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Dantong Niu; Ethan Kou; Fei-Fei Li; Guanya Shi; Guanzhi Wang; Haoru Xue; Huang Huang; Jiajun Wu; Justin Yu; Karim El-Refai

arxiv: 2603.22435 · v2 · pith:OPU5XVSMnew · submitted 2026-03-23 · 💻 cs.RO · cs.AI

Letian Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Dantong Niu, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry, Yuke Zhu, Ken Goldberg, Linxi "Jim" Fan

classification 💻 cs.RO cs.AI

keywords agents · manipulation · cap-x · framework · improves · across · cap-bench · code-as-policy

"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet the effectiveness of coding agents as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that scaling agentic test-time computation (through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning) mitigates this gap and substantially improves robustness, even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing that reinforcement learning with verifiable rewards improves success rates and transfers from simulation to real hardware with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.
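
To make the abstract's idea of multi-turn interaction with structured execution feedback concrete, the following is a minimal toy sketch in Python. It is not the CaP-X, CaP-Gym, or CaP-Agent0 API: the `ToyWorld` class, the `detect`, `move_to`, and `grasp` primitives, and the `scripted_model` stub standing in for a language model are all hypothetical names invented for illustration. The sketch only shows the general pattern of generating a program, executing it over perception and control primitives, and passing the captured error back for the next attempt.

```python
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a simulator: one block and one gripper."""
    block_pos: tuple = (0.4, 0.1, 0.02)
    gripper_pos: tuple = (0.0, 0.0, 0.3)
    holding: bool = False

    def detect(self, name):
        # Hypothetical perception primitive: look up an object's position.
        if name != "block":
            raise ValueError(f"unknown object: {name}")
        return self.block_pos

    def move_to(self, pos):
        # Hypothetical control primitive: teleport the gripper to pos.
        self.gripper_pos = tuple(pos)

    def grasp(self):
        # Hypothetical control primitive: succeeds only at the object.
        if self.gripper_pos != self.block_pos:
            raise RuntimeError("grasp failed: gripper is not at the object")
        self.holding = True


def run_program(source, world):
    """Execute one generated program and return structured feedback."""
    namespace = {
        "detect": world.detect,
        "move_to": world.move_to,
        "grasp": world.grasp,
    }
    try:
        exec(source, namespace)
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": True, "error": None}


def scripted_model(feedback_history):
    """Stub for a language model: returns a program given past feedback."""
    if not feedback_history:
        # The first attempt forgets to move before grasping.
        return "grasp()"
    # After seeing an error, the stub emits a corrected program.
    return "pos = detect('block')\nmove_to(pos)\ngrasp()"


def multi_turn_episode(world, model, max_turns=3):
    """Run generate-execute-feedback turns until success or the turn limit."""
    history = []
    for turn in range(1, max_turns + 1):
        feedback = run_program(model(history), world)
        history.append(feedback)
        if feedback["ok"] and world.holding:
            return turn, history
    return None, history


turns_used, history = multi_turn_episode(ToyWorld(), scripted_model)
print(turns_used, history)
```

In this toy run the first program raises an error, the error is recorded as feedback, and the second program succeeds. A real agent would replace `scripted_model` with a model call that conditions on the feedback, and would sandbox execution rather than calling `exec` directly.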

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
cs.CV 2026-05 unverdicted novelty 8.0

Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.

Improving Robotic Generalist Policies via Flow Reversal Steering
cs.RO 2026-06 unverdicted novelty 7.0

Flow Reversal Steering steers flow matching generalist policies by reversing suboptimal actions to nearby better modes, enabling improved zero-shot control, quick distillation, and RL bootstrapping in robotic manipulation.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or too...

Sequential Planning via Anchored Robotic Keypoints
cs.RO 2026-06 unverdicted novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
cs.RO 2026-06 unverdicted novelty 6.0

APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping
cs.RO 2026-05 unverdicted novelty 6.0

GraspGen-X extends diffusion 6-DOF grasping to cross-embodiment via swept-volume gripper encoding, trained on procedural grippers and 2B grasps, claiming best zero-shot generalization to novel grippers in sim and real tests.

From Question Answering to Task Completion: A Survey on Agent System and Harness Design
cs.AI 2026-06 unverdicted novelty 4.0

Survey framing LLM agents as model-plus-harness systems, decomposing harness responsibilities, mapping them to tasks, and highlighting open challenges in evaluation, safety, and co-evolution.

ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents
cs.CV 2026-04 unverdicted novelty 4.0

ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...