Guava: An Effective and Universal Harness for Embodied Manipulation
Pith reviewed 2026-06-27 00:20 UTC · model grok-4.3
The pith
A well-designed harness enables compact open-source models to achieve frontier-level embodied manipulation with minimal simulation data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guava is a harness framework developed by exploring agent workflows, action spaces, and observation spaces. It shows that iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations are the three key ingredients for effective embodied agents. With an end-to-end training pipeline distilling these into a 4B open-source model using under 2K simulation trajectories, the system achieves strong performance in simulation and real-world environments, matching frontier models while generalizing to unseen objects, novel instructions, and long-horizon tasks.
What carries the argument
The Guava harness framework, which serves as a model-agnostic interface combining high-level reasoning from language models with external modules for perception, planning, and control.
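To make the three ingredients concrete, the following is a minimal sketch of an iterative perception-reasoning-action loop, assuming a toy symbolic environment. Every name here (`Observation`, `Action`, `ToyTabletop`, `scripted_policy`, `run_episode`) is hypothetical and is not taken from the Guava codebase; the scripted policy stands in for the language model, and the string `image` field stands in for a camera frame.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """Multimodal observation: an image stand-in plus structured text."""
    image: str                # placeholder for a camera frame (assumption)
    objects: Dict[str, str]   # object name -> location, as a perception module might report
    feedback: str = ""        # textual result of the previous action


@dataclass
class Action:
    """Semantic action: a named skill with symbolic arguments, not joint targets."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


class ToyTabletop:
    """Tiny symbolic stand-in for a simulator, for illustration only."""

    def __init__(self) -> None:
        self.locations: Dict[str, str] = {"can": "table", "box": "table"}
        self.holding: Optional[str] = None

    def observe(self, feedback: str = "") -> Observation:
        return Observation("<rgb frame>", dict(self.locations), feedback)

    def execute(self, action: Action) -> str:
        if action.name == "pick" and self.holding is None:
            self.holding = action.args["object"]
            self.locations[self.holding] = "gripper"
            return f"picked {self.holding}"
        if action.name == "place" and self.holding is not None:
            placed, self.holding = self.holding, None
            self.locations[placed] = action.args["target"]
            return f"placed {placed} in {action.args['target']}"
        return f"failed: {action.name}"


Policy = Callable[[str, Observation, List[Action]], Action]


def scripted_policy(instruction: str, obs: Observation, history: List[Action]) -> Action:
    """Placeholder for the model's reasoning step (hypothetical)."""
    if obs.objects["can"] == "table":
        return Action("pick", {"object": "can"})
    if obs.objects["can"] == "gripper":
        return Action("place", {"target": "box"})
    return Action("done")


def run_episode(env: ToyTabletop, policy: Policy, instruction: str,
                max_steps: int = 8) -> List[Action]:
    """Observe, reason, act, then re-observe until the policy says it is done."""
    history: List[Action] = []
    obs = env.observe()
    for _ in range(max_steps):
        action = policy(instruction, obs, history)
        history.append(action)
        if action.name == "done":
            break
        obs = env.observe(env.execute(action))  # fresh observation after every action
    return history


env = ToyTabletop()
trace = run_episode(env, scripted_policy, "place can in box")
print([a.name for a in trace])  # ['pick', 'place', 'done']
```

The point of the sketch is the shape of the interface: the policy only ever sees observations and emits named skills, so swapping the scripted function for a different reasoning model would not require changes to the environment side.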
If this is right
- Compact models can exhibit strong emergent embodied capabilities when paired with the right harness.
- Simulation-collected trajectories suffice for training effective real-world agents.
- Performance and generalization hold across unseen objects and long-horizon tasks.
- Harness design principles apply universally even to small models.
Where Pith is reading between the lines
- Open-source robotics could advance faster by focusing on harness interfaces rather than scaling model size.
- Similar harness approaches might extend to other embodied domains like navigation or multi-agent tasks.
- Reduced data needs could make custom robot training more accessible to smaller labs.
Load-bearing premise
The three ingredients hold across model sizes, and policies trained on simulation trajectories transfer to the real world without a substantial sim-to-real gap.
What would settle it
A controlled experiment showing that ablating one of the three ingredients causes a significant drop in real-world task success rates for the 4B model.
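One way such an ablation could be scored is a one-sided Fisher exact test on success counts with and without an ingredient. The sketch below uses only the standard library; the function name and the example counts (14/15 versus 6/15) are made up for illustration and are not results from the paper.

```python
from math import comb


def fisher_exact_one_sided(success_full: int, n_full: int,
                           success_ablated: int, n_ablated: int) -> float:
    """Probability that the full harness would score at least `success_full`
    successes if both conditions shared one underlying success rate
    (hypergeometric upper tail)."""
    total_success = success_full + success_ablated
    total = n_full + n_ablated
    denom = comb(total, n_full)
    upper = min(n_full, total_success)
    p_value = 0.0
    for k in range(success_full, upper + 1):
        # k successes land in the full-harness group, the rest in the ablated group
        p_value += comb(total_success, k) * comb(total - total_success, n_full - k) / denom
    return p_value


# Hypothetical counts, for illustration only: full harness 14/15, ablated 6/15.
p = fisher_exact_one_sided(14, 15, 6, 15)
print(f"one-sided p-value: {p:.4f}")
```

With small per-condition trial counts, an exact test of this kind avoids the normal approximation that a two-proportion z-test would rely on.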
Original abstract
Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Guava, a harness framework for embodied tool use in manipulation tasks. Through systematic exploration of agent workflows, action spaces, and observation spaces, it identifies three key ingredients: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. The authors develop an end-to-end distillation pipeline that transfers these capabilities into a 4B open-source model using fewer than 2K simulation trajectories. Experimental results are reported to show performance comparable to frontier proprietary models in both simulation and real-world settings, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks.
Significance. If the experimental claims hold, the work is significant because it demonstrates a model-agnostic interface that unlocks embodied capabilities in compact open-source models with minimal data. The identification of the three ingredients and the simulation-based distillation pipeline provide concrete, potentially reusable design principles. Strengths include the focus on reproducibility via simulation trajectories and the emphasis on generalization, which could reduce dependence on large proprietary systems if the sim-to-real transfer is robustly validated.
minor comments (3)
- The abstract states that results are 'comparable to frontier proprietary models' without naming the specific baselines or reporting quantitative metrics; adding a brief table of key numbers in the abstract or introduction would improve clarity.
- The section describing the distillation pipeline should explicitly state the training objective, batch sizes, and any regularization terms used when fine-tuning the 4B model, to aid reproducibility.
- Figure captions for the real-world experiments should include the number of trials per condition and any failure-mode categorization to make the generalization claims easier to evaluate at a glance.
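The request for trial counts matters because a success rate over a handful of episodes carries wide uncertainty. As a rough illustration, the sketch below computes a Wilson score interval; the function name is hypothetical, and the 15-trial example is chosen only because 15 episodes is a plausible per-task count, not because it is the real-world protocol.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# A perfect 15/15 score still leaves a lower bound of about 0.80.
low, high = wilson_interval(15, 15)
print(f"15/15 successes: [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside each success rate would let readers judge at a glance whether two conditions are distinguishable.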
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and the recommendation for minor revision. The assessment correctly identifies the core contributions of Guava as a model-agnostic harness and the distillation pipeline.
Circularity Check
No significant circularity in derivation chain
The paper presents Guava as the outcome of systematic empirical exploration of agent workflows, action spaces, and observation spaces, followed by an end-to-end distillation pipeline evaluated on held-out simulation and real-world tasks. No equations, fitted parameters renamed as predictions, or self-citations are invoked as load-bearing premises for the core claims. The three ingredients are reported as findings from the exploration rather than presupposed by definition, and performance results are presented as external validation rather than tautological outputs. The derivation remains self-contained against the reported experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Simulation trajectories suffice to train models that generalize to real-world manipulation without a substantial domain gap
Reference graph
Works this paper leans on
- [1] SAM 3: Segment Anything with Concepts
Accessed: 2026-05-28. Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl P...
doi:10.15607/rss.2025.xxi.010 2026
- [2] Danny Driess, Fei Xia, Mehdi S
https://openreview.net/forum?id=Ey8KcabBpB. Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mord...
- [3] Molmoact2: Action reasoning models for real-world deployment, 2026. https://arxiv.org/abs/2605.02881
Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali...
2026
- [4]
Accessed: 2026-05-28. Max Fu, Justin Yu, Karim El-Refai, Ethan Kou, Haoru Xue, Huang Huang, Wenli Xiao, Guanzhi Wang, Fei-Fei Li, Guanya Shi, Jiajun Wu, Shankar Sastry, Yuke Zhu, Ken Goldberg, and Linxi "Jim" Fan. Cap-x: A framework for benchmarking and improving coding agents for robot manipulation, 2026. https://arxiv.org/abs/2603.22435. Tianrui Guan, Fu...
2026
- [5]
Accessed: 2026-05-28. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal,...
2026
- [6] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa
https://proceedings.mlr.press/v270/kim25c.html. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. https://arxiv.org/abs/2205.11916. Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Ac...
2023
- [7] Jacky Liang, Wenlong Huang, F
https://openreview.net/forum?id=h7aQxzKbq6. Jacky Liang, Wenlong Huang, F. Xia, Peng Xu, Karol Hausman, Brian Ichter, Peter R. Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500, 2022. https://api.semanticscholar.org/CorpusID:25235554...
2023
- [8]
GitHub repository, accessed 2026-06-04. Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian GE, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, and Ping Luo. Robocodex: multimodal code generation for robotic behavior synthesis. InProceed...
2026
- [9] Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026a
OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026a. Accessed: 2026-05-28. OpenAI. Harness engineering: Leveraging Codex in an agent-first world, February 2026b. https://openai.com/index/harness-engineering/. OpenAI. Codex: AI coding partner from OpenAI. https://openai.com/codex/, 2026c. Accessed: 2026-05-28. Qwen Team. Qwen3.5...
2026
- [10] Ranjan Sapkota, Yang Cao, Konstantinos I
Accessed: 2026-05-28. Ranjan Sapkota, Yang Cao, Konstantinos I. Roumeliotis, and Manoj Karkee. Vision-language-action (vla) models: Concepts, progress, applications and challenges, 2026. https://arxiv.org/abs/2505.04769. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. D...
2026
- [11] doi: 10.1007/s10514-023-10135-3
ISSN 0929-5593. doi: 10.1007/s10514-023-10135-3. https://doi.org/10.1007/s10514-023-10135-3. Peter Steinberger and The OpenClaw Community. OpenClaw: Your own personal ai assistant. https://github.com/openclaw/openclaw,
- [12]
Accessed: 2026-05-28. Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, and Salman Khan. LlamaV-o1: Rethinking step-by-step visual reasoning in LLMs. In Wanxiang Che, Joyce Na...
2026
- [13] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl.1247. https://aclanthology.org/2025.findings-acl.1247/. Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. ...
doi:10.18653/v1/2025.findings-acl.1247 2025
- [14]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. https://arxiv.org/abs/2210.03629. Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model...
2023
- [15] name": "tool_name
Contents: 1 Introduction; 2 Related Work; 3 Guava: Harnessing VLM for Embodied Manipulation; 3.1 Designing Effective Harness; 3.2 Learning Efficient and Generalizable Agentic Embodied Reasoning; 4 Experiments; 4.1 Setup ...
2020
- [16] C.2 RL Training Unlike the standard post-training recipe (Zhou et al., 2025; Shao et al.,
as our training pipeline. C.2 RL Training Unlike the standard post-training recipe (Zhou et al., 2025; Shao et al.,
2025
- [17] Efficiency. Compared to Guava’s harness framework with GPT-5.4 (OpenAI, 2026a), Guava-Agent-4B uses fewer tokens, as shown in Figure
and Claude-Sonnet-4.6 (Qwen Team, 2026), while Qwen3.5-2B (OpenAI, 2026a), due to its small size, performs poorly because of weak instruction following and tool calling: it frequently hallucinates wrong reasoning and invalid tool calls. Efficiency. Compared to Guava’s harness framework with GPT-5.4 (OpenAI, 2026a), Guava-Agent-4B uses fewer tokens a...
2026
-
[18]
Best results are markedbold
16 Table 3 Results of Guava harness with additional base models.Values are success rates (%) evaluated over 15 episodes. Best results are markedbold. Success Rate (%) Task Name Gemini-3.1-Pro(Fu et al., 2026)Claude-Sonnet-4.6(Qwen Team, 2026)Qwen3.5-2B(OpenAI, 2026a) ID place can in box 100.0 100.0 0.0 arrange by size 100.0100.00.0 remove cube from tray 1...
2026