
arxiv: 2606.18363 · v1 · pith:JBRPT56Mnew · submitted 2026-06-16 · 💻 cs.RO · cs.AI

Guava: An Effective and Universal Harness for Embodied Manipulation

Pith reviewed 2026-06-27 00:20 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords embodied manipulation · harness · tool use · language models · simulation to real · open-source models · generalization

The pith

A well-designed harness enables compact open-source models to achieve frontier-level embodied manipulation with minimal simulation data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores what makes an effective harness for using language models in embodied manipulation tasks. It identifies three key ingredients through systematic design: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. Using these, the authors build Guava and distill capabilities into a 4B parameter model with fewer than 2,000 simulation trajectories. This approach yields performance comparable to large proprietary models in both simulation and real-world settings, with good generalization. A sympathetic reader would care because it points to a scalable way to create capable embodied agents without needing massive models or real-world data collection.

Core claim

Guava is a harness framework developed by exploring agent workflows, action spaces, and observation spaces. It shows that iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations are the three key ingredients for effective embodied agents. With an end-to-end training pipeline distilling these into a 4B open-source model using under 2K simulation trajectories, the system achieves strong performance in simulation and real-world environments, matching frontier models while generalizing to unseen objects, novel instructions, and long-horizon tasks.

What carries the argument

The Guava harness framework, which serves as a model-agnostic interface combining high-level reasoning from language models with external modules for perception, planning, and control.
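
To make the three ingredients concrete, the following is a minimal sketch of an iterative perception-reasoning-action loop. It is not Guava's implementation: `ToyEnv`, `scripted_policy`, `run_harness`, and the `pick`/`place` skills are hypothetical names chosen for illustration, and the scripted policy stands in for a language model. The loop re-observes after every action, the actions are named skills rather than low-level motor commands, and each observation carries an image slot alongside a text summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Observation:
    """Multimodal observation: an image slot plus a text summary of the scene."""
    image: bytes
    text: str


@dataclass
class Action:
    """Semantic action: a named skill with arguments, not raw joint commands."""
    name: str
    args: Dict[str, str] = field(default_factory=dict)


# A policy maps (instruction, observation, history) to the next action,
# or returns None when it considers the task finished.
Policy = Callable[[str, Observation, List[Action]], Optional[Action]]


class ToyEnv:
    """Hypothetical stand-in for the external perception and control modules."""

    def __init__(self) -> None:
        self.holding: Optional[str] = None
        self.placed = False

    def observe(self) -> Observation:
        # A real harness would attach a camera frame; the image is empty here.
        return Observation(image=b"", text=f"holding={self.holding}, placed={self.placed}")

    def execute(self, action: Action) -> None:
        if action.name == "pick":
            self.holding = action.args["object"]
        elif action.name == "place" and self.holding is not None:
            self.holding = None
            self.placed = True


def scripted_policy(instruction: str, obs: Observation, history: List[Action]) -> Optional[Action]:
    """Placeholder for the language model: decides from the latest observation only."""
    if "placed=True" in obs.text:
        return None
    if "holding=None" in obs.text:
        return Action("pick", {"object": "can"})
    return Action("place", {"target": "box"})


def run_harness(instruction: str, env: ToyEnv, policy: Policy, max_steps: int = 10) -> List[Action]:
    """Alternate perception, reasoning, and action until the policy stops or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = env.observe()                          # perception
        action = policy(instruction, obs, history)   # reasoning
        if action is None:
            break
        env.execute(action)                          # action
        history.append(action)
    return history


if __name__ == "__main__":
    steps = run_harness("place the can in the box", ToyEnv(), scripted_policy)
    print([a.name for a in steps])
```

Because the policy is only a callable, the same loop could in principle wrap a proprietary model or a distilled compact one, which is the sense in which such an interface is model-agnostic.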

If this is right

  • Compact models can exhibit strong emergent embodied capabilities when paired with the right harness.
  • Simulation-collected trajectories suffice for training effective real-world agents.
  • Performance and generalization hold across unseen objects and long-horizon tasks.
  • Harness design principles apply universally even to small models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-source robotics could advance faster by focusing on harness interfaces rather than scaling model size.
  • Similar harness approaches might extend to other embodied domains like navigation or multi-agent tasks.
  • Reduced data needs could make custom robot training more accessible to smaller labs.

Load-bearing premise

The three ingredients are universal across model sizes, and policies trained on simulation trajectories transfer to the real world without a major domain gap.

What would settle it

A controlled experiment showing that ablating one of the three ingredients causes a significant drop in real-world task success rates for the 4B model.
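
One way such an ablation could be scored is with a one-sided Fisher exact test on success counts. The sketch below uses only the Python standard library; the function name `fisher_one_sided` and the trial counts in the usage line are hypothetical and are not results from the paper.

```python
from math import comb


def fisher_one_sided(full_succ: int, full_n: int, abl_succ: int, abl_n: int) -> float:
    """P-value for 'the ablated harness succeeds no more often than chance allows'.

    Under the null hypothesis that both conditions share one success rate,
    the number of ablated successes follows a hypergeometric distribution
    given the margins. The p-value is the probability of seeing abl_succ
    or fewer successes in the ablated condition.
    """
    total_n = full_n + abl_n
    total_succ = full_succ + abl_succ
    lowest = max(0, total_succ - full_n)  # smallest feasible ablated success count
    numerator = sum(
        comb(total_succ, x) * comb(total_n - total_succ, abl_n - x)
        for x in range(lowest, abl_succ + 1)
    )
    return numerator / comb(total_n, abl_n)


if __name__ == "__main__":
    # Hypothetical counts: 14/15 successes with the full harness, 5/15 with one ingredient removed.
    print(fisher_one_sided(14, 15, 5, 15))
```

A small p-value would indicate that the drop is unlikely to come from trial-to-trial noise alone; with only a handful of real-world trials per condition, even large apparent drops can fail this check.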

Original abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. Results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Guava, a harness framework for embodied tool use in manipulation tasks. Through systematic exploration of agent workflows, action spaces, and observation spaces, it identifies three key ingredients: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. The authors develop an end-to-end distillation pipeline that transfers these capabilities into a 4B open-source model using fewer than 2K simulation trajectories. Experimental results are reported to show performance comparable to frontier proprietary models in both simulation and real-world settings, along with strong generalization to unseen objects, novel instructions, and long-horizon tasks.

Significance. If the experimental claims hold, the work is significant because it demonstrates a model-agnostic interface that unlocks embodied capabilities in compact open-source models with minimal data. The identification of the three ingredients and the simulation-based distillation pipeline provide concrete, potentially reusable design principles. Strengths include the focus on reproducibility via simulation trajectories and the emphasis on generalization, which could reduce dependence on large proprietary systems if the sim-to-real transfer is robustly validated.

minor comments (3)
  1. The abstract states that results are 'comparable to frontier proprietary models' without naming the specific baselines or reporting quantitative metrics; adding a brief table of key numbers in the abstract or introduction would improve clarity.
  2. The section describing the distillation pipeline should explicitly state the training objective, batch sizes, and any regularization terms used when fine-tuning the 4B model, to aid reproducibility.
  3. Figure captions for the real-world experiments should include the number of trials per condition and any failure-mode categorization to make the generalization claims easier to evaluate at a glance.
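
To illustrate why the third comment asks for trial counts, here is a small sketch of a Wilson score interval for a success rate, using only the standard library. The function name `wilson_interval` and the example counts are assumptions for illustration, not figures reported by the paper.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical example: a perfect score over 15 trials versus over 150 trials.
    print(wilson_interval(15, 15))
    print(wilson_interval(150, 150))
```

The same observed success rate carries very different uncertainty depending on the number of trials, so a caption that omits the trial count leaves the strength of a generalization claim hard to judge.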

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and the recommendation for minor revision. The assessment correctly identifies the core contributions of Guava as a model-agnostic harness and the distillation pipeline.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents Guava as the outcome of systematic empirical exploration of agent workflows, action spaces, and observation spaces, followed by an end-to-end distillation pipeline evaluated on held-out simulation and real-world tasks. No equations, fitted parameters renamed as predictions, or self-citations are invoked as load-bearing premises for the core claims. The three ingredients are reported as findings from the exploration rather than presupposed by definition, and performance results are presented as external validation rather than tautological outputs. The derivation remains self-contained against the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The review is abstract-only, so ledger entries are inferred at a high level from the stated approach. The main unverified premises are sim-to-real transfer and the universality of the three design ingredients.

axioms (1)
  • domain assumption: Simulation trajectories suffice to train models that generalize to real-world manipulation without a substantial domain gap.
    The paper trains exclusively in simulation yet claims real-world performance and generalization.

pith-pipeline@v0.9.1-grok · 5790 in / 1294 out tokens · 33833 ms · 2026-06-27T00:20:11.382811+00:00 · methodology

