
arXiv: 2508.13998 · v2 · submitted 2025-08-19 · 💻 cs.RO · cs.AI · cs.LG

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords embodied reasoning · robotic manipulation · vision-language model · reinforced fine-tuning · zero-shot generalization · pointing representation · perception-action gap · SIMPLEREnv

The pith

Defining four core embodied pointing abilities and training a 3B model with reinforced fine-tuning bridges vision-language understanding to robot actions for zero-shot generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the seeing-to-doing gap in embodied AI that comes from scarce specialized data and differences across robot bodies. It establishes pointing as a single intermediate format that works for any robot by defining four specific abilities that turn visual and language inputs into action steps. A large dataset of 200K pointing examples is assembled from embodied and general sources, then used to train Embodied-R1 through a two-stage reinforced fine-tuning process with multi-task rewards. A sympathetic reader would care because success here would mean robots can handle new tasks and physical setups using existing vision-language models without retraining for each case.

Core claim

Embodied-R1 pioneers pointing as an embodiment-agnostic intermediate representation through four core abilities that link high-level vision-language comprehension to low-level action primitives. The 3B model is trained with a two-stage reinforced fine-tuning curriculum on the Embodied-Points-200K dataset and reaches state-of-the-art results on 11 embodied spatial and pointing benchmarks. It further shows robust zero-shot generalization, with 56.2 percent success in SIMPLEREnv and 87.5 percent across eight real-world XArm tasks with no task-specific fine-tuning (a 62 percent gain over strong baselines), plus high robustness to visual disturbances.

What carries the argument

Pointing as a unified, embodiment-agnostic intermediate representation, consisting of four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives.

If this is right

  • The model reaches state-of-the-art performance on 11 embodied spatial and pointing benchmarks.
  • It delivers 56.2 percent success in simulation and 87.5 percent on real XArm tasks without any task-specific fine-tuning.
  • Performance improves 62 percent over strong baselines on the real-world tasks.
  • The model remains robust under diverse visual disturbances.
  • A pointing-centric representation paired with reinforced fine-tuning supplies a generalizable route to closing the perception-action gap.
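The 62 percent figure can be read in two ways, and the text above does not state the baseline score it is measured against. The sketch below works out which baseline each reading would imply for the 87.5 percent real-world result. The function names are illustrative, and the outputs are arithmetic consequences of the two readings, not numbers reported by the paper.

```python
def implied_baseline_relative(score: float, gain_percent: float) -> float:
    """Baseline implied if the gain is relative: score = baseline * (1 + gain/100)."""
    return score / (1.0 + gain_percent / 100.0)


def implied_baseline_absolute(score: float, gain_points: float) -> float:
    """Baseline implied if the gain is in percentage points: score = baseline + gain."""
    return score - gain_points


# 87.5 and 62 come from the summary above; the baselines are derived, not reported.
real_world_score = 87.5
reported_gain = 62.0

print(f"relative reading: baseline ~ {implied_baseline_relative(real_world_score, reported_gain):.1f}%")
print(f"absolute reading: baseline = {implied_baseline_absolute(real_world_score, reported_gain):.1f}%")
```

Under the relative reading the implied baseline is about 54 percent; under the percentage-point reading it is 25.5 percent. The two differ enough that the reading matters when comparing against other work.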

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pointing representation might transfer to other embodied domains such as navigation or object rearrangement without major redesign.
  • Increasing the size of the Embodied-Points dataset or the model could produce further gains on more complex multi-step tasks.
  • Direct comparison of the four abilities against alternative intermediate representations on the same robot hardware would clarify their relative contribution.

Load-bearing premise

That the four core embodied pointing abilities are sufficient to connect high-level vision-language comprehension to the low-level action primitives needed by heterogeneous robot embodiments and unseen tasks.

What would settle it

Running the model on a new manipulation task whose required actions fall outside the four defined pointing abilities and observing success rates drop well below the reported 56 percent in simulation or 87 percent in real settings would falsify the claim.

Figures

Figures reproduced from arXiv: 2508.13998 by Fei Ni, Haiqin Cui, Hongyao Tang, Jianye Hao, Pengyi Li, Yan Zheng, Yaoting Huang, Yibin Chen, Yifu Yuan, Zibin Dong.

Figure 1: The Embodied-R1 framework for zero-shot robotic manipulation through “pointing”.
Figure 2: Overview of four embodied pointing abilities.
Figure 3: Overview of training data: In stage 1, we focus on improving the model’s spatial reasoning capability, while incorporating a small amount of general reasoning data. In stage 2, we train the model’s embodied pointing capabilities, which comprise four distinct capability items.
Figure 4: Visualizing Embodied-R1’s Performance on Various Pointing Tasks.
Figure 5: The process of Embodied-R1 performing real-world tasks.
Figure 6: The process of Embodied-R1 performing Task 6 under different visual disturbances.
Figure 7: Case Analysis: Embodied-R1 possesses embodied reasoning capabilities. It can progressively locate relevant objects and infer spatial relationships according to task instructions, and ultimately provide coordinates through pointing based on embodied scene analysis.
Figure 8: Embodied-R1 exhibits strong generalization capabilities.
Figure 9: Qualitative comparison of Embodied-R1 and the SFT baseline.
Figure 10: Visualizing Embodied-R1’s Prediction on VTG Tasks across Various Scenarios.
Original abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Embodied-R1, a 3B vision-language model for embodied reasoning that defines four core embodied pointing abilities as an embodiment-agnostic intermediate representation bridging high-level vision-language comprehension and low-level action primitives. It constructs the Embodied-Points-200K dataset from embodied and general visual reasoning sources, then trains the model via a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward. The model reports state-of-the-art results on 11 embodied spatial and pointing benchmarks and demonstrates zero-shot generalization, achieving 56.2% success in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without task-specific fine-tuning (a claimed 62% relative improvement over strong baselines), plus robustness to visual disturbances.

Significance. If the central empirical claims hold after verification, the work could meaningfully advance embodied AI and robotics by offering a scalable pointing-centric pathway to close the perception-action gap across heterogeneous embodiments. The combination of large-scale dataset curation, RFT training, and real-robot zero-shot results on XArm tasks represents a concrete strength if baselines and ablations are properly documented; this could influence future generalizable manipulation systems.

major comments (3)
  1. [Abstract] Abstract: The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.
  2. [Abstract] Abstract and training description: The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.
  3. [Abstract] Abstract: Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.
minor comments (1)
  1. [Abstract] Abstract: The specific list of the 11 embodied spatial and pointing benchmarks and their individual performance numbers are not provided, only the aggregate SOTA claim and the two generalization tasks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below in a point-by-point manner and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.

    Authors: We agree that additional analysis would better substantiate the role of the four embodied pointing abilities. In the revised manuscript, we have added a dedicated ablation study (new Section 4.3) that isolates the performance contribution of each ability by training variants with individual abilities removed. We have also included a failure-case breakdown in the supplementary material that categorizes errors into pointing inaccuracies versus downstream execution issues. Finally, we have explicitly verified and stated in Section 5.2 that the eight real-world XArm tasks were held out from the Embodied-Points-200K dataset and represent novel scenarios, thereby supporting the zero-shot generalization claim. revision: yes

  2. Referee: [Abstract] Abstract and training description: The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.

    Authors: We acknowledge the importance of demonstrating robustness to the reward weight choices. The weights in the original work were selected via preliminary validation experiments. To address the concern, the revised manuscript now includes a sensitivity analysis (new Figure 5 and expanded text in Section 3.2) in which we vary the reward weights over a range of values and report the resulting performance on both SIMPLEREnv and real-robot tasks. The analysis shows that generalization remains stable within the tested range, indicating that the reported results are not overly sensitive to precise weight tuning. revision: yes

  3. Referee: [Abstract] Abstract: Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.

    Authors: We appreciate this request for greater transparency. Baseline implementation details, including how competing methods were adapted for fair comparison, are already described in Section 4.1; we have now expanded this description with additional implementation specifics. We have added statistical significance testing (paired t-tests with reported p-values) to the main results tables. Data exclusion criteria used during dataset curation are detailed in Section 3.1. As noted in our response to the first comment, we have also incorporated further ablation studies. These additions should enable full verification of the reported numbers and the 62% relative improvement. revision: yes
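The paired t-test mentioned in the third response can be sketched as follows. This is a minimal illustration using only the Python standard library. The per-task success counts are made-up placeholders, not results from the paper, and the function name is hypothetical. The function returns the t statistic and degrees of freedom; turning these into a p-value would additionally need the Student t distribution from a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(candidate, baseline):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(candidate) != len(baseline) or len(candidate) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # The test operates on per-pair differences, here one difference per task.
    diffs = [c - b for c, b in zip(candidate, baseline)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


# Placeholder data: successes out of 10 trials on four tasks (NOT from the paper).
candidate_successes = [9, 8, 10, 7]
baseline_successes = [5, 6, 7, 4]

t_value, dof = paired_t_statistic(candidate_successes, baseline_successes)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

Pairing by task matters here because task difficulty varies widely; the test asks whether the per-task differences are consistently positive rather than whether the two averages differ.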

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation pipeline

full rationale

The paper defines four embodied pointing abilities as an intermediate representation, constructs the Embodied-Points-200K dataset from external embodied and visual reasoning sources, applies a two-stage RFT training procedure with a multi-task reward, and reports measured success rates on separate benchmarks plus held-out SIMPLEREnv and real XArm tasks. These outcomes are direct empirical results from training and evaluation, not quantities that reduce to the inputs by construction via any equation or self-citation. No load-bearing step equates a claimed prediction or generalization to a fitted parameter or prior self-result; the derivation chain remains a standard ML pipeline with independent test distributions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the pointing representation and the RFT training procedure; these are supported only by the reported empirical outcomes and the assumption that the chosen datasets and reward design capture the necessary variations.

free parameters (1)
  • multi-task reward weights
    The specialized multi-task reward design in the RFT curriculum requires choosing relative weights and functional forms for different pointing and reasoning objectives.
axioms (1)
  • domain assumption: Pointing constitutes an embodiment-agnostic intermediate representation that bridges vision-language comprehension and low-level action primitives.
    This premise is invoked to justify the four core abilities and the overall architecture.
invented entities (1)
  • Embodied pointing abilities (no independent evidence)
    purpose: To structure the model's reasoning and training targets.
    Four abilities are newly defined for this work.
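One way to picture the free parameter listed in this ledger is a weighted sum of per-objective rewards, swept over a small grid to probe sensitivity. The sketch below assumes a linear combination and uses hypothetical component names (`format`, `accuracy`). The paper's actual reward components, functional forms, and weights are not given in this text.

```python
from itertools import product


def combined_reward(components, weights):
    """Weighted sum of per-objective rewards (linear form is an assumption)."""
    missing = set(components) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in components.items())


def weight_grid(names, values):
    """Yield every assignment of the candidate values to the named weights."""
    for combo in product(values, repeat=len(names)):
        yield dict(zip(names, combo))


# Hypothetical per-objective rewards for a single model response.
components = {"format": 1.0, "accuracy": 0.5}

# Sweep each weight over the same candidate values and print the scalar reward.
for weights in weight_grid(["format", "accuracy"], [0.5, 1.0]):
    print(weights, combined_reward(components, weights))
```

A real sensitivity analysis would retrain or re-evaluate the model at each grid point and compare downstream success rates; the loop here only shows how the scalar reward itself shifts with the weights.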




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching

    cs.RO · 2026-05 · unverdicted · novelty 5.0

    ForceFlow improves success rates by 37% on six real-world contact-rich tasks over ForceVLA by treating force as a global regulatory signal in a flow-matching policy with hierarchical vision-to-force decomposition.

  2. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV · 2026-04 · unverdicted · novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
