Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3
The pith
Defining four core embodied pointing abilities and training a 3B model with reinforced fine-tuning bridges vision-language understanding to robot actions for zero-shot generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embodied-R1 pioneers pointing as an embodiment-agnostic intermediate representation through four core abilities that link high-level vision-language comprehension to low-level action primitives. The 3B model is trained with a two-stage reinforced fine-tuning curriculum on the Embodied-Points-200K dataset and reaches state-of-the-art results on 11 embodied spatial and pointing benchmarks. It further shows robust zero-shot generalization with 56.2 percent success in the SIMPLEREnv and 87.5 percent across eight real-world XArm tasks with no task-specific fine-tuning, a 62 percent gain over strong baselines, plus high robustness to visual disturbances.
What carries the argument
pointing as a unified embodiment-agnostic intermediate representation consisting of four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives
If this is right
- The model reaches state-of-the-art performance on 11 embodied spatial and pointing benchmarks.
- It delivers 56.2 percent success in simulation and 87.5 percent on real XArm tasks without any task-specific fine-tuning.
- Performance improves 62 percent over strong baselines on the real-world tasks.
- The model remains robust under diverse visual disturbances.
- A pointing-centric representation paired with reinforced fine-tuning supplies a generalizable route to closing the perception-action gap.
Where Pith is reading between the lines
- The same pointing representation might transfer to other embodied domains such as navigation or object rearrangement without major redesign.
- Increasing the size of the Embodied-Points dataset or the model could produce further gains on more complex multi-step tasks.
- Direct comparison of the four abilities against alternative intermediate representations on the same robot hardware would clarify their relative contribution.
Load-bearing premise
That the four core embodied pointing abilities are sufficient to connect high-level vision-language comprehension to the low-level action primitives needed by heterogeneous robot embodiments and unseen tasks.
What would settle it
Running the model on a new manipulation task whose required actions fall outside the four defined pointing abilities and observing success rates drop well below the reported 56 percent in simulation or 87 percent in real settings would falsify the claim.
Figures
read the original abstract
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Embodied-R1, a 3B vision-language model for embodied reasoning that defines four core embodied pointing abilities as an embodiment-agnostic intermediate representation bridging high-level vision-language comprehension and low-level action primitives. It constructs the Embodied-Points-200K dataset from embodied and general visual reasoning sources, then trains the model via a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward. The model reports state-of-the-art results on 11 embodied spatial and pointing benchmarks and demonstrates zero-shot generalization, achieving 56.2% success in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without task-specific fine-tuning (a claimed 62% relative improvement over strong baselines), plus robustness to visual disturbances.
Significance. If the central empirical claims hold after verification, the work could meaningfully advance embodied AI and robotics by offering a scalable pointing-centric pathway to close the perception-action gap across heterogeneous embodiments. The combination of large-scale dataset curation, RFT training, and real-robot zero-shot results on XArm tasks represents a concrete strength if baselines and ablations are properly documented; this could influence future generalizable manipulation systems.
major comments (3)
- [Abstract] Abstract: The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.
- [Abstract] Abstract and training description: The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.
- [Abstract] Abstract: Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.
minor comments (1)
- [Abstract] Abstract: The specific list of the 11 embodied spatial and pointing benchmarks and their individual performance numbers are not provided, only the aggregate SOTA claim and the two generalization tasks.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below in a point-by-point manner and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.
Authors: We agree that additional analysis would better substantiate the role of the four embodied pointing abilities. In the revised manuscript, we have added a dedicated ablation study (new Section 4.3) that isolates the performance contribution of each ability by training variants with individual abilities removed. We have also included a failure-case breakdown in the supplementary material that categorizes errors into pointing inaccuracies versus downstream execution issues. Finally, we have explicitly verified and stated in Section 5.2 that the eight real-world XArm tasks were held out from the Embodied-Points-200K dataset and represent novel scenarios, thereby supporting the zero-shot generalization claim. revision: yes
-
Referee: [Abstract] Abstract and training description: The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.
Authors: We acknowledge the importance of demonstrating robustness to the reward weight choices. The weights in the original work were selected via preliminary validation experiments. To address the concern, the revised manuscript now includes a sensitivity analysis (new Figure 5 and expanded text in Section 3.2) in which we vary the reward weights over a range of values and report the resulting performance on both SIMPLEREnv and real-robot tasks. The analysis shows that generalization remains stable within the tested range, indicating that the reported results are not overly sensitive to precise weight tuning. revision: yes
-
Referee: [Abstract] Abstract: Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.
Authors: We appreciate this request for greater transparency. Baseline implementation details, including how competing methods were adapted for fair comparison, are already described in Section 4.1; we have now expanded this description with additional implementation specifics. We have added statistical significance testing (paired t-tests with reported p-values) to the main results tables. Data exclusion criteria used during dataset curation are detailed in Section 3.1. As noted in our response to the first comment, we have also incorporated further ablation studies. These additions should enable full verification of the reported numbers and the 62% relative improvement. revision: yes
Circularity Check
No circularity: empirical training and held-out evaluation pipeline
full rationale
The paper defines four embodied pointing abilities as an intermediate representation, constructs the Embodied-Points-200K dataset from external embodied and visual reasoning sources, applies a two-stage RFT training procedure with a multi-task reward, and reports measured success rates on separate benchmarks plus held-out SIMPLEREnv and real XArm tasks. These outcomes are direct empirical results from training and evaluation, not quantities that reduce to the inputs by construction via any equation or self-citation. No load-bearing step equates a claimed prediction or generalization to a fitted parameter or prior self-result; the derivation chain remains a standard ML pipeline with independent test distributions.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-task reward weights
axioms (1)
- domain assumption Pointing constitutes an embodiment-agnostic intermediate representation that bridges vision-language comprehension and low-level action primitives.
invented entities (1)
-
Embodied pointing abilities
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
pioneer 'pointing' as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities... REG, RRG, OFG, VTG
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching
ForceFlow improves success rates by 37% on six real-world contact-rich tasks over ForceVLA by treating force as a global regulatory signal in a flow-matching policy with hierarchical vision-to-force decomposition.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, HumenZhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, ZherenFu, YihengXu, JiaboYe, XiZhang, TianbaoXie, ZesenCheng, HangZhang, ZhiboYang, HaiyangXu, and Junyang Lin. Qwen2.5-vl technical report, 2025a. URLh...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,
-
[3]
arXiv preprint arXiv:2502.05450 , year=
Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy.arXiv preprint arXiv:2502.05450,
-
[4]
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision language model.arXiv preprint arXiv:2406.01584,
-
[5]
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025
Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729,
-
[7]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv preprint arXiv:2409.17146,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Embspatial-bench: Benchmarking spa- tialunderstandingforembodiedtaskswithlargevision-languagemodels
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spa- tialunderstandingforembodiedtaskswithlargevision-languagemodels. arXivpreprintarXiv:2406.05756,
-
[9]
Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches.arXiv preprint arXiv:2311.01977,
-
[10]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
A3vlm: Actionable articulation-aware vision language model.arXiv preprint arXiv:2406.07549, 2024
Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, and Hong- sheng Li. A3vlm: Actionable articulation-aware vision language model.arXiv preprint arXiv:2406.07549, 2024a. Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic afford...
-
[13]
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation.arXiv preprint arXiv:2409.01652, 2024c. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o syste...
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Robobrain: A unified brain model for robotic manipulation from abstract to concrete
Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, and Shanghang Zhang. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. arXiv preprint arXiv:2502.21257,
-
[15]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
URLhttps://arxiv.org/ abs/2310.06770. Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s "up" with vision-language models? investigating their struggle with spatial reasoning,
work page internal anchor Pith review Pith/arXiv arXiv
- [16]
-
[17]
ReferItGame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October
work page 2014
-
[18]
OpenVLA: An Open-Source Vision-Language-Action Model
Association for Computational Linguistics. doi: 10.3115/v1/ D14-1086. URL https://aclanthology.org/D14-1086. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.3115/v1/
-
[19]
MolmoAct: Action Reasoning Models that can Reason in Space
Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision- language models as top-view spatial reasoners.arXiv preprint arXiv:2406.02537, 2024a. Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Superchar...
-
[21]
Laso: Language-guided affordance segmentation on 3d object
Yicong Li, Na Zhao, Junbin Xiao, Chun Feng, Xiang Wang, and Tat-seng Chua. Laso: Language-guided affordance segmentation on 3d object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14251–14260, 2024e. Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit qua...
-
[22]
Data scaling laws in imitation learning for robotic manipulation.arXiv preprint arXiv:2410.18647,
Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation.arXiv preprint arXiv:2410.18647,
-
[23]
Moka: Open-world robotic manipulation through mark-based visual prompting, 2024a
Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark-based visual prompting, 2024a. URLhttps://arxiv.org/abs/2403.03174. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. URLhttps://llava-vl.githu...
-
[24]
Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate representations for robot manipulation.arXiv preprint arXiv:2411.02704,
-
[25]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Jake O’Neill, Abraham Arthurs, Fábio Avila Belbute-Peres, Julian Balaguer, Sarah Bechtle, Gemma Bidoia, Kyle Burden, Erwin Chang, Sheila Chen, Todor Davchev, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
URL https://arxiv.org/abs/2412.16720. Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, GuofanFan, etal. Sofar: Language-groundedorientationbridgesspatialreasoningandobjectmanipulation. arXiv preprint arXiv:2502.13143,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Sat: Spatial aptitude training for multimodal language models
Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755,
-
[29]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospa- tial: Teaching spatial understanding to 2d and 3d vision-language models for robotics.arXiv preprint arXiv:2411.16537,
-
[32]
Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,
Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425,
-
[33]
23 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Octo: An open-source generalist robot policy
Octo Team, RT-X Team, Anthony Brohan, Noah Brown, Lauren Chen, Michael Cheng, Krzysztof Choromanski, Eamonn Cullina, Gabe Dalal, Chelsea Fu, Florian Golemo, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2403.10164,
-
[35]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
HaozheWang,ChaoQu,ZumingHuang,WeiChu,FangzhenLin,andWenhuChen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning, 2025a. URLhttps://arxiv.org/ abs/2504.08837. Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing projec...
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes.arXiv preprint arXiv:1711.00199,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Flow as the cross-domain manipulation interface
Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, and Shuran Song. Flow as the cross-domain manipulation interface.arXiv preprint arXiv:2407.15208,
-
[38]
A0: An affordance-aware hierarchical model for general robotic manipulation
Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, et al. A0: An affordance-aware hierarchical model for general robotic manipulation. arXiv preprint arXiv:2504.12636,
-
[39]
arXiv preprint arXiv:2412.14171 , year=
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv preprint arXiv:2412.14171,
-
[40]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
General flow as foundation affordance for scalable robot learning
Chengbo Yuan, Chuan Wen, Tong Zhang, and Yang Gao. General flow as foundation affordance for scalable robot learning.arXiv preprint arXiv:2401.11439, 2024a. 24 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and D...
-
[42]
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
URL https://arxiv.org/abs/2505.08548. Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693,
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345,
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Where, what, why: Towardsexplainabledriverattentionprediction.arXivpreprintarXiv:2506.23088,
Yuchen Zhou, Jiayu Tang, Xiaoyan Xiao, Yueyao Lin, Linkai Liu, Zipeng Guo, Hao Fei, Xiaobo Xia, and Chao Gou. Where, what, why: Towardsexplainabledriverattentionprediction.arXivpreprintarXiv:2506.23088,
-
[45]
place the cup between the book and the spoon
25 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation Appendix A. Automatic Data Generation Pipeline In this section, we provide additional explanations regarding the generation of certain datasets. The generation processes of both the RRG and VTG datasets are improved based on Yuan et al. (2025). 3D RRG Data Generation using Isaa...
work page 2025
-
[46]
In Embodied-R1, we performed reinforcement learning training based on GRPO Shao et al
The optimizer selected is AdamW, with a learning rate of 1e-6 and a weight decay coefficient of 1e-2. In Embodied-R1, we performed reinforcement learning training based on GRPO Shao et al. (2024), set the number of samples to 8, and introduced a KL penalty (coefficient 1e-2), with a global batch size of 128 for each step. For all experiments, we focus on ...
work page 2024
-
[47]
It can be seen that Embodied-R1 achieves accurate visual trajectory prediction across various scenarios. move brown chip bag near 7up can Place the teapot on the stovemove green jalapeno chip bag near apple place the burger meat in the ovenmove green can near sponge Put the blue block on the orange plate move the tomato from the cloth to table between the...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.