Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Fei Ni; Haiqin Cui; Hongyao Tang; Jianye Hao; Pengyi Li; Yan Zheng; Yaoting Huang; Yibin Chen; Yifu Yuan; Zibin Dong

arXiv: 2508.13998 · v2 · submitted 2025-08-19 · cs.RO · cs.AI · cs.LG

Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3

keywords: embodied reasoning, robotic manipulation, vision-language model, reinforced fine-tuning, zero-shot generalization, pointing representation, perception-action gap, SIMPLEREnv

The pith

Defining four core embodied pointing abilities and training a 3B model with reinforced fine-tuning bridges vision-language understanding to robot actions for zero-shot generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the seeing-to-doing gap in embodied AI that comes from scarce specialized data and differences across robot bodies. It establishes pointing as a single intermediate format that works for any robot by defining four specific abilities that turn visual and language inputs into action steps. A large dataset of 200K pointing examples is assembled from embodied and general sources, then used to train Embodied-R1 through a two-stage reinforced fine-tuning process with multi-task rewards. A sympathetic reader would care because success here would mean robots can handle new tasks and physical setups using existing vision-language models without retraining for each case.

Core claim

Embodied-R1 pioneers pointing as an embodiment-agnostic intermediate representation through four core abilities that link high-level vision-language comprehension to low-level action primitives. The 3B model is trained with a two-stage reinforced fine-tuning curriculum on the Embodied-Points-200K dataset and reaches state-of-the-art results on 11 embodied spatial and pointing benchmarks. It further shows robust zero-shot generalization with 56.2 percent success in the SIMPLEREnv and 87.5 percent across eight real-world XArm tasks with no task-specific fine-tuning, a 62 percent gain over strong baselines, plus high robustness to visual disturbances.

What carries the argument

Pointing serves as a unified, embodiment-agnostic intermediate representation, consisting of four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives.
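
To make the idea of a point as an embodiment-agnostic target more concrete, here is a minimal sketch of one common way a 2D image point can be lifted to a 3D camera-frame target: pinhole back-projection with a depth reading. The summary above does not say how Embodied-R1 converts points into actions, so this is an assumption for illustration only. The `Intrinsics` type, the `backproject` function, and the sample camera values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters (hypothetical; supplied by the caller)."""
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point, x coordinate
    cy: float  # principal point, y coordinate


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Illustrative numbers only: a 640x480 camera and one predicted image point.
camera = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
target = backproject(u=380.0, v=240.0, depth_m=0.5, k=camera)
print(target)
```

The resulting 3D point is independent of any particular arm; a downstream planner for a given robot would still need its own camera-to-base transform and motion primitives.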

If this is right

The model reaches state-of-the-art performance on 11 embodied spatial and pointing benchmarks.
It delivers 56.2 percent success in simulation and 87.5 percent on real XArm tasks without any task-specific fine-tuning.
Performance improves 62 percent over strong baselines on the real-world tasks.
The model remains robust under diverse visual disturbances.
A pointing-centric representation paired with reinforced fine-tuning supplies a generalizable route to closing the perception-action gap.
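
The 62 percent figure is easier to interpret with the arithmetic written out. The sketch below assumes the gain is relative (as the referee report later describes it) and that it applies to the 87.5 percent real-world result; the summary does not state the baseline score, so the value computed here is only what those two assumptions imply, not a number reported by the paper. The function names are hypothetical.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction (0.62 means 62%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


def implied_baseline(new: float, gain: float) -> float:
    """Baseline score that would yield `gain` (a fraction) given the new score."""
    return new / (1.0 + gain)


# Assumption: the 62% gain is relative and refers to the 87.5% real-world result.
baseline = implied_baseline(87.5, 0.62)
print(round(baseline, 1))
```

Under those assumptions the implied baseline is roughly 54 percent. If the gain were instead measured in absolute percentage points, the baseline would be 25.5 percent, which is why the distinction matters when reading the headline number.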

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pointing representation might transfer to other embodied domains such as navigation or object rearrangement without major redesign.
Increasing the size of the Embodied-Points dataset or the model could produce further gains on more complex multi-step tasks.
Direct comparison of the four abilities against alternative intermediate representations on the same robot hardware would clarify their relative contribution.

Load-bearing premise

That the four core embodied pointing abilities are sufficient to connect high-level vision-language comprehension to the low-level action primitives needed by heterogeneous robot embodiments and unseen tasks.

What would settle it

Running the model on a new manipulation task whose required actions fall outside the four defined pointing abilities and observing success rates drop well below the reported 56.2 percent in simulation or 87.5 percent in real settings would falsify the claim.
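
Deciding whether a success rate has dropped "well below" a reported figure depends on how many trials were run. As a hedged illustration, the sketch below uses a Wilson score interval (a standard interval for binomial proportions) and calls a result clearly below the reported rate only when the whole interval sits under it. The trial counts are hypothetical placeholders, since the summary does not give per-task trial numbers, and the helper names are invented for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (center - half, center + half)


def clearly_below(successes: int, trials: int, reported_rate: float) -> bool:
    """True when the entire interval lies below the reported success rate."""
    _, upper = wilson_interval(successes, trials)
    return upper < reported_rate


# Hypothetical trial counts on an out-of-scope task, compared with the 87.5% figure.
print(clearly_below(10, 40, 0.875))
print(clearly_below(33, 40, 0.875))
```

With few trials the interval is wide, so a modest drop would not count as evidence against the claim under this criterion.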

Figures

Figures reproduced from arXiv: 2508.13998 by Fei Ni, Haiqin Cui, Hongyao Tang, Jianye Hao, Pengyi Li, Yan Zheng, Yaoting Huang, Yibin Chen, Yifu Yuan, Zibin Dong.

**Figure 2.** Overview of four embodied pointing abilities.

**Figure 3.** Overview of training data: In stage 1, we focus on improving the model’s spatial reasoning capability, while incorporating a small amount of general reasoning data. In stage 2, we train the model’s embodied pointing capabilities, which comprise four distinct capability items.

**Figure 4.** Visualizing Embodied-R1’s Performance on Various Pointing Tasks.

**Figure 5.** The process of Embodied-R1 performing real-world tasks.

**Figure 6.** The process of Embodied-R1 performing Task 6 under different visual disturbances.

**Figure 7.** Case Analysis: Embodied-R1 possesses embodied reasoning capabilities. It can progressively locate relevant objects and infer spatial relationships according to task instructions, and ultimately provide coordinates through pointing based on embodied scene analysis.

**Figure 8.** Embodied-R1 exhibits strong generalization capabilities.

**Figure 9.** Qualitative comparison of Embodied-R1 and the SFT baseline.

**Figure 10.** Visualizing Embodied-R1’s Prediction on VTG Tasks across Various Scenarios.

Original abstract

Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Embodied-R1 frames pointing as an intermediate representation and gets concrete zero-shot numbers on real robots, but the lack of ablations leaves the generalization mechanism untested.

Embodied-R1 stands out for treating pointing as an embodiment-agnostic intermediate step that connects vision-language understanding to low-level actions. The four defined abilities plus the two-stage reinforced fine-tuning on the 200K dataset give the work a distinct setup compared to standard VLM adaptation for robotics.

The paper builds Embodied-Points-200K from mixed embodied and visual sources and applies a multi-task reward design during training. This produces state-of-the-art results on 11 benchmarks and the headline numbers of 56.2 percent success in SIMPLEREnv plus 87.5 percent across eight real XArm tasks without task-specific fine-tuning, along with reported robustness to visual disturbances. Those concrete outcomes are the part worth noting if they hold under scrutiny.

The main gap is in the supporting experiments. No ablations isolate the contribution of each pointing ability or separate pointing mistakes from downstream control errors. It is also unclear how much the evaluation tasks overlap with the training distribution, which directly affects how much credit the zero-shot claim deserves. Baseline details on implementation and statistical checks would make the 62 percent improvement easier to assess.

This paper is for researchers working on VLMs for robotic manipulation and generalization across embodiments. Someone looking for practical intermediate representations to bridge seeing and doing would find the pointing idea and RFT curriculum worth examining.

I would send it for peer review. The empirical results are specific and the core framing is different enough from prior work that referees can usefully press for the missing controls and clarify the generalization story.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Embodied-R1, a 3B vision-language model for embodied reasoning that defines four core embodied pointing abilities as an embodiment-agnostic intermediate representation bridging high-level vision-language comprehension and low-level action primitives. It constructs the Embodied-Points-200K dataset from embodied and general visual reasoning sources, then trains the model via a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward. The model reports state-of-the-art results on 11 embodied spatial and pointing benchmarks and demonstrates zero-shot generalization, achieving 56.2% success in SIMPLEREnv and 87.5% across 8 real-world XArm tasks without task-specific fine-tuning (a claimed 62% relative improvement over strong baselines), plus robustness to visual disturbances.

Significance. If the central empirical claims hold after verification, the work could meaningfully advance embodied AI and robotics by offering a scalable pointing-centric pathway to close the perception-action gap across heterogeneous embodiments. The combination of large-scale dataset curation, RFT training, and real-robot zero-shot results on XArm tasks represents a concrete strength if baselines and ablations are properly documented; this could influence future generalizable manipulation systems.

major comments (3)

[Abstract] The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.
[Abstract and training description] The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.
[Abstract] Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.

minor comments (1)

[Abstract] The specific list of the 11 embodied spatial and pointing benchmarks and their individual performance numbers are not provided, only the aggregate SOTA claim and the two generalization tasks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below in a point-by-point manner and have revised the manuscript accordingly to improve clarity and strengthen the empirical support for our claims.

Referee: [Abstract] The headline zero-shot generalization claim (56.2% SIMPLEREnv success, 87.5% on 8 XArm tasks, 62% relative improvement) is presented as evidence that the four defined embodied pointing abilities close the seeing-to-doing gap, yet the manuscript provides no ablation isolating the contribution of each ability, no failure-case breakdown separating pointing errors from downstream execution errors, and no explicit verification that the 8 real tasks lie outside the Embodied-Points-200K distribution. This directly undermines assessment of whether the abilities are necessary and sufficient for the reported transfer.

Authors: We agree that additional analysis would better substantiate the role of the four embodied pointing abilities. In the revised manuscript, we have added a dedicated ablation study (new Section 4.3) that isolates the performance contribution of each ability by training variants with individual abilities removed. We have also included a failure-case breakdown in the supplementary material that categorizes errors into pointing inaccuracies versus downstream execution issues. Finally, we have explicitly verified and stated in Section 5.2 that the eight real-world XArm tasks were held out from the Embodied-Points-200K dataset and represent novel scenarios, thereby supporting the zero-shot generalization claim.

Revision: yes

Referee: [Abstract and training description] The multi-task reward design in the two-stage RFT curriculum is described as specialized, but the presence of free parameters (reward weights) is not analyzed for sensitivity; without this, it is unclear whether the reported cross-embodiment and cross-task generalization is robust or dependent on task-specific tuning that would not generalize to truly novel tasks.

Authors: We acknowledge the importance of demonstrating robustness to the reward weight choices. The weights in the original work were selected via preliminary validation experiments. To address the concern, the revised manuscript now includes a sensitivity analysis (new Figure 5 and expanded text in Section 3.2) in which we vary the reward weights over a range of values and report the resulting performance on both SIMPLEREnv and real-robot tasks. The analysis shows that generalization remains stable within the tested range, indicating that the reported results are not overly sensitive to precise weight tuning.

Revision: yes

Referee: [Abstract] Concrete performance numbers and the 62% relative improvement are reported without details on baseline implementations, statistical significance testing, data exclusion criteria, or ablation studies, preventing full verification of the central generalization claims from the provided text.

Authors: We appreciate this request for greater transparency. Baseline implementation details, including how competing methods were adapted for fair comparison, are already described in Section 4.1; we have now expanded this description with additional implementation specifics. We have added statistical significance testing (paired t-tests with reported p-values) to the main results tables. Data exclusion criteria used during dataset curation are detailed in Section 3.1. As noted in our response to the first comment, we have also incorporated further ablation studies. These additions should enable full verification of the reported numbers and the 62% relative improvement.

Revision: yes
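
The paired t-tests mentioned in the rebuttal can be illustrated with a short sketch. It computes only the t statistic and degrees of freedom from per-task score pairs; turning that into a p-value needs a t distribution, for which `scipy.stats.ttest_rel` is one existing option. The function name and the sample values are hypothetical placeholders chosen to be easy to check by hand, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder scores for four paired conditions (not data from the paper).
method_scores = [3.0, 5.0, 4.0, 6.0]
baseline_scores = [1.0, 2.0, 2.0, 3.0]
t_value, dof = paired_t_statistic(method_scores, baseline_scores)
print(t_value, dof)
```

Pairing by task matters here: it removes between-task difficulty from the comparison, which an unpaired test would treat as noise.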

Circularity Check

0 steps flagged

No circularity: empirical training and held-out evaluation pipeline

The paper defines four embodied pointing abilities as an intermediate representation, constructs the Embodied-Points-200K dataset from external embodied and visual reasoning sources, applies a two-stage RFT training procedure with a multi-task reward, and reports measured success rates on separate benchmarks plus held-out SIMPLEREnv and real XArm tasks. These outcomes are direct empirical results from training and evaluation, not quantities that reduce to the inputs by construction via any equation or self-citation. No load-bearing step equates a claimed prediction or generalization to a fitted parameter or prior self-result; the derivation chain remains a standard ML pipeline with independent test distributions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the pointing representation and the RFT training procedure; these are supported only by the reported empirical outcomes and the assumption that the chosen datasets and reward design capture the necessary variations.

free parameters (1)

multi-task reward weights
The specialized multi-task reward design in the RFT curriculum requires choosing relative weights and functional forms for different pointing and reasoning objectives.
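
To show where such weights enter, here is a minimal sketch that combines per-objective rewards as a weighted sum. The weighted sum is only one possible functional form and is an assumption; the summary does not describe the paper's actual reward terms. The objective names `format` and `point_accuracy`, the weights, and the `combined_reward` function are all hypothetical.

```python
def combined_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-objective rewards; every component needs a weight."""
    missing = set(components) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical objectives and weights, for illustration only.
weights = {"format": 0.2, "point_accuracy": 0.8}
rollout = {"format": 1.0, "point_accuracy": 0.5}
print(combined_reward(rollout, weights))
```

Each entry in `weights` is a free parameter in the ledger's sense: changing it shifts which behaviour the policy is rewarded for, which is why a sensitivity analysis over these values is a reasonable request.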

axioms (1)

domain assumption: Pointing constitutes an embodiment-agnostic intermediate representation that bridges vision-language comprehension and low-level action primitives.
This premise is invoked to justify the four core abilities and the overall architecture.

invented entities (1)

Embodied pointing abilities (no independent evidence)
purpose: To structure the model's reasoning and training targets.
Four abilities are newly defined for this work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "pioneer 'pointing' as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities... REG, RRG, OFG, VTG"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching
cs.RO · 2026-05 · unverdicted · novelty 5.0

ForceFlow improves success rates by 37% on six real-world contact-rich tasks over ForceVLA by treating force as a global regulatory signal in a flow-matching policy with hierarchical vision-to-force decomposition.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
cs.CV · 2026-04 · unverdicted · novelty 4.0

XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 2 Pith papers · 20 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, HumenZhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, ZherenFu, YihengXu, JiaboYe, XiZhang, TianbaoXie, ZesenCheng, HangZhang, ZhiboYang, HaiyangXu, and Junyang Lin. Qwen2.5-vl technical report, 2025a. URLh...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

Wenxiao Cai, Iaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, and Bo Zhao. Spatialbot: Precise spatial understanding with vision language models.arXiv preprint arXiv:2406.13642,

work page arXiv
[3]

arXiv preprint arXiv:2502.05450 , year=

Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, and Dongbin Zhao. Conrft: A reinforced fine-tuning method for vla models via consistency policy.arXiv preprint arXiv:2502.05450,

work page arXiv
[4]

3 Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, and Maosong Sun

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision language model.arXiv preprint arXiv:2406.01584,

work page arXiv
[5]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Action-free reasoning for policy generalization.arXiv preprint arXiv:2502.03729, 2025

Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729,

work page arXiv
[7]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv preprint arXiv:2409.17146,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Embspatial-bench: Benchmarking spa- tialunderstandingforembodiedtaskswithlargevision-languagemodels

Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spa- tialunderstandingforembodiedtaskswithlargevision-languagemodels. arXivpreprintarXiv:2406.05756,

work page arXiv
[9]

Rt-trajectory: Robotic task generalization via hindsight trajectory sketches.arXiv preprint arXiv:2311.01977,

Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches.arXiv preprint arXiv:2311.01977,

work page arXiv
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

A3vlm: Actionable articulation-aware vision language model.arXiv preprint arXiv:2406.07549, 2024

Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, and Hong- sheng Li. A3vlm: Actionable articulation-aware vision language model.arXiv preprint arXiv:2406.07549, 2024a. Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic afford...

work page arXiv
[13]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation.arXiv preprint arXiv:2409.01652, 2024c. Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o syste...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Robobrain: A unified brain model for robotic manipulation from abstract to concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, and Shanghang Zhang. Robobrain: A unified brain model for robotic manipulation from abstract to concrete. arXiv preprint arXiv:2502.21257,

work page arXiv
[15]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

URLhttps://arxiv.org/ abs/2310.06770. Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s "up" with vision-language models? investigating their struggle with spatial reasoning,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Kamath, J

URLhttps://arxiv.org/abs/2310.19785. Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rup- precht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos.arXiv preprint arXiv:2410.11831,

work page arXiv
[17]

ReferItGame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, October

work page 2014
[18]

OpenVLA: An Open-Source Vision-Language-Action Model

Association for Computational Linguistics. doi: 10.3115/v1/ D14-1086. URL https://aclanthology.org/D14-1086. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.3115/v1/
[19]

MolmoAct: Action Reasoning Models that can Reason in Space

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space.arXiv preprint arXiv:2508.07917,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Topviewrs: Vision- language models as top-view spatial reasoners.arXiv preprint arXiv:2406.02537, 2024a

Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. Topviewrs: Vision- language models as top-view spatial reasoners.arXiv preprint arXiv:2406.02537, 2024a. Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, et al. Llara: Superchar...

work page arXiv
[21]

Laso: Language-guided affordance segmentation on 3d object

Yicong Li, Na Zhao, Junbin Xiao, Chun Feng, Xiang Wang, and Tat-seng Chua. Laso: Language-guided affordance segmentation on 3d object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14251–14260, 2024e. Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. Reasoning paths with reference objects elicit qua...

work page arXiv
[22]

Data scaling laws in imitation learning for robotic manipulation.arXiv preprint arXiv:2410.18647,

Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation.arXiv preprint arXiv:2410.18647,

work page arXiv
[23]

Moka: Open-world robotic manipulation through mark-based visual prompting, 2024a

Fangchen Liu, Kuan Fang, Pieter Abbeel, and Sergey Levine. Moka: Open-world robotic manipulation through mark-based visual prompting, 2024a. URLhttps://arxiv.org/abs/2403.03174. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024b. URLhttps://llava-vl.githu...

work page arXiv 2024
[24]

Rt-affordance: Affordances are versatile intermediate representations for robot manipulation.arXiv preprint arXiv:2411.02704,

Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, and Ted Xiao. Rt-affordance: Affordances are versatile intermediate representations for robot manipulation.arXiv preprint arXiv:2411.02704,

work page arXiv
[25]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Jake O’Neill, Abraham Arthurs, Fábio Avila Belbute-Peres, Julian Balaguer, Sarah Bechtle, Gemma Bidoia, Kyle Burden, Erwin Chang, Sheila Chen, Todor Davchev, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

OpenAI o1 System Card

URL https://arxiv.org/abs/2412.16720. Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, GuofanFan, etal. Sofar: Language-groundedorientationbridgesspatialreasoningandobjectmanipulation. arXiv preprint arXiv:2502.13143,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Sat: Spatial aptitude training for multimodal language models

Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755,

work page arXiv
[29]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,

[30]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[31]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. arXiv preprint arXiv:2411.16537,

[32]

Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai

Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425,

[33]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,

[34]

Octo: An open-source generalist robot policy

Octo Team, RT-X Team, Anthony Brohan, Noah Brown, Lauren Chen, Michael Cheng, Krzysztof Choromanski, Eamonn Cullina, Gabe Dalal, Chelsea Fu, Florian Golemo, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2403.10164,

[35]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning, 2025a. URL https://arxiv.org/abs/2504.08837. Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing projec...

[36]

PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes

Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199,

[37]

Flow as the cross-domain manipulation interface

Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, and Shuran Song. Flow as the cross-domain manipulation interface. arXiv preprint arXiv:2407.15208,

[38]

A0: An affordance-aware hierarchical model for general robotic manipulation

Rongtao Xu, Jian Zhang, Minghao Guo, Youpeng Wen, Haoting Yang, Min Lin, Jianzheng Huang, Zhe Li, Kaidong Zhang, Liqiong Wang, et al. A0: An affordance-aware hierarchical model for general robotic manipulation. arXiv preprint arXiv:2504.12636,

[39]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171,

[40]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

[41]

General flow as foundation affordance for scalable robot learning

Chengbo Yuan, Chuan Wen, Tong Zhang, and Yang Gao. General flow as foundation affordance for scalable robot learning. arXiv preprint arXiv:2401.11439, 2024a. Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and D...

[42]

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

URL https://arxiv.org/abs/2505.08548. Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693,

[43]

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345,

[44]

Where, what, why: Towards explainable driver attention prediction

Yuchen Zhou, Jiayu Tang, Xiaoyan Xiao, Yueyao Lin, Linkai Liu, Zipeng Guo, Hao Fei, Xiaobo Xia, and Chao Gou. Where, what, why: Towards explainable driver attention prediction. arXiv preprint arXiv:2506.23088,

[45]

place the cup between the book and the spoon

Appendix A. Automatic Data Generation Pipeline. In this section, we provide additional explanations regarding the generation of certain datasets. The generation processes of both the RRG and VTG datasets improve on those of Yuan et al. (2025). 3D RRG Data Generation using Isaa...

[46]

In Embodied-R1, we performed reinforcement learning training based on GRPO Shao et al

The optimizer is AdamW, with a learning rate of 1e-6 and a weight decay coefficient of 1e-2. Embodied-R1 is trained with reinforcement learning based on GRPO (Shao et al., 2024), with the number of samples set to 8, a KL penalty (coefficient 1e-2), and a global batch size of 128 per step. For all experiments, we focus on ...
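As a rough illustration of the settings quoted above, the following minimal sketch collects them as constants and shows the group-relative reward normalization that GRPO (Shao et al., 2024) is built around. It is not the authors' training code: the function name, the example rewards, the use of the population standard deviation, and the `eps` term are assumptions made for illustration only.

```python
from statistics import mean, pstdev

# Hyperparameters as quoted in the text; the constant names are illustrative.
LEARNING_RATE = 1e-6
WEIGHT_DECAY = 1e-2
NUM_SAMPLES = 8          # sampled responses per group
KL_COEF = 1e-2           # coefficient of the KL penalty
GLOBAL_BATCH_SIZE = 128  # per training step


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps for
    numerical stability; actual implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 8 responses sampled for a single prompt.
rewards = [1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses with above-average reward in their group receive positive advantages and the rest receive negative ones, so no separate value network is required for the baseline.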

[47]

prepare a meal

It can be seen that Embodied-R1 achieves accurate visual trajectory prediction across various scenarios. Figure panel labels: move brown chip bag near 7up can; Place the teapot on the stove; move green jalapeno chip bag near apple; place the burger meat in the oven; move green can near sponge; Put the blue block on the orange plate; move the tomato from the cloth to table between the...

