Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

Jia-Hong Lai; Jin-Cheng Jhang; KangTung-Hsu; Shiang-Feng Tsai; Shih-Yun Wong; Yen-Ling Tai; Yi-Ting Chen

arxiv: 2606.27663 · v1 · pith:N64O7X3Inew · submitted 2026-06-26 · 💻 cs.RO

Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization

Shiang-Feng Tsai , Jin-Cheng Jhang , Yen-Ling Tai , Jia-Hong Lai , Shih-Yun Wong , KangTung-Hsu , Yi-Ting Chen This is my paper

Pith reviewed 2026-06-29 05:02 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action models3D groundingspatial generalizationtask generalizationrobot manipulationaction head injectionadaptive layer normalization

0 comments

The pith

Injecting a grounded 3D point directly into the action head unlocks spatial and task generalization in vision-language-action models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models remain brittle when object positions shift or instructions change at test time. The paper demonstrates that language or 2D visual prompting of grounding fails to resolve this, while a 3D point representation of the target, converted to gripper-relative displacement and fed straight into the action head, produces large gains. A two-layer MLP computes the embedding and applies it through adaptive layer normalization without altering the pretrained backbone. On LIBERO-PRO this raises success rates under task perturbation from 31.2 to 77.5 and under position perturbation from 28.1 to 60.2 for one backbone, with similar lifts for another. The central claim is that the representation and injection site of the grounding signal, not additional pretraining, drives the generalization improvement.

Core claim

Exploiting a 3D point-based representation and feeding it directly to the action head leads to substantial improvements, revealing that how the grounding signal is represented and injected into the VLA is the true game changer. The lightweight module represents the grounding signal in 3D, computes its relative displacement to the gripper, and injects the resulting spatial embedding directly into the action head through adaptive layer normalization.

What carries the argument

Lightweight two-layer MLP module that lifts grounding to a 3D point, computes its relative displacement to the gripper, and injects the spatial embedding into the action head via adaptive layer normalization

If this is right

The gains appear for multiple VLA backbones without any change to pretraining or the core model
Both task instruction changes and object position shifts are addressed by the same injection mechanism
The module adds negligible parameters yet delivers over 30-point average success improvements
The result isolates injection method as the decisive factor rather than the amount of grounding data

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar direct geometric injection could be tested in other multimodal policies that currently rely on language prompting for spatial cues
The finding implies that action heads in existing VLAs may lack built-in mechanisms to extract 3D relations from visual features alone
Extending the module to multiple 3D points or dynamic gripper-object relations would be a direct next measurement

Load-bearing premise

The method assumes an accurate 3D point for the target object can be obtained and that its relative displacement to the gripper alone supplies sufficient spatial information for the action head to generalize.

What would settle it

Applying the module on LIBERO-PRO under task and position perturbations but replacing the 3D relative-displacement embedding with a 2D coordinate or an inaccurate point and seeing no success-rate gains would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.27663 by Jia-Hong Lai, Jin-Cheng Jhang, KangTung-Hsu, Shiang-Feng Tsai, Shih-Yun Wong, Yen-Ling Tai, Yi-Ting Chen.

**Figure 1.** Figure 1: Comparison of grounding representation and injection paradigms. Given an identical 2D target point, appending coordinates to the language prompt and overlaying it as a visual prompt on the image yield slight improvements over the baseline, whereas lifting it into 3D and injecting it into the action head leads to substantially better performance on LIBERO-PRO benchmark. Specifically, as shown in [PITH_F… view at source ↗

**Figure 2.** Figure 2: Overview of our proposed pipeline. Given a task instruction, (1) any off-the-shelf visual grounding module localizes the task-relevant object as a 2D point in the image. (2) We lift this 2D point into 3D, compute its relative displacement to the gripper, and encode the displacement through a two-layer MLP into a spatial embedding. (3) The spatial embedding is injected directly into the flow-matching action… view at source ↗

**Figure 3.** Figure 3: Comparison of all variants at 20-task and 69-task training scales. Our method trained on only 20 tasks already surpasses every other method trained on 69 tasks, indicating a well-chosen design outweighs data scaling. 4.5 Robustness to Noisy Depth Inputs We evaluate robustness of our method to depth noise by adding Gaussian noise (σ = 0.01 and 0.03 m) to depth during training and testing [PITH_FULL_IMAGE:f… view at source ↗

**Figure 4.** Figure 4: Depth Noise Experiments. Training with noisy depth causes a slight performance drop but still outperforms the baseline, and our method is tolerant to depth noise at test time. Effect of Training-Time Depth Noise. Training with noisy depth causes moderate degradation compared to the clean-depth variant, as noise propagates to errors in the estimated 3D positions. Nevertheless, both noisy-training variants… view at source ↗

**Figure 5.** Figure 5: Real-world experiment. Our method remains effective under both Task and Position Perturbation, while baselines collapse to near-zero success. Platform. We use a Franka Emika Panda robot with two RealSense D435 cameras to capture a front view (RGB-D) and a wrist-mounted view (RGB). Tasks and Evaluation. We collect 8 pickand-place tasks and 40 demonstrations per task for training. Each task is evaluated u… view at source ↗

**Figure 6.** Figure 6: Example pick-and-place demo showing the target object’s and gripper’s 2D and 3D coordinates across the two different phases. In this subsection, we provide the implementation details of each variant evaluated in Sec. 4.3. In [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Illustration of the Visual Prompt variant. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Real-world experiment task setup. world setup relies entirely on an off-the-shelf vision-language model: we apply a Qwen3-VL-4B directly with no fine-tuning. Given the task instruction and the front-view RGB image, we query the model with a single prompt that asks it to localize both sub-goal targets at once—the object to be picked up (role target) and the region or object where it should be placed (role p… view at source ↗

**Figure 9.** Figure 9: Qualitative results on LIBERO and LIBERO-PRO. The baseline succeeds Indistribution but fails under Task Perturbation and Position Perturbation, whereas our method completes the task in all three conditions. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative results in the real-world setting. The baseline succeeds In-distribution but fails under Task Perturbation and Position Perturbation, whereas our method completes the task in all three conditions. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models leverage large-scale vision-language pretraining for flexible robot manipulation, yet at test time they remain brittle along two axes: spatial generalization, when object positions differ from those seen during training, and task generalization, when a familiar scene is paired with a different language instruction than the one seen in training. A growing family of methods addresses this brittleness by endowing a policy with the spatial and task-aware information such as 2D pixel-coordinate for object localization and placement. However, we find that existing representation through language prompting or visual prompting does not address the limitations; in contrast, exploiting a 3D point-based representation and feeding it directly to the action head leads to substantial improvements-revealing that how the grounding signal is represented and injected into the VLA is the true game changer. Thus, we propose a lightweight, model-agnostic module that represents the grounding signal in 3D, computes its relative displacement to the gripper, and injects the resulting spatial embedding directly into the action head through adaptive layer normalization. The entire module is a two-layer MLP that requires no changes to the VLA backbone or pretraining pipeline. On LIBERO-PRO, our method improves the average success rate of GR00T-N1.6 from 31.2 to 77.5 points under task perturbation and from 28.1 to 60.2 points under position perturbation (gains of 46.3 and 32.1 points). Comparable gains are achieved for $\pi_{0.5}$ as well, demonstrating that the mechanism is backbone-agnostic. Together, these results support our central finding: given adequate grounding lifted into 3D, injecting it directly into the action head is what unlocks both spatial and task generalization in VLAs-achievable with nothing more than a lightweight module on top of a pretrained backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports large gains on LIBERO-PRO from injecting a 3D relative gripper-target displacement into the action head via a small MLP and AdaLN, but the result rests on having perfect 3D points at test time.

read the letter

The main point is that adding a lightweight module which turns a 3D target point into a relative displacement vector and feeds it directly into the action head through adaptive layer norm produces big reported jumps in success rate on LIBERO-PRO for both GR00T-N1.6 and pi_0.5. The numbers are 46-point and 32-point gains under task and position perturbations.

What is actually new is the combination of 3D representation plus direct action-head injection rather than language or 2D visual prompting. The paper shows this works on top of existing pretrained backbones with no changes to pretraining, which is a practical advantage. The results demonstrate that where and how the grounding signal enters the model can matter more than the signal itself.

The method is simple and model-agnostic, so the core idea is easy to test. That counts as a clear positive.

The soft spots are straightforward. The abstract conditions everything on "adequate grounding" lifted into 3D, yet gives no experiments on sensor noise, occlusion, or how the 3D point is obtained in the first place. No error bars or detailed ablations appear in the provided text, so it is hard to judge how stable the gains are. The stress-test note is right on this: the headline result assumes an oracle-accurate 3D point whose 3-DOF displacement alone is enough. Without tests under realistic conditions, the generalization claim stays conditional.

This is for people working on VLA policies who need better spatial and task robustness. A reader already running these models would find the injection technique worth trying.

It deserves a serious referee because the empirical gains are large and the change is minimal, even if the evaluation needs more controls on the 3D assumption. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper claims that existing 2D prompting methods fail to address spatial and task generalization brittleness in VLAs, but a lightweight module that lifts grounding to a 3D point, computes its relative 3-DOF displacement to the gripper, and injects the embedding directly into the action head via a two-layer MLP and AdaLN yields large gains. On LIBERO-PRO, GR00T-N1.6 improves from 31.2 to 77.5 under task perturbation and 28.1 to 60.2 under position perturbation (gains of 46.3 and 32.1); comparable gains hold for π_{0.5}. The module requires no backbone changes or retraining, supporting the claim that direct action-head injection of adequate 3D grounding is the key mechanism.

Significance. If reproducible, the result would indicate that the representation and injection point of spatial grounding matter more than the grounding source itself, offering a simple, backbone-agnostic path to better generalization in VLAs. The explicit conditioning on 'adequate grounding' and the model-agnostic design are clear strengths that could be adopted quickly. However, the significance is limited by the absence of robustness tests against the oracle-accurate 3D point assumption.

major comments (2)

[Abstract] Abstract: the headline gains (46.3 and 32.1 points) are reported without error bars, standard deviations, number of seeds, or full hyperparameter details, which is load-bearing for the empirical generalization claim.
[Abstract] Abstract and method description: the central claim that relative 3-DOF displacement alone suffices for both spatial and task generalization is conditioned on 'adequate grounding' but no experiments or analysis address degradation under realistic 3D sensor noise, partial occlusion, or grounding error, leaving the weakest assumption untested.

minor comments (1)

[Abstract] Abstract: the sentence 'endowing a policy with the spatial and task-aware information such as 2D pixel-coordinate' is grammatically incomplete and should be clarified when contrasting with the proposed 3D method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical reporting and the scope of our claims. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the headline gains (46.3 and 32.1 points) are reported without error bars, standard deviations, number of seeds, or full hyperparameter details, which is load-bearing for the empirical generalization claim.

Authors: We agree that statistical details are essential for supporting the generalization claims. The reported numbers are averages across evaluation episodes, but error bars, standard deviations, seed counts, and expanded hyperparameter information were omitted from the abstract in the initial submission. In the revised manuscript we will include these details (e.g., results over 3 seeds with standard deviations) either in the abstract or a new supplementary table referenced from the abstract. revision: yes
Referee: [Abstract] Abstract and method description: the central claim that relative 3-DOF displacement alone suffices for both spatial and task generalization is conditioned on 'adequate grounding' but no experiments or analysis address degradation under realistic 3D sensor noise, partial occlusion, or grounding error, leaving the weakest assumption untested.

Authors: The manuscript explicitly qualifies the claim with the phrase 'given adequate grounding' in both the abstract and conclusion, and the experiments deliberately use oracle 3D points to isolate the contribution of the injection mechanism itself. We acknowledge that robustness under realistic perception noise is an important untested dimension. Because new experiments with simulated sensor noise or occlusion would require additional data collection and are outside the current scope, we will add an explicit limitations paragraph discussing this assumption and listing it as future work. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical method with no derivations or self-referential reductions

full rationale

The paper describes a lightweight module for 3D point injection into VLA action heads and reports empirical success-rate gains on LIBERO-PRO benchmarks. No equations, parameter-fitting steps, uniqueness theorems, or self-citations appear in the abstract or described content. The central claim rests on experimental outcomes under the stated assumption of adequate 3D grounding rather than any reduction to fitted inputs or self-definitional constructs. This is the most common honest finding for empirical robotics papers without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the availability of accurate 3D grounding points and the sufficiency of relative gripper displacement; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)

domain assumption An accurate 3D point representing the target object is available at test time
The method requires this input to compute the relative displacement embedding.

pith-pipeline@v0.9.1-grok · 5914 in / 1239 out tokens · 35507 ms · 2026-06-29T05:02:47.381188+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 13 canonical work pages · 10 internal anchors

[1]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024
[3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=ZMnD6QZAE6

2024
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control, 2026. URLhttps://arxiv. o...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

2025
[6]

Bjorck, N

NVIDIA, J. Bjorck, N. C. Fernando Casta ˜neda, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, ...

2025
[7]

X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization.[arXiv preprint arXiv:2510.03827], 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. 9

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

W. Yuan, J. Duan, V . Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox. Robopoint: A vision-language model for spatial affordance prediction in robotics. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id= GVX6jpZOhU

2024
[10]

Huang, X

H. Huang, X. Chen, Y . Chen, H. Li, X. Han, Z. Wang, T. Wang, J. Pang, and Z. Zhao. Roboground: Robotic manipulation with grounded vision-language priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22540–22550, June 2025

2025
[11]

Deshpande, M

A. Deshpande, M. Guru, R. Hendrix, S. Jauhri, A. Eftekhar, R. Tripathi, M. Argus, J. Salvador, H. Fang, M. Wallingford, W. Pumacay, Y . Kim, Q. Pfeifer, Y .-C. Lee, P. Wolters, O. Rayyan, M. Zhang, J. Duan, K. Farley, W. Han, E. Vanderbilt, D. Fox, A. Farhadi, G. Chalvatzaki, D. Shah, and R. Krishna. Molmob0t: Large-scale simulation enables zero-shot mani...
[12]

URLhttps://arxiv.org/abs/2603.16861

work page arXiv
[13]

K. Fang, F. Liu, P. Abbeel, and S. Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. InRobotics: Science and Systems, 2024. URLhttp://dblp. uni-trier.de/db/conf/rss/rss2024.html#FangLAL24

2024
[14]

Z. Wang, Y . Chen, Y . Liu, J. Ye, P. Chen, C. Lu, S. Liu, and J. Jia. Vp-vla: Visual prompting as an interface for vision-language-action models.arXiv preprint arXiv:2603.22003, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

J. Li, Y . Zhu, Z. Tang, J. Wen, M. Zhu, X. Liu, C. Li, R. Cheng, Y . Peng, Y . Peng, and F. Feng. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9759–9769, October 2025

2025
[16]

J. Lee, J. Duan, H. Fang, Y . Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y . R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URLhttps: //arxiv.org/abs/2508.07917

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Zawalski, W

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=S70MgnIA0v

2024
[18]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control, 2026. URLhttps://arxiv.org/abs/2602.13193

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, October 2023

2023
[21]

Perez, F

E. Perez, F. Strub, H. de Vries, V . Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI, 2018

2018
[22]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Bench- marking knowledge transfer for lifelong robot learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural In- formation Processing Systems, volume 36, pages 44776–44791. Curran Associates, Inc.,
[23]

URLhttps://proceedings.neurips.cc/paper_files/paper/2023/file/ 8c3c666820ea055a77726d66fc7d447f-Paper-Datasets_and_Benchmarks.pdf. 10

2023
[24]

O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Her- zog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakr- ishna, A. W...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

2024
[26]

Y . Li, Y . Deng, J. Zhang, J. Jang, M. Memmel, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. HAMSTER: Hierarchical action models for open-world robot manipulation. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps: //openreview.net/forum?id=h7aQxzKbq6. 11

2025
[27]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-VLA: A 3d vision- language-action generative world model. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=EZcFK8HupF

2024
[28]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, J. Gu, Z. Wang, Y . Ding, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models. InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.011

work page doi:10.15607/rss.2025.xxi.011 2025
[29]

C. Li, J. Wen, Y . Peng, Y . Peng, and Y . Zhu. Pointvla: Injecting the 3d world into vision- language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026. doi: 10.1109/LRA.2026.3653303

work page doi:10.1109/lra.2026.3653303 2026
[30]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations,
[31]

URLhttps://openreview.net/forum?id=fzmittHfq3
[32]

S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, H. Cui, Z. Zhang, and H. Wang. GraspVLA: a grasping foundation model pre-trained on billion-scale synthetic action data. In9th Annual Conference on Robot Learning, 2025. URLhttps: //openreview.net/forum?id=zEC8TOXDkH

2025
[33]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Borgefors

G. Borgefors. Distance transformations in digital images.Computer Vision, Graphics, and Image Processing, 34(3):344–371, 1986. ISSN 0734-189X. doi:https://doi.org/10.1016/ S0734-189X(86)80047-0. URLhttps://www.sciencedirect.com/science/article/ pii/S0734189X86800470

1986
[35]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. InProceedings of Robotics: Science and Systems (RSS), 2024. 12 Appendix A Table of Contents

2024
[36]

B providesImplementation Detailsfor lifting a 2D point into 3D and encoding the spatial embedding, injecting it into the VLA, and each variant evaluated in Sec

Sec. B providesImplementation Detailsfor lifting a 2D point into 3D and encoding the spatial embedding, injecting it into the VLA, and each variant evaluated in Sec. 4.3 of the main paper
[37]

Sec. C providesSimulation Experimental Details, including the training tasks, how we obtain 2D grounding in simulation, the model training configurations, and the full experi- mental results
[38]

Sec. D providesReal-World Experimental Details, including the task and evaluation setup, how we obtain 2D grounding in real world, the model training configurations, and the full experimental results
[39]

u v 1 # =

Sec. E providesQualitative Resultsin both simulation and real-world settings. B Implementation Details In this section, we provide implementation details for: (1) lifting a 2D point into 3D and encoding the spatial embedding (Sec. B.1); (2) injecting the spatial embedding into the VLA (Sec. B.2); and (3) each variant evaluated in Sec. 4.3 of the main pape...
[40]

The object that needs to be picked up (role: "target")
[41]

placement

The location or object where it should be placed (role: "placement") Output ONLY a JSON array with exactly two entries, no extra text: [ {{"point 2d": [x, y], "label": "<object name>", "role": "target"}}, {{"point 2d": [x, y], "label": "<object name>", "role": "placement"}} ] We parse the returned JSON and rescale the normalized coordinates (in[0,1000]) t...

[1] [1]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024

[3] [3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=ZMnD6QZAE6

2024

[4] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control, 2026. URLhttps://arxiv. o...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, brian ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Wa...

2025

[6] [6]

Bjorck, N

NVIDIA, J. Bjorck, N. C. Fernando Casta ˜neda, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, ...

2025

[7] [7]

X. Zhou, Y . Xu, G. Tie, Y . Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization.[arXiv preprint arXiv:2510.03827], 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, J. Fu, J. Gong, and X. Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

W. Yuan, J. Duan, V . Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox. Robopoint: A vision-language model for spatial affordance prediction in robotics. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id= GVX6jpZOhU

2024

[10] [10]

Huang, X

H. Huang, X. Chen, Y . Chen, H. Li, X. Han, Z. Wang, T. Wang, J. Pang, and Z. Zhao. Roboground: Robotic manipulation with grounded vision-language priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22540–22550, June 2025

2025

[11] [11]

Deshpande, M

A. Deshpande, M. Guru, R. Hendrix, S. Jauhri, A. Eftekhar, R. Tripathi, M. Argus, J. Salvador, H. Fang, M. Wallingford, W. Pumacay, Y . Kim, Q. Pfeifer, Y .-C. Lee, P. Wolters, O. Rayyan, M. Zhang, J. Duan, K. Farley, W. Han, E. Vanderbilt, D. Fox, A. Farhadi, G. Chalvatzaki, D. Shah, and R. Krishna. Molmob0t: Large-scale simulation enables zero-shot mani...

[12] [12]

URLhttps://arxiv.org/abs/2603.16861

work page arXiv

[13] [13]

K. Fang, F. Liu, P. Abbeel, and S. Levine. Moka: Open-world robotic manipulation through mark-based visual prompting. InRobotics: Science and Systems, 2024. URLhttp://dblp. uni-trier.de/db/conf/rss/rss2024.html#FangLAL24

2024

[14] [14]

Z. Wang, Y . Chen, Y . Liu, J. Ye, P. Chen, C. Lu, S. Liu, and J. Jia. Vp-vla: Visual prompting as an interface for vision-language-action models.arXiv preprint arXiv:2603.22003, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[15] [15]

J. Li, Y . Zhu, Z. Tang, J. Wen, M. Zhu, X. Liu, C. Li, R. Cheng, Y . Peng, Y . Peng, and F. Feng. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9759–9769, October 2025

2025

[16] [16]

J. Lee, J. Duan, H. Fang, Y . Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y . R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URLhttps: //arxiv.org/abs/2508.07917

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Zawalski, W

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=S70MgnIA0v

2024

[18] [18]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control, 2026. URLhttps://arxiv.org/abs/2602.13193

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, October 2023

2023

[21] [21]

Perez, F

E. Perez, F. Strub, H. de Vries, V . Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI, 2018

2018

[22] [22]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Bench- marking knowledge transfer for lifelong robot learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural In- formation Processing Systems, volume 36, pages 44776–44791. Curran Associates, Inc.,

[23] [23]

URLhttps://proceedings.neurips.cc/paper_files/paper/2023/file/ 8c3c666820ea055a77726d66fc7d447f-Paper-Datasets_and_Benchmarks.pdf. 10

2023

[24] [24]

O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Her- zog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakr- ishna, A. W...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

2024

[26] [26]

Y . Li, Y . Deng, J. Zhang, J. Jang, M. Memmel, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. HAMSTER: Hierarchical action models for open-world robot manipulation. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps: //openreview.net/forum?id=h7aQxzKbq6. 11

2025

[27] [27]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-VLA: A 3d vision- language-action generative world model. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=EZcFK8HupF

2024

[28] [28]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, J. Gu, Z. Wang, Y . Ding, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models. InProceedings of Robotics: Science and Systems, LosAngeles, CA, USA, June 2025. doi: 10.15607/RSS.2025.XXI.011

work page doi:10.15607/rss.2025.xxi.011 2025

[29] [29]

C. Li, J. Wen, Y . Peng, Y . Peng, and Y . Zhu. Pointvla: Injecting the 3d world into vision- language-action models.IEEE Robotics and Automation Letters, 11(3):2506–2513, 2026. doi: 10.1109/LRA.2026.3653303

work page doi:10.1109/lra.2026.3653303 2026

[30] [30]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations,

[31] [31]

URLhttps://openreview.net/forum?id=fzmittHfq3

[32] [32]

S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, H. Cui, Z. Zhang, and H. Wang. GraspVLA: a grasping foundation model pre-trained on billion-scale synthetic action data. In9th Annual Conference on Robot Learning, 2025. URLhttps: //openreview.net/forum?id=zEC8TOXDkH

2025

[33] [33]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Borgefors

G. Borgefors. Distance transformations in digital images.Computer Vision, Graphics, and Image Processing, 34(3):344–371, 1986. ISSN 0734-189X. doi:https://doi.org/10.1016/ S0734-189X(86)80047-0. URLhttps://www.sciencedirect.com/science/article/ pii/S0734189X86800470

1986

[35] [35]

C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. InProceedings of Robotics: Science and Systems (RSS), 2024. 12 Appendix A Table of Contents

2024

[36] [36]

B providesImplementation Detailsfor lifting a 2D point into 3D and encoding the spatial embedding, injecting it into the VLA, and each variant evaluated in Sec

Sec. B providesImplementation Detailsfor lifting a 2D point into 3D and encoding the spatial embedding, injecting it into the VLA, and each variant evaluated in Sec. 4.3 of the main paper

[37] [37]

Sec. C providesSimulation Experimental Details, including the training tasks, how we obtain 2D grounding in simulation, the model training configurations, and the full experi- mental results

[38] [38]

Sec. D providesReal-World Experimental Details, including the task and evaluation setup, how we obtain 2D grounding in real world, the model training configurations, and the full experimental results

[39] [39]

u v 1 # =

Sec. E providesQualitative Resultsin both simulation and real-world settings. B Implementation Details In this section, we provide implementation details for: (1) lifting a 2D point into 3D and encoding the spatial embedding (Sec. B.1); (2) injecting the spatial embedding into the VLA (Sec. B.2); and (3) each variant evaluated in Sec. 4.3 of the main pape...

[40] [40]

The object that needs to be picked up (role: "target")

[41] [41]

placement

The location or object where it should be placed (role: "placement") Output ONLY a JSON array with exactly two entries, no extra text: [ {{"point 2d": [x, y], "label": "<object name>", "role": "target"}}, {{"point 2d": [x, y], "label": "<object name>", "role": "placement"}} ] We parse the returned JSON and rescale the normalized coordinates (in[0,1000]) t...