Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Dekun Lu; Huayi Zhou; Jian Chen; Kangyi Guo; Kui Jia; Ruiji Liu; Sheng Xu; Shichen Xu; Shumin Li; Wei Gao

arxiv: 2606.02274 · v2 · pith:QZGXX62Bnew · submitted 2026-06-01 · 💻 cs.RO

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Huayi Zhou , Wei Gao , Dekun Lu , Ruiji Liu , Zhanqi Zhang , Ziyang Zhang , Jian Chen , Wenlve Zhou

show 7 more authors

Sheng Xu Shumin Li Kangyi Guo Shichen Xu Zixin Huang Yongyi Su Kui Jia

This is my paper

Pith reviewed 2026-06-28 14:40 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotic manipulationBEV representation3D alignmentgeneralizationvision-language modelsvertex mapdata pipelinespatial-temporal alignment

0 comments

The pith

Mapping pixel-wise 3D information and robot actions into a shared Bird's-Eye-View frame creates view-invariant inputs that support more generalizable manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix two problems in robot manipulation policies that use 2D vision-language models: they ignore the 3D nature of the task and they lack consistent spatial alignment between what the robot sees and what it does. By creating a pixel-wise 3D representation called aligned vertex map and then projecting both the visual data and the actions into the same top-down BEV coordinate system, the method produces representations that stay the same no matter the camera angle or the robot body. A reader would care if this holds because it could let one policy work on many different robots and camera setups without retraining from scratch. The authors also supply a data pipeline to prepare large datasets with these alignments and a way to line up trajectories over time.

Core claim

The authors claim that constructing BEV images from per-pixel 3D vertex information and aligning actions to the same canonical frame mitigates spatial-temporal misalignments, resulting in improved consistency and generalization for real-world robotic manipulation across diverse embodiments and camera setups.

What carries the argument

The canonical Bird's-Eye-View (BEV) frame, which serves as a shared coordinate system for both 3D visual inputs and robot actions to achieve view-invariance.

If this is right

Training policies on BEV-aligned data reduces sensitivity to variations in camera pose.
The temporal alignment scheme allows combining trajectories from different robots and human operators.
Pretrained VLMs can be used with 3D awareness without losing their generalization benefits.
The data processing pipeline supports scaling up training to larger datasets with proper alignments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If successful, this method might allow policies to transfer more easily between simulated and real environments.
The BEV construction could be applied to other perception tasks in robotics beyond manipulation.
It suggests that many current 2D-based approaches could benefit from similar 3D-to-shared-frame lifts.

Load-bearing premise

Expressing per-pixel 3D information and robot actions into a shared canonical BEV frame will produce a view-invariant representation robust to camera pose variations and embodiment differences.

What would settle it

Testing whether policies trained using the BEV alignment show significantly lower success rates when evaluated with unseen camera positions or different robot arms compared to the training conditions.

Figures

Figures reproduced from arXiv: 2606.02274 by Dekun Lu, Huayi Zhou, Jian Chen, Kangyi Guo, Kui Jia, Ruiji Liu, Sheng Xu, Shichen Xu, Shumin Li, Wei Gao, Wenlve Zhou, Yongyi Su, Zhanqi Zhang, Zixin Huang, Ziyang Zhang.

**Figure 2.** Figure 2: (a) We propose to construct BEV images and associated vertex maps towards invariance to [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: 3D spatial alignment in our data processing pipeline. (a) We develop a customized GUI application for 3D alignment and visualization, as explained in Subsec. 3.4. In (a-f), we show the 3D alignment of representative public and internal datasets, including (a) LIBERO [28], (b) AgibotAlpha/Beta [54], (c) RoboTwin 2.0 [55], (d) RoboMind 2.0 [37] and our internal datasets (e-f). We also apply an unified TCP f… view at source ↗

**Figure 4.** Figure 4: Training loss comparison. This corresponds to Tab. 2. We further conduct simulated experiments to exanimate the generalization to different camera view points and environment layouts. To achieve this, we modify the setup on the LIBERO [28] datasets and platforms. In particular, for each trajectory we randomly modify the third-view camera pose by placing it at different distance and rotating it relative to… view at source ↗

**Figure 5.** Figure 5: Qualitative real-world rollouts across different longhorizon complex tasks. Distinct keyframes demonstrate successful autonomous executions on diverse bimanual robotic platforms involving articulated, deformable, and granular objects (from left to right): Fold Mailer Box and Fold Cloth on Agilex, Scoop Popcorn and Handover Book on the DexForce W1 humanoid, and Fold Cloth on the DexForce A1 semi-humanoid.… view at source ↗

**Figure 6.** Figure 6: Hardware, Platforms and Teleoperation Data Collection Interfaces. From left to right: (a) the Agilex dual-arm robot platform using grippers, (b) the W1 humanoid robot platform using dexterous hands, (c) the W1 humanoid robot platform using grippers, and (d) the A1 semi-humanoid robot platform using grippers. A.1 Agilex Bimanual Setup: Fold Mailer Box & Fold Cloth Hardware and Sensory Configuration: Followi… view at source ↗

**Figure 7.** Figure 7: Examples of modified LIBERO. From top to bottom, these represent observations with no parameters modified, observations with modifications to the robot and workstation, and observations with only the camera pose modified, respectively. C.1 Simulation Benchmarks and Ablation Studies For the LIBERO [28] and RoboTwin 2.0 [55] benchmarks, we first use the official setups regarding robots, training data and ev… view at source ↗

**Figure 8.** Figure 8: Real-world bimanual platform and task execution. (Left) Configuration of the Agilex bimanual robotic platform used for data collection and inference. (Right) Detailed view of target objects and sequential keyframes from autonomous rollouts of two challenging long-horizon tasks: Fold Mailer Box (articulated) and Fold Cloth (deformable). These snapshots illustrate DexBEV’s capability in handling complex spa… view at source ↗

**Figure 9.** Figure 9: Rollouts of Self-Recovery and Orientation Invariance. For task Fold Mailer Box, we demonstrate the qualitative results under the OOD trails. It is best to zoom in to view the details. (2) Continuous Operation Facility: Still for the task Fold Mailer Box, we demonstrate the policy’s capacity for continuous, multi-cycle operation in the Supplementary Videos. After completing a box, the right arm clears the… view at source ↗

**Figure 10.** Figure 10: Rollouts of Zero-Shot Instance Generalization. For task Fold Cloth, we demonstrate the qualitative results under the OOD trails. It is best to zoom in to view the details. C.3 DexForce W1 Humanoid Evaluations: Scoop Popcorn Multi-View Keyframe Breakdowns [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Bimanual humanoid platform and long-horizon task execution. (Left) Detailed configuration of the DexForce W1 humanoid robot with two dexterous hands and its operation environment. (Right) Multi-view keyframes showcasing the autonomous rollout of the Scoop Popcorn task. This complex, long-horizon sequence requires fine-grained bimanual coordination to manipulate the paper cup while simultaneously scoopi… view at source ↗

**Figure 12.** Figure 12: Robustness to dynamic interference. Sequential snapshots from the cup-grasping phase of the popcorn scooping task on the DexForce W1 platform. The images demonstrate the model’s real-time reactivity. Despite random manual displacements of the target cup by two different users, Dex-BEV successfully recalibrates the motion trajectory to achieve a successful grasp. This highlights the closed-loop robustness… view at source ↗

**Figure 13.** Figure 13: Multi-modal interactive handover book task. (Left) DexForce W1 humanoid platform with two grippers and its workspace setup. (Right) Successive keyframes of the robot executing a Handover Book task conditioned on different language instructions. The sequences highlight DexBEV’s ability to interpret color-specific commands and perform precise, interactive maneuvers for handing over target objects to a huma… view at source ↗

**Figure 14.** Figure 14: documents the cloth folding task executed on the A1 semi-humanoid platform, showing Dex-BEV’s capability about the distinct state segmentation and human-like rollout trajectories. The policy successfully handles two distinct structural initialization states: a flat canonical layout and an unconstrained, crumpled pile. Dex-BEV accurately segments the task phases, flattening the crumpled garment prior to e… view at source ↗

read the original abstract

End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce aligned vertex map and vertex spectrum -- a pixel-wise 3D representation that elevates 2D visual inputs to 3D, using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of 2D large VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing per-pixel 3D information of each camera view and robot actions to a shared coordinate. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and innovatively propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization for real-world manipulation. Pretrained checkpoint, source code and data processing pipeline are available in https://hnuzhy.github.io/projects/Dex-BEV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The BEV alignment and vertex map ideas address real 3D misalignment issues in VLM robot policies, but the embodiment invariance claim looks under-supported without kinematic handling.

read the letter

The main new pieces are the aligned vertex map plus vertex spectrum as a pixel-wise 3D lift from 2D inputs, and the canonical BEV frame that puts both visual features and robot actions into one shared coordinate with a temporal trajectory alignment step. These are presented as concrete constructs rather than restatements of prior work.

The paper does well on the engineering side by laying out a full data processing pipeline for multi-robot, multi-camera, multi-dataset alignment and by releasing code, checkpoints, and the pipeline. That lowers the barrier for anyone wanting to test or extend the approach.

The soft spot is the assumption that coordinate-frame alignment alone produces robustness to embodiment differences. Different robots vary in kinematics, joint limits, and action dimensionality; the description does not spell out an explicit forward-kinematics step or adapter beyond the shared BEV transform. If that gap is not closed in the experiments, the cross-embodiment generalization benefit will be narrower than stated. Soundness is hard to judge from the given material, but the logic is internally consistent and the citation pattern looks standard.

This is for researchers working on scalable, cross-robot manipulation policies. Readers focused on 3D representations or VLM grounding would find the concrete representations worth examining. It deserves a serious referee to evaluate the experiments and check whether the kinematic concern actually materializes.

Referee Report

1 major / 0 minor

Summary. The paper claims that end-to-end manipulation policies using 2D VLMs suffer from ignoring 3D structure and lacking spatial alignment across inputs, outputs, robots, cameras, and datasets. It introduces aligned vertex map and vertex spectrum as a pixel-wise 3D input representation derived from camera calibration and optional depth; aligns per-pixel 3D information and robot actions into a shared coordinate system; constructs canonical BEV images for view-invariant representations; develops a data processing pipeline for alignments at scale; and proposes a novel temporal alignment scheme for trajectories across robots and human operators. These steps are asserted to mitigate spatial-temporal misalignments and improve consistency and generalization, with code, checkpoint, and pipeline released.

Significance. If the BEV alignment produces the claimed invariance, the work could meaningfully advance scalable, generalizable robot manipulation by bridging 2D foundation models with explicit 3D and action alignment. The public release of pretrained checkpoint, source code, and data pipeline is a clear strength for reproducibility and follow-on work.

major comments (1)

[Abstract] Abstract: The central claim that mapping per-pixel 3D inputs and robot actions into a shared canonical BEV frame yields a representation 'robust to ... embodiment differences' rests on an unverified assumption. The description provides no mechanism (e.g., forward kinematics, learned adapters, or normalization of joint limits/action dimensionality) to resolve kinematic differences across embodiments; coordinate transformation alone does not guarantee invariance when action spaces differ in structure and semantics. This assumption is load-bearing for the generalization claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and constructive comment. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that mapping per-pixel 3D inputs and robot actions into a shared canonical BEV frame yields a representation 'robust to ... embodiment differences' rests on an unverified assumption. The description provides no mechanism (e.g., forward kinematics, learned adapters, or normalization of joint limits/action dimensionality) to resolve kinematic differences across embodiments; coordinate transformation alone does not guarantee invariance when action spaces differ in structure and semantics. This assumption is load-bearing for the generalization claim.

Authors: We agree the abstract is brief and does not spell out the action transformation details. The full manuscript (Section 3.2 and the data pipeline in Section 4) describes expressing robot actions in the shared canonical BEV frame via forward kinematics and camera extrinsics, so that the policy outputs 3D end-effector deltas rather than joint commands. This produces a common action representation across embodiments. Experiments in Section 5 demonstrate cross-robot generalization using this alignment. We will add a concise sentence to the abstract and a short clarifying paragraph in Section 3.2 in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: methodological design choices presented without self-referential reduction

full rationale

The paper's abstract and description outline a sequence of engineering contributions: a pixel-wise 3D vertex map representation, alignment of inputs/outputs to a shared coordinate, construction of canonical BEV images, a data processing pipeline, and a temporal alignment scheme. None of these steps are shown to reduce to their own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are quoted that would create a definitional loop. The invariance claim is framed as the result of the proposed alignment procedure itself, not derived from prior author work in a circular manner. This is a standard non-circular description of a new method pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that 3D alignment via camera calibration plus a canonical BEV frame will remove misalignments across embodiments and cameras; no free parameters or invented entities with independent evidence are described.

axioms (1)

domain assumption Aligning per-pixel 3D information and robot actions to a shared canonical BEV frame produces a view-invariant representation robust to camera pose variations.
Invoked in the abstract as the justification for constructing BEV images.

invented entities (2)

aligned vertex map and vertex spectrum no independent evidence
purpose: Pixel-wise 3D representation that elevates 2D visual inputs to 3D using camera calibration and optional depth.
New representation introduced to marry 3D awareness with 2D VLM generalization.
BEV images no independent evidence
purpose: View-invariant representation for inputs and outputs.
Constructed from the aligned 3D information.

pith-pipeline@v0.9.1-grok · 5889 in / 1381 out tokens · 22204 ms · 2026-06-28T14:40:04.502190+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 10 canonical work pages

[1]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research (IJRR), 44(10-11):1684–1704, 2025

2025
[2]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

2016
[3]

Y . Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kram ´ar, R. Hadsell, N. de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills.arXiv preprint, 2018

2018
[4]

Achiam, S

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[5]

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[6]

Touvron, T

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[7]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2- vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024
[8]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023

2023
[9]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024

2024
[10]

L. Maes, Q. L. Lidec, D. Scieur, Y . LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint- embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026
[11]

Y . Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

Pith/arXiv arXiv 2025
[12]

K. Team, J. Chen, Y . Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

Pith/arXiv arXiv 2025
[13]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025

2025
[14]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.010. 9

work page doi:10.15607/rss.2025.xx.010 2025
[15]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InThe Thirteenth International Conference on Learning Representa- tions (ICLR), volume 2025, pages 29982–30009, 2025

2025
[16]

RT-1: robotics transformer for real-world control at scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalash- nikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....

work page doi:10.15607/rss.2023.xix.025 2023
[17]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt- 2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023

2023
[18]

Belkhale, T

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y . Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.049

work page doi:10.15607/rss.2024.xx.049 2024
[19]

A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025
[20]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026
[21]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026
[22]

G. Team, B. Wang, B. Li, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099, 2026

arXiv 2026
[23]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision-language- action generative world model. InInternational Conference on Machine Learning (ICML), pages 61229– 61245. PMLR, 2024

2024
[24]

X. Li, L. Heng, J. Liu, Y . Shen, C. Gu, Z. Liu, H. Chen, N. Han, R. Zhang, H. Tang, et al. 3ds-vla: A 3d spatial-aware vision language action model for robust multi-task manipulation. InConference on Robot Learning (CoRL), pages 2344–2359. PMLR, 2025

2025
[25]

L. Sun, B. Xie, Y . Liu, H. Shi, T. Wang, and J. Cao. Geovla: Empowering 3d representations in vision- language-action models.arXiv preprint arXiv:2508.09071, 2025

arXiv 2025
[26]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=fzmittHfq3

2026
[27]

X. Fan, S. Deng, X. Wu, Y . Lu, Z. Li, M. Yan, Y . Zhang, Z. Zhang, H. Wang, and H. Zhao. Any3d-vla: Enhancing vla robustness via diverse point clouds.arXiv preprint arXiv:2602.00807, 2026

Pith/arXiv arXiv 2026
[28]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems (NeurIPS), 36:44776– 44791, 2023

2023
[29]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025

2025
[30]

H. Lin, S. Chen, J. H. Liew, D. Y . Chen, Z. Li, Y . Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id= yirunib8l8

2026
[31]

Y . Liu, T. Wang, X. Zhang, and J. Sun. Petr: Position embedding transformation for multi-view 3d object detection. InEuropean Conference on Computer Vision (ECCV), pages 531–548. Springer, 2022. 10

2022
[32]

Y . Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3262–3272, 2023

2023
[33]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InProceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023
[34]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), pages 4066–4083. PMLR, 2025

2025
[35]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Man- dlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. InIEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[36]

DROID: A large-scale in-the-wild robot manipulation dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.120

work page doi:10.15607/rss.2024.xx.120 2024
[37]

K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y . Zhao, Z. Xu, G. Yang, S. Fan, X. Wang, F. Liao, Z. Zhao, G. Li, Z. Jin, L. Wang, J. Mao, N. Liu, P. Ren, Q. Zhang, Y . Lyu, M. Liu, H. Jingyang, Y . Luo, Z. Gao, C. Li, C. Gu, Y . Fu, D. Wu, X. Wang, S. Chen, Z. Wang, P. An, S. Qian, S. Zhang, and J. Tang. Robomind: Benchmark on multi-embodiment in...

2025
[38]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.090

work page doi:10.15607/rss.2024.xx.090 2024
[39]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, T. Wang, Y .-Q. Zhang, J. Liu, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision- language-action model. InThe Fourteenth International Conference on Learning Representations (ICLR),
[40]

URLhttps://openreview.net/forum?id=kt51kZH4aG
[41]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π 0.5: a vision-language-action model with open-world generalization. InConference on Robot Learning (CoRL), pages 17–40. PMLR, 2025

2025
[42]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026
[43]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, J. Gu, Z. Wang, Y . Ding, B. Zhao, D. Wang, and X. Li. Spatialvla: Exploring spatial representations for visual-language-action models. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.011

work page doi:10.15607/rss.2025.xx.011 2025
[44]

T. Yuan, Y . Liu, C. Lu, Z. Chen, T. Jiang, and H. Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

arXiv 2025
[45]

S. Deng, M. Yan, Y . Zheng, J. Su, W. Zhang, X. Zhao, H. Cui, Z. Zhang, and H. Wang. Stereovla: Enhancing vision-language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

arXiv 2025
[46]

C. Li, J. Wen, Y . Peng, Y . Peng, and Y . Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters (RAL), 11(3):2506–2513, 2026

2026
[47]

Q. Yu, X. Yuan, Y . Jiang, J. Chen, D. Zheng, C. Hao, Y . You, Y . Chen, Y . Mu, L. Liu, et al. Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13170–13177. IEEE, 2025

2025
[48]

F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. ZENG, and H. Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=euMVC1DO4k

2026
[49]

P. Li, Y . Chen, H. Wu, X. Ma, X. Wu, Y . Huang, L. Wang, T. Kong, and T. Tan. Bridgevla: Input- output alignment for efficient 3d manipulation learning with vision-language models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URLhttps: //openreview.net/forum?id=ffBF6hYuQv. 11

2025
[50]

Y . Chen, S. Liu, X. Shen, and J. Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12536–12545, 2020

2020
[51]

Reading, A

C. Reading, A. Harakeh, J. Chae, and S. L. Waslander. Categorical depth distribution network for monoc- ular 3d object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8555–8564, 2021

2021
[52]

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InEuropean Conference on Computer Vision (ECCV), pages 1–18. Springer, 2022

2022
[53]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

2023
[54]

B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero-shot stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5249–5260, 2025

2025
[55]

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[56]

Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xu, et al. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27649–27660, 2025

2025
[57]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[58]

J. Wang, M. Chen, S. Zhang, N. Karaev, J. Sch ¨onberger, P. Labatut, P. Bojanowski, D. Novotny, A. Vedaldi, and C. Rupprecht. Vggt-Ω.arXiv preprint arXiv:2605.15195, 2026

Pith/arXiv arXiv 2026
[59]

M. Shi, L. Chen, J. Chen, Y . Lu, C. Liu, G. Ren, P. Luo, D. Huang, M. Yao, and H. Li. Is diversity all you need for scalable robotic manipulation?IEEE Transactions on Robotics (TRO), 2026

2026
[60]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.067

work page doi:10.15607/rss.2024.xx.067 2024
[61]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum´e III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations (ICLR), volume 2025, pages 54277–54296, 2025

2025
[62]

Zhang, Y

J. Zhang, Y . Chen, Y . Xu, Z. Huang, Y . Zhou, Y .-J. Yuan, X. Cai, G. Huang, X. Quan, H. Xu, et al. 4d- vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.Advances in Neural Information Processing Systems (NeurIPS), 38:33914–33937, 2025

2025
[63]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems (NeurIPS), 38:24195–24228, 2025

2025
[64]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Learning to act anywhere with task- centric latent actions. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.014

work page doi:10.15607/rss.2025.xxi.014 2025
[65]

J. Qian, B. Han, C. Shi, L. Xiao, L. Yang, S. Shi, and L. Jiang. Geopredict: Leveraging predictive kinematics and 3d gaussian geometry for precise vla manipulation.arXiv preprint arXiv:2512.16811, 2025

Pith/arXiv arXiv 2025
[66]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.017. 12 This appendix provides supplementary materials to support and expand upon the core methodolo- gies, architectural implementations,...

work page doi:10.15607/rss.2025.xxi.017 2025
[67]

top-down

picking up and orienting a rigid scooping shovel with the opposite hand, 3) executing a deep gran- ular scoop to fill the shovel cavity, and 4) executing multi-limb spatial synchronization to pour the granular materials smoothly into the target cup without spilling. A.3 DexForce W1 Humanoid Setup:Handover Book Hardware and Sensory Configuration:To maximiz...

[1] [1]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research (IJRR), 44(10-11):1684–1704, 2025

2025

[2] [2]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies.The Journal of Machine Learning Research, 17(1):1334–1373, 2016

2016

[3] [3]

Y . Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kram ´ar, R. Hadsell, N. de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills.arXiv preprint, 2018

2018

[4] [4]

Achiam, S

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[5] [5]

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023

[6] [6]

Touvron, T

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[7] [7]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2- vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

Pith/arXiv arXiv 2024

[8] [8]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023

2023

[9] [9]

H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024

2024

[10] [10]

L. Maes, Q. L. Lidec, D. Scieur, Y . LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint- embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

Pith/arXiv arXiv 2026

[11] [11]

Y . Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025

Pith/arXiv arXiv 2025

[12] [12]

K. Team, J. Chen, Y . Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

Pith/arXiv arXiv 2025

[13] [13]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025

2025

[14] [14]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.010. 9

work page doi:10.15607/rss.2025.xx.010 2025

[15] [15]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InThe Thirteenth International Conference on Learning Representa- tions (ICLR), volume 2025, pages 29982–30009, 2025

2025

[16] [16]

RT-1: robotics transformer for real-world control at scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalash- nikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....

work page doi:10.15607/rss.2023.xix.025 2023

[17] [17]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt- 2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023

2023

[18] [18]

Belkhale, T

S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y . Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.049

work page doi:10.15607/rss.2024.xx.049 2024

[19] [19]

A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025

[20] [20]

L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

Pith/arXiv arXiv 2026

[21] [21]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026

Pith/arXiv arXiv 2026

[22] [22]

G. Team, B. Wang, B. Li, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning.arXiv preprint arXiv:2602.12099, 2026

arXiv 2026

[23] [23]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: A 3d vision-language- action generative world model. InInternational Conference on Machine Learning (ICML), pages 61229– 61245. PMLR, 2024

2024

[24] [24]

X. Li, L. Heng, J. Liu, Y . Shen, C. Gu, Z. Liu, H. Chen, N. Han, R. Zhang, H. Tang, et al. 3ds-vla: A 3d spatial-aware vision language action model for robust multi-task manipulation. InConference on Robot Learning (CoRL), pages 2344–2359. PMLR, 2025

2025

[25] [25]

L. Sun, B. Xie, Y . Liu, H. Shi, T. Wang, and J. Cao. Geovla: Empowering 3d representations in vision- language-action models.arXiv preprint arXiv:2508.09071, 2025

arXiv 2025

[26] [26]

Zhang, H

Z. Zhang, H. Li, Y . Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y . Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=fzmittHfq3

2026

[27] [27]

X. Fan, S. Deng, X. Wu, Y . Lu, Z. Li, M. Yan, Y . Zhang, Z. Zhang, H. Wang, and H. Zhao. Any3d-vla: Enhancing vla robustness via diverse point clouds.arXiv preprint arXiv:2602.00807, 2026

Pith/arXiv arXiv 2026

[28] [28]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems (NeurIPS), 36:44776– 44791, 2023

2023

[29] [29]

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025

2025

[30] [30]

H. Lin, S. Chen, J. H. Liew, D. Y . Chen, Z. Li, Y . Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id= yirunib8l8

2026

[31] [31]

Y . Liu, T. Wang, X. Zhang, and J. Sun. Petr: Position embedding transformation for multi-view 3d object detection. InEuropean Conference on Computer Vision (ECCV), pages 531–548. Springer, 2022. 10

2022

[32] [32]

Y . Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3262–3272, 2023

2023

[33] [33]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InProceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023

[34] [34]

Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. InConference on Robot Learning (CoRL), pages 4066–4083. PMLR, 2025

2025

[35] [35]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Man- dlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. InIEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[36] [36]

DROID: A large-scale in-the-wild robot manipulation dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.120

work page doi:10.15607/rss.2024.xx.120 2024

[37] [37]

K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y . Zhao, Z. Xu, G. Yang, S. Fan, X. Wang, F. Liao, Z. Zhao, G. Li, Z. Jin, L. Wang, J. Mao, N. Liu, P. Ren, Q. Zhang, Y . Lyu, M. Liu, H. Jingyang, Y . Luo, Z. Gao, C. Li, C. Gu, Y . Fu, D. Wu, X. Wang, S. Chen, Z. Wang, P. An, S. Qian, S. Zhang, and J. Tang. Robomind: Benchmark on multi-embodiment in...

2025

[38] [38]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.090

work page doi:10.15607/rss.2024.xx.090 2024

[39] [39]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, T. Wang, Y .-Q. Zhang, J. Liu, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision- language-action model. InThe Fourteenth International Conference on Learning Representations (ICLR),

[40] [40]

URLhttps://openreview.net/forum?id=kt51kZH4aG

[41] [41]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π 0.5: a vision-language-action model with open-world generalization. InConference on Robot Learning (CoRL), pages 17–40. PMLR, 2025

2025

[42] [42]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[43] [43]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, J. Gu, Z. Wang, Y . Ding, B. Zhao, D. Wang, and X. Li. Spatialvla: Exploring spatial representations for visual-language-action models. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.011

work page doi:10.15607/rss.2025.xx.011 2025

[44] [44]

T. Yuan, Y . Liu, C. Lu, Z. Chen, T. Jiang, and H. Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning.arXiv preprint arXiv:2510.13375, 2025

arXiv 2025

[45] [45]

S. Deng, M. Yan, Y . Zheng, J. Su, W. Zhang, X. Zhao, H. Cui, Z. Zhang, and H. Wang. Stereovla: Enhancing vision-language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

arXiv 2025

[46] [46]

C. Li, J. Wen, Y . Peng, Y . Peng, and Y . Zhu. Pointvla: Injecting the 3d world into vision-language-action models.IEEE Robotics and Automation Letters (RAL), 11(3):2506–2513, 2026

2026

[47] [47]

Q. Yu, X. Yuan, Y . Jiang, J. Chen, D. Zheng, C. Hao, Y . You, Y . Chen, Y . Mu, L. Liu, et al. Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13170–13177. IEEE, 2025

2025

[48] [48]

F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. ZENG, and H. Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=euMVC1DO4k

2026

[49] [49]

P. Li, Y . Chen, H. Wu, X. Ma, X. Wu, Y . Huang, L. Wang, T. Kong, and T. Tan. Bridgevla: Input- output alignment for efficient 3d manipulation learning with vision-language models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URLhttps: //openreview.net/forum?id=ffBF6hYuQv. 11

2025

[50] [50]

Y . Chen, S. Liu, X. Shen, and J. Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12536–12545, 2020

2020

[51] [51]

Reading, A

C. Reading, A. Harakeh, J. Chae, and S. L. Waslander. Categorical depth distribution network for monoc- ular 3d object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8555–8564, 2021

2021

[52] [52]

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. InEuropean Conference on Computer Vision (ECCV), pages 1–18. Springer, 2022

2022

[53] [53]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations (ICLR), 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

2023

[54] [54]

B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero-shot stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5249–5260, 2025

2025

[55] [55]

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025

[56] [56]

Y . Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y . Zou, M. Xu, et al. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27649–27660, 2025

2025

[57] [57]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[58] [58]

J. Wang, M. Chen, S. Zhang, N. Karaev, J. Sch ¨onberger, P. Labatut, P. Bojanowski, D. Novotny, A. Vedaldi, and C. Rupprecht. Vggt-Ω.arXiv preprint arXiv:2605.15195, 2026

Pith/arXiv arXiv 2026

[59] [59]

M. Shi, L. Chen, J. Chen, Y . Lu, C. Liu, G. Ren, P. Luo, D. Huang, M. Yao, and H. Li. Is diversity all you need for scalable robotic manipulation?IEEE Transactions on Robotics (TRO), 2026

2026

[60] [60]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InProceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.067

work page doi:10.15607/rss.2024.xx.067 2024

[61] [61]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daum´e III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations (ICLR), volume 2025, pages 54277–54296, 2025

2025

[62] [62]

Zhang, Y

J. Zhang, Y . Chen, Y . Xu, Z. Huang, Y . Zhou, Y .-J. Yuan, X. Cai, G. Huang, X. Quan, H. Xu, et al. 4d- vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration.Advances in Neural Information Processing Systems (NeurIPS), 38:33914–33937, 2025

2025

[63] [63]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems (NeurIPS), 38:24195–24228, 2025

2025

[64] [64]

Q. Bu, Y . Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Learning to act anywhere with task- centric latent actions. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.014

work page doi:10.15607/rss.2025.xxi.014 2025

[65] [65]

J. Qian, B. Han, C. Shi, L. Xiao, L. Yang, S. Shi, and L. Jiang. Geopredict: Leveraging predictive kinematics and 3d gaussian geometry for precise vla manipulation.arXiv preprint arXiv:2512.16811, 2025

Pith/arXiv arXiv 2025

[66] [66]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. InProceedings of Robotics: Science and Systems (RSS), LosAngeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.017. 12 This appendix provides supplementary materials to support and expand upon the core methodolo- gies, architectural implementations,...

work page doi:10.15607/rss.2025.xxi.017 2025

[67] [67]

top-down

picking up and orienting a rigid scooping shovel with the opposite hand, 3) executing a deep gran- ular scoop to fill the shovel cavity, and 4) executing multi-limb spatial synchronization to pour the granular materials smoothly into the target cup without spilling. A.3 DexForce W1 Humanoid Setup:Handover Book Hardware and Sensory Configuration:To maximiz...