
arxiv: 2606.02274 · v2 · pith:QZGXX62Bnew · submitted 2026-06-01 · 💻 cs.RO

Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning

Pith reviewed 2026-06-28 14:40 UTC · model grok-4.3

classification 💻 cs.RO
keywords: robotic manipulation, BEV representation, 3D alignment, generalization, vision-language models, vertex map, data pipeline, spatial-temporal alignment

The pith

Mapping pixel-wise 3D information and robot actions into a shared Bird's-Eye-View frame creates view-invariant inputs that support more generalizable manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix two problems in robot manipulation policies built on 2D vision-language models: they ignore the 3D nature of the task, and they lack consistent spatial alignment between what the robot sees and what it does. The method builds a pixel-wise 3D representation called the aligned vertex map, then projects both the visual data and the actions into the same top-down BEV coordinate system, so that the representation stays the same regardless of camera angle or robot body. If this holds, one policy could work across many different robots and camera setups without retraining from scratch. The authors also supply a data pipeline for preparing large datasets with these alignments and a scheme for aligning trajectories over time.

Core claim

The authors claim that constructing BEV images from per-pixel 3D vertex information and aligning actions to the same canonical frame mitigates spatial-temporal misalignments, resulting in improved consistency and generalization for real-world robotic manipulation across diverse embodiments and camera setups.

What carries the argument

The canonical Bird's-Eye-View (BEV) frame, which serves as a shared coordinate system for both 3D visual inputs and robot actions to achieve view-invariance.
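To make the shared-frame idea concrete, here is a minimal NumPy sketch of lifting a depth image to per-pixel 3D points expressed in a canonical frame. It is an illustration under stated assumptions, not the paper's implementation: the function name `vertex_map`, the pinhole camera model, and the 4x4 camera-to-canonical transform `T_canon_cam` are hypothetical, and the case of missing depth is not covered.

```python
import numpy as np

def vertex_map(depth, K, T_canon_cam):
    """Return an (H, W, 3) array of per-pixel 3D points in the canonical frame.

    All names here are illustrative assumptions, not the paper's API.
    depth:        (H, W) metric depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    T_canon_cam:  (4, 4) rigid transform from camera to canonical coordinates
    """
    h, w = depth.shape
    # u is the column index, v is the row index of each pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel: K^-1 [u, v, 1]^T gives a ray with unit z.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get points in the camera frame.
    pts_cam = rays * depth[..., None]
    # Apply the rigid transform to move the points into the canonical frame.
    R, t = T_canon_cam[:3, :3], T_canon_cam[:3, 3]
    return pts_cam @ R.T + t
```

With exact calibration and depth, every camera that observes the same surface point would assign it the same canonical coordinates, which is the property the shared frame relies on.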

If this is right

  • Training policies on BEV-aligned data reduces sensitivity to variations in camera pose.
  • The temporal alignment scheme allows combining trajectories from different robots and human operators.
  • Pretrained VLMs can be used with 3D awareness without losing their generalization benefits.
  • The data processing pipeline supports scaling up training to larger datasets with proper alignments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If successful, this method might allow policies to transfer more easily between simulated and real environments.
  • The BEV construction could be applied to other perception tasks in robotics beyond manipulation.
  • It suggests that many current 2D-based approaches could benefit from similar 3D-to-shared-frame lifts.

Load-bearing premise

Expressing per-pixel 3D information and robot actions into a shared canonical BEV frame will produce a view-invariant representation robust to camera pose variations and embodiment differences.
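The premise can be checked on a toy case. The sketch below (NumPy, with hypothetical helper names `observe` and `lift` and made-up intrinsics and camera poses) projects one canonical-frame point into two cameras, lifts each observation back, and confirms that both agree. It then repeats the lift with a 2-degree error in one camera's extrinsics, to show that the invariance depends on calibration accuracy. None of the numbers come from the paper.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose(R, t):
    """Assemble a 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def observe(p_canon, K, T_canon_cam):
    """Project a canonical-frame point into a camera; return pixel (u, v) and depth."""
    p_cam = np.linalg.inv(T_canon_cam) @ np.append(p_canon, 1.0)
    uvw = K @ p_cam[:3]
    return uvw[:2] / uvw[2], p_cam[2]

def lift(uv, depth, K, T_canon_cam):
    """Back-project a pixel with known depth and express it in the canonical frame."""
    p_cam = depth * (np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0]))
    return (T_canon_cam @ np.append(p_cam, 1.0))[:3]

# Hypothetical setup: one point, two cameras with different poses.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
p = np.array([0.3, -0.2, 2.0])
T_a = pose(np.eye(3), [0.0, 0.0, 0.0])
T_b = pose(rot_z(30.0), [0.5, 0.1, -0.5])

# With exact extrinsics, both views recover the same canonical point.
p_from_a = lift(*observe(p, K, T_a), K, T_a)
p_from_b = lift(*observe(p, K, T_b), K, T_b)

# Same observation from camera B, lifted with extrinsics that are off by 2 degrees.
T_b_bad = pose(rot_z(32.0), [0.5, 0.1, -0.5])
p_bad = lift(*observe(p, K, T_b), K, T_b_bad)
drift = np.linalg.norm(p_bad - p)  # nonzero: the two views no longer agree
```

In this toy setup the miscalibrated view lands roughly a centimetre away from the true point, so any invariance obtained this way is only as good as the extrinsics behind it.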

What would settle it

Evaluate policies trained with the BEV alignment under unseen camera positions and on different robot arms. Success rates significantly lower than under the training conditions would undercut the invariance claim; comparable rates would support it.
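One way to make "significantly lower" operational is a one-sided two-proportion z-test on trial counts. The sketch below uses only the Python standard library. The function name and the example counts are hypothetical and are not results from the paper, and an exact test may be more appropriate for the small trial budgets typical of robot evaluations.

```python
import math

def success_rate_drop(succ_train, n_train, succ_shift, n_shift):
    """One-sided two-proportion z-test: is the shifted-condition success rate lower?

    Returns (rate difference, z statistic, p-value for H1: shifted rate is lower).
    """
    p1 = succ_train / n_train
    p2 = succ_shift / n_shift
    pooled = (succ_train + succ_shift) / (n_train + n_shift)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_train + 1.0 / n_shift))
    if se == 0.0:
        # All trials succeeded or all failed: no evidence of a difference.
        return p1 - p2, 0.0, 1.0
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal, P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return p1 - p2, z, p_value

# Hypothetical counts: 45/50 successes in training-like conditions,
# 30/50 with an unseen camera pose.
diff, z, p_value = success_rate_drop(45, 50, 30, 50)
```

A small p-value here would indicate a real drop under the shifted condition, which is the outcome that would count against the invariance claim.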

Figures

Figures reproduced from arXiv: 2606.02274 by Dekun Lu, Huayi Zhou, Jian Chen, Kangyi Guo, Kui Jia, Ruiji Liu, Sheng Xu, Shichen Xu, Shumin Li, Wei Gao, Wenlve Zhou, Yongyi Su, Zhanqi Zhang, Zixin Huang, Ziyang Zhang.

Figure 1: We introduce Dexterity-BEV (Dex-BEV), a series of technical and systematic contribu…
Figure 2: (a) We propose to construct BEV images and associated vertex maps towards invariance to…
Figure 3: 3D spatial alignment in our data processing pipeline. (a) We develop a customized GUI application for 3D alignment and visualization, as explained in Subsec. 3.4. In (a-f), we show the 3D alignment of representative public and internal datasets, including (a) LIBERO [28], (b) Agibot-Alpha/Beta [54], (c) RoboTwin 2.0 [55], (d) RoboMind 2.0 [37] and our internal datasets (e-f). We also apply a unified TCP f…
Figure 4: Training loss comparison. This corresponds to Tab. 2. We further conduct simulated experiments to examine the generalization to different camera viewpoints and environment layouts. To achieve this, we modify the setup on the LIBERO [28] datasets and platforms. In particular, for each trajectory we randomly modify the third-view camera pose by placing it at a different distance and rotating it relative to…
Figure 5: Qualitative real-world rollouts across different long-horizon complex tasks. Distinct keyframes demonstrate successful autonomous executions on diverse bimanual robotic platforms involving articulated, deformable, and granular objects (from left to right): Fold Mailer Box and Fold Cloth on Agilex, Scoop Popcorn and Handover Book on the DexForce W1 humanoid, and Fold Cloth on the DexForce A1 semi-humanoid.…
Figure 6: Hardware, Platforms and Teleoperation Data Collection Interfaces. From left to right: (a) the Agilex dual-arm robot platform using grippers, (b) the W1 humanoid robot platform using dexterous hands, (c) the W1 humanoid robot platform using grippers, and (d) the A1 semi-humanoid robot platform using grippers. A.1 Agilex Bimanual Setup: Fold Mailer Box & Fold Cloth Hardware and Sensory Configuration: Followi…
Figure 7: Examples of modified LIBERO. From top to bottom, these represent observations with no parameters modified, observations with modifications to the robot and workstation, and observations with only the camera pose modified, respectively. C.1 Simulation Benchmarks and Ablation Studies For the LIBERO [28] and RoboTwin 2.0 [55] benchmarks, we first use the official setups regarding robots, training data and ev…
Figure 8: Real-world bimanual platform and task execution. (Left) Configuration of the Agilex bimanual robotic platform used for data collection and inference. (Right) Detailed view of target objects and sequential keyframes from autonomous rollouts of two challenging long-horizon tasks: Fold Mailer Box (articulated) and Fold Cloth (deformable). These snapshots illustrate Dex-BEV’s capability in handling complex spa…
Figure 9: Rollouts of Self-Recovery and Orientation Invariance. For task Fold Mailer Box, we demonstrate the qualitative results under the OOD trials. It is best to zoom in to view the details. (2) Continuous Operation Facility: Still for the task Fold Mailer Box, we demonstrate the policy’s capacity for continuous, multi-cycle operation in the Supplementary Videos. After completing a box, the right arm clears the…
Figure 10: Rollouts of Zero-Shot Instance Generalization. For task Fold Cloth, we demonstrate the qualitative results under the OOD trials. It is best to zoom in to view the details. C.3 DexForce W1 Humanoid Evaluations: Scoop Popcorn Multi-View Keyframe Breakdowns
Figure 11: Bimanual humanoid platform and long-horizon task execution. (Left) Detailed configuration of the DexForce W1 humanoid robot with two dexterous hands and its operation environment. (Right) Multi-view keyframes showcasing the autonomous rollout of the Scoop Popcorn task. This complex, long-horizon sequence requires fine-grained bimanual coordination to manipulate the paper cup while simultaneously scoopi…
Figure 12: Robustness to dynamic interference. Sequential snapshots from the cup-grasping phase of the popcorn scooping task on the DexForce W1 platform. The images demonstrate the model’s real-time reactivity. Despite random manual displacements of the target cup by two different users, Dex-BEV successfully recalibrates the motion trajectory to achieve a successful grasp. This highlights the closed-loop robustness…
Figure 13: Multi-modal interactive handover book task. (Left) DexForce W1 humanoid platform with two grippers and its workspace setup. (Right) Successive keyframes of the robot executing a Handover Book task conditioned on different language instructions. The sequences highlight Dex-BEV’s ability to interpret color-specific commands and perform precise, interactive maneuvers for handing over target objects to a huma…
Figure 14 documents the cloth folding task executed on the A1 semi-humanoid platform, showing Dex-BEV’s capability for distinct state segmentation and human-like rollout trajectories. The policy successfully handles two distinct structural initialization states: a flat canonical layout and an unconstrained, crumpled pile. Dex-BEV accurately segments the task phases, flattening the crumpled garment prior to e…
Original abstract

End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show the promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) the reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) the lack of spatial 3D alignment between input-output spaces as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce aligned vertex map and vertex spectrum -- a pixel-wise 3D representation that elevates 2D visual inputs to 3D, using camera calibration and optional depth. This novel input representation marries 3D awareness with the generalization of 2D large VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing per-pixel 3D information of each camera view and robot actions to a shared coordinate. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and innovatively propose to construct BEV images, producing a view-invariant representation robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate input and output spatial-temporal misalignments, improving the consistency and generalization for real-world manipulation. Pretrained checkpoint, source code and data processing pipeline are available in https://hnuzhy.github.io/projects/Dex-BEV.
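As an illustration of how a BEV image could be built from per-pixel 3D points, the sketch below rasterizes canonical-frame points into a top-down grid and keeps the highest point in each cell. The abstract does not specify the construction, so the function name `bev_image`, the workspace bounds, the cell size, and the keep-the-highest-point rule are all assumptions made for this example.

```python
import numpy as np

def bev_image(points, colors, x_range=(-0.5, 0.5), y_range=(-0.5, 0.5), res=0.01):
    """Rasterize canonical-frame points into a top-down RGB image and a height map.

    Hypothetical sketch, not the paper's method.
    points: (N, 3) xyz in the canonical frame, with z pointing up
    colors: (N, 3) uint8 RGB value per point
    Cells that receive no point stay black, with height NaN.
    """
    w = int(round((x_range[1] - x_range[0]) / res))
    h = int(round((y_range[1] - y_range[0]) / res))
    col = np.floor((points[:, 0] - x_range[0]) / res).astype(int)
    row = np.floor((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (col >= 0) & (col < w) & (row >= 0) & (row < h)
    cell = row[keep] * w + col[keep]          # flat index of each point's cell
    pts, rgb = points[keep], colors[keep]

    # Sort by cell first and height second, so that the last entry
    # within each cell group is that cell's highest point.
    order = np.lexsort((pts[:, 2], cell))
    cell, pts, rgb = cell[order], pts[order], rgb[order]
    top = np.ones(cell.size, dtype=bool)
    top[:-1] = cell[1:] != cell[:-1]

    image = np.zeros((h * w, 3), dtype=np.uint8)
    height = np.full(h * w, np.nan)
    image[cell[top]] = rgb[top]
    height[cell[top]] = pts[top, 2]
    return image.reshape(h, w, 3), height.reshape(h, w)
```

Keeping the highest point per cell mimics what an overhead camera would see; other choices, such as averaging or keeping several height slices, are equally plausible readings of the abstract.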

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that end-to-end manipulation policies using 2D VLMs suffer from ignoring 3D structure and from lacking spatial alignment across inputs, outputs, robots, cameras, and datasets. It introduces the aligned vertex map and vertex spectrum as a pixel-wise 3D input representation derived from camera calibration and optional depth; aligns per-pixel 3D information and robot actions into a shared coordinate system; constructs canonical BEV images for view-invariant representations; develops a data processing pipeline for alignments at scale; and proposes a novel temporal alignment scheme for trajectories across robots and human operators. These steps are asserted to mitigate spatial-temporal misalignments and improve consistency and generalization, with code, checkpoint, and pipeline released.

Significance. If the BEV alignment produces the claimed invariance, the work could meaningfully advance scalable, generalizable robot manipulation by bridging 2D foundation models with explicit 3D and action alignment. The public release of pretrained checkpoint, source code, and data pipeline is a clear strength for reproducibility and follow-on work.

major comments (1)
  1. [Abstract] Abstract: The central claim that mapping per-pixel 3D inputs and robot actions into a shared canonical BEV frame yields a representation 'robust to ... embodiment differences' rests on an unverified assumption. The description provides no mechanism (e.g., forward kinematics, learned adapters, or normalization of joint limits/action dimensionality) to resolve kinematic differences across embodiments; coordinate transformation alone does not guarantee invariance when action spaces differ in structure and semantics. This assumption is load-bearing for the generalization claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive comment. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that mapping per-pixel 3D inputs and robot actions into a shared canonical BEV frame yields a representation 'robust to ... embodiment differences' rests on an unverified assumption. The description provides no mechanism (e.g., forward kinematics, learned adapters, or normalization of joint limits/action dimensionality) to resolve kinematic differences across embodiments; coordinate transformation alone does not guarantee invariance when action spaces differ in structure and semantics. This assumption is load-bearing for the generalization claim.

    Authors: We agree the abstract is brief and does not spell out the action transformation details. The full manuscript (Section 3.2 and the data pipeline in Section 4) describes expressing robot actions in the shared canonical BEV frame via forward kinematics and camera extrinsics, so that the policy outputs 3D end-effector deltas rather than joint commands. This produces a common action representation across embodiments. Experiments in Section 5 demonstrate cross-robot generalization using this alignment. We will add a concise sentence to the abstract and a short clarifying paragraph in Section 3.2 in the revision.

    revision: partial
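The alignment described in this response can be sketched as follows, assuming each robot's forward kinematics already yields an end-effector pose `T_base_ee` in its own base frame and that calibration provides a base-to-canonical transform `T_canon_base`. Both names, the helper functions, the mounting poses, and the rotation-delta convention are hypothetical choices for illustration; the paper's exact action parameterization is not given here.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose(R, t):
    """Assemble a 4x4 rigid transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def to_canonical(T_canon_base, T_base_ee):
    """Express a base-frame end-effector pose in the canonical frame."""
    return T_canon_base @ T_base_ee

def ee_delta(T_prev, T_next):
    """Translation and rotation delta between two poses given in the same frame.

    The rotation delta is taken in the fixed frame (an assumed convention).
    """
    dp = T_next[:3, 3] - T_prev[:3, 3]
    dR = T_next[:3, :3] @ T_prev[:3, :3].T
    return dp, dR

# One motion defined in the canonical frame: move 5 cm along +x.
T0 = pose(np.eye(3), [0.40, 0.00, 0.20])
T1 = pose(np.eye(3), [0.45, 0.00, 0.20])

# Two hypothetical robots mounted differently relative to the canonical frame.
T_canon_base_a = pose(np.eye(3), [0.0, 0.0, 0.0])
T_canon_base_b = pose(rot_z(90.0), [0.8, -0.3, 0.0])

def base_frame_poses(T_canon_base):
    """Poses each robot's forward kinematics would report for the same motion."""
    inv = np.linalg.inv(T_canon_base)
    return inv @ T0, inv @ T1

# In each robot's own base frame the same motion looks different...
dp_base_a, _ = ee_delta(*base_frame_poses(T_canon_base_a))
dp_base_b, _ = ee_delta(*base_frame_poses(T_canon_base_b))

# ...but after mapping to the canonical frame the deltas coincide.
dp_canon_a, _ = ee_delta(*(to_canonical(T_canon_base_a, T) for T in base_frame_poses(T_canon_base_a)))
dp_canon_b, _ = ee_delta(*(to_canonical(T_canon_base_b, T) for T in base_frame_poses(T_canon_base_b)))
```

This only shows that end-effector deltas become frame-independent. Differences in gripper type, reachable workspace, or degrees of freedom, which the referee also raises, are outside what a coordinate change can resolve.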

Circularity Check

0 steps flagged

No circularity: methodological design choices presented without self-referential reduction

Full rationale

The paper's abstract and description outline a sequence of engineering contributions: a pixel-wise 3D vertex map representation, alignment of inputs/outputs to a shared coordinate, construction of canonical BEV images, a data processing pipeline, and a temporal alignment scheme. None of these steps are shown to reduce to their own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, uniqueness theorems, or ansatzes are quoted that would create a definitional loop. The invariance claim is framed as the result of the proposed alignment procedure itself, not derived from prior author work in a circular manner. This is a standard non-circular description of a new method pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that 3D alignment via camera calibration plus a canonical BEV frame will remove misalignments across embodiments and cameras; no free parameters or invented entities with independent evidence are described.

axioms (1)
  • domain assumption: Aligning per-pixel 3D information and robot actions to a shared canonical BEV frame produces a view-invariant representation robust to camera pose variations.
    Invoked in the abstract as the justification for constructing BEV images.
invented entities (2)
  • aligned vertex map and vertex spectrum (no independent evidence)
    purpose: Pixel-wise 3D representation that elevates 2D visual inputs to 3D using camera calibration and optional depth.
    New representation introduced to marry 3D awareness with 2D VLM generalization.
  • BEV images (no independent evidence)
    purpose: View-invariant representation for inputs and outputs.
    Constructed from the aligned 3D information.



Reference graph

Works this paper leans on

67 extracted references · 10 canonical work pages

  1. [1]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research (IJRR), 44(10-11):1684–1704, 2025.

  2. [2]

    S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

  3. [3]

    Y. Zhu, Z. Wang, J. Merel, A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, et al. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint, 2018.

  4. [4]

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  5. [5]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  6. [6]

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  7. [7]

    P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  8. [8]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023.

  9. [9]

    H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024.

  10. [10]

    L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026.

  11. [11]

    Y. Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

  12. [12]

    K. Team, J. Chen, Y. Ci, X. Du, Z. Feng, K. Gai, S. Guo, F. Han, J. He, K. He, et al. Kling-omni technical report. arXiv preprint arXiv:2512.16776, 2025.

  13. [13]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), pages 2679–2713. PMLR, 2025.

  14. [14]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. In Proceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.010.

  15. [15]

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations (ICLR), volume 2025, pages 29982–30009, 2025.

  16. [16]

    RT-1: robotics transformer for real-world control at scale

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....

  17. [17]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL), pages 2165–2183. PMLR, 2023.

  18. [18]

    S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. In Proceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.049.

  19. [19]

    A. Ali, J. Bai, M. Bala, Y. Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y.-W. Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025.

  20. [20]

    L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026.

  21. [21]

    S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949, 2026.

  22. [22]

    G. Team, B. Wang, B. Li, C. Ni, G. Huang, G. Zhao, H. Li, J. Li, J. Lv, J. Liu, et al. Gigabrain-0.5 m*: a vla that learns from world model-based reinforcement learning. arXiv preprint arXiv:2602.12099, 2026.

  23. [23]

    H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan. 3d-vla: A 3d vision-language-action generative world model. In International Conference on Machine Learning (ICML), pages 61229–61245. PMLR, 2024.

  24. [24]

    X. Li, L. Heng, J. Liu, Y. Shen, C. Gu, Z. Liu, H. Chen, N. Han, R. Zhang, H. Tang, et al. 3ds-vla: A 3d spatial-aware vision language action model for robust multi-task manipulation. In Conference on Robot Learning (CoRL), pages 2344–2359. PMLR, 2025.

  25. [25]

    L. Sun, B. Xie, Y. Liu, H. Shi, T. Wang, and J. Cao. Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071, 2025.

  26. [26]

    Z. Zhang, H. Li, Y. Dai, Z. Zhu, L. Zhou, C. Liu, D. Wang, F. E. H. Tay, S. Chen, Z. Liu, Y. Liu, X. Li, and P. Zhou. From spatial to actions: Grounding vision-language-action model in spatial foundation priors. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=fzmittHfq3

  27. [27]

    X. Fan, S. Deng, X. Wu, Y. Lu, Z. Li, M. Yan, Y. Zhang, Z. Zhang, H. Wang, and H. Zhao. Any3d-vla: Enhancing vla robustness via diverse point clouds. arXiv preprint arXiv:2602.00807, 2026.

  28. [28]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems (NeurIPS), 36:44776–44791, 2023.

  29. [29]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5294–5306, 2025.

  30. [30]

    H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, Y. Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=yirunib8l8

  31. [31]

    Y. Liu, T. Wang, X. Zhang, and J. Sun. Petr: Position embedding transformation for multi-view 3d object detection. In European Conference on Computer Vision (ECCV), pages 531–548. Springer, 2022.

  32. [32]

    Y. Liu, J. Yan, F. Jia, S. Li, A. Gao, T. Wang, and X. Zhang. Petrv2: A unified framework for 3d perception from multi-camera images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3262–3272, 2023.

  33. [33]

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS), Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016.

  34. [34]

    Z. Fu, T. Z. Zhao, and C. Finn. Mobile aloha: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In Conference on Robot Learning (CoRL), pages 4066–4083. PMLR, 2025.

  35. [35]

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

  36. [36]

    DROID: A large-scale in-the-wild robot manipulation dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. In Proceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.120.

  37. [37]

    K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, Z. Yang, M. Li, Y. Zhao, Z. Xu, G. Yang, S. Fan, X. Wang, F. Liao, Z. Zhao, G. Li, Z. Jin, L. Wang, J. Mao, N. Liu, P. Ren, Q. Zhang, Y. Lyu, M. Liu, H. Jingyang, Y. Luo, Z. Gao, C. Li, C. Gu, Y. Fu, D. Wu, X. Wang, S. Chen, Z. Wang, P. An, S. Qian, S. Zhang, and J. Tang. Robomind: Benchmark on multi-embodiment in...

  38. [38]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.090.

  39. [39]

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, T. Wang, Y.-Q. Zhang, J. Liu, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. In The Fourteenth International Conference on Learning Representations (ICLR),

  40. [40]

    URL https://openreview.net/forum?id=kt51kZH4aG

  41. [41]

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, et al. π0.5: a vision-language-action model with open-world generalization. In Conference on Robot Learning (CoRL), pages 17–40. PMLR, 2025.

  42. [42]

    S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026.

  43. [43]

    D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, J. Gu, Z. Wang, Y. Ding, B. Zhao, D. Wang, and X. Li. Spatialvla: Exploring spatial representations for visual-language-action models. In Proceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XX.011.

  44. [44]

    T. Yuan, Y. Liu, C. Lu, Z. Chen, T. Jiang, and H. Zhao. Depthvla: Enhancing vision-language-action models with depth-aware spatial reasoning. arXiv preprint arXiv:2510.13375, 2025.

  45. [45]

    S. Deng, M. Yan, Y. Zheng, J. Su, W. Zhang, X. Zhao, H. Cui, Z. Zhang, and H. Wang. Stereovla: Enhancing vision-language-action models with stereo vision. arXiv preprint arXiv:2512.21970, 2025.

  46. [46]

    C. Li, J. Wen, Y. Peng, Y. Peng, and Y. Zhu. Pointvla: Injecting the 3d world into vision-language-action models. IEEE Robotics and Automation Letters (RAL), 11(3):2506–2513, 2026.

  47. [47]

    Q. Yu, X. Yuan, Y. Jiang, J. Chen, D. Zheng, C. Hao, Y. You, Y. Chen, Y. Mu, L. Liu, et al. Artgs: 3d gaussian splatting for interactive visual-physical modeling and manipulation of articulated objects. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13170–13177. IEEE, 2025.

  48. [48]

    F. Li, W. Song, H. Zhao, J. Wang, P. Ding, D. Wang, L. ZENG, and H. Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=euMVC1DO4k

  49. [49]

    P. Li, Y. Chen, H. Wu, X. Ma, X. Wu, Y. Huang, L. Wang, T. Kong, and T. Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id=ffBF6hYuQv

  50. [50]

    Y. Chen, S. Liu, X. Shen, and J. Jia. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12536–12545, 2020.

  51. [51]

    C. Reading, A. Harakeh, J. Chae, and S. L. Waslander. Categorical depth distribution network for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8555–8564, 2021.

  52. [52]

    Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European Conference on Computer Vision (ECCV), pages 1–18. Springer, 2022.

  53. [53]

    Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

  54. [54]

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield. Foundationstereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5249–5260, 2025.

  55. [55]

    Q. Bu, J. Cai, L. Chen, X. Cui, Y. Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025.

  56. [56]

    Y. Mu, T. Chen, Z. Chen, S. Peng, Z. Lan, Z. Gao, Z. Liang, Q. Yu, Y. Zou, M. Xu, et al. Robotwin: Dual-arm robot benchmark with generative digital twins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27649–27660, 2025.

  57. [57]

    T. Chen, Z. Chen, B. Chen, Z. Cai, Y. Liu, Z. Li, Q. Liang, X. Lin, Y. Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025.

  58. [58]

    J. Wang, M. Chen, S. Zhang, N. Karaev, J. Schönberger, P. Labatut, P. Bojanowski, D. Novotny, A. Vedaldi, and C. Rupprecht. Vggt-Ω. arXiv preprint arXiv:2605.15195, 2026.

  59. [59]

    M. Shi, L. Chen, J. Chen, Y. Lu, C. Liu, G. Ren, P. Luo, D. Huang, M. Yao, and H. Li. Is diversity all you need for scalable robotic manipulation? IEEE Transactions on Robotics (TRO), 2026.

  60. [60]

    Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), Delft, Netherlands, July 2024. doi:10.15607/RSS.2024.XX.067.

  61. [61]

    R. Zheng, Y. Liang, S. Huang, J. Gao, H. Daumé III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In International Conference on Learning Representations (ICLR), volume 2025, pages 54277–54296, 2025.

  62. [62]

    J. Zhang, Y. Chen, Y. Xu, Z. Huang, Y. Zhou, Y.-J. Yuan, X. Cai, G. Huang, X. Quan, H. Xu, et al. 4d-vla: Spatiotemporal vision-language-action pretraining with cross-scene calibration. Advances in Neural Information Processing Systems (NeurIPS), 38:33914–33937, 2025.

  63. [63]

    W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. Advances in Neural Information Processing Systems (NeurIPS), 38:24195–24228, 2025.

  64. [64]

    Q. Bu, Y. Yang, J. Cai, S. Gao, G. Ren, M. Yao, P. Luo, and H. Li. Learning to act anywhere with task-centric latent actions. In Proceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.014.

  65. [65]

    J. Qian, B. Han, C. Shi, L. Xiao, L. Yang, S. Shi, and L. Jiang. Geopredict: Leveraging predictive kinematics and 3d gaussian geometry for precise vla manipulation. arXiv preprint arXiv:2512.16811, 2025.

  66. [66]

    M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. In Proceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.017. This appendix provides supplementary materials to support and expand upon the core methodologies, architectural implementations,...

  67. [67]

    top-down

    picking up and orienting a rigid scooping shovel with the opposite hand, 3) executing a deep granular scoop to fill the shovel cavity, and 4) executing multi-limb spatial synchronization to pour the granular materials smoothly into the target cup without spilling. A.3 DexForce W1 Humanoid Setup: Handover Book Hardware and Sensory Configuration: To maximiz...