pith. machine review for the scientific record.

arxiv: 2605.09989 · v1 · submitted 2026-05-11 · 💻 cs.RO · cs.CV

Recognition: no theorem link

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords stereo vision · robotic manipulation · visuomotor policies · imitation learning · diffusion policies · stereo transformer · geometric reasoning

The pith

StereoPolicy processes synchronized stereo image pairs with 2D encoders and a fusion transformer to improve robotic manipulation policies without explicit 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StereoPolicy as a visuomotor policy framework that takes stereo image pairs as input to address the lack of reliable depth and spatial cues in monocular observations. Pretrained 2D vision encoders handle each image separately before a Stereo Transformer fuses the representations to capture correspondence and disparity information implicitly. This design integrates directly with diffusion-based and vision-language-action policies. The approach is evaluated across simulation benchmarks and real-robot tabletop and bimanual tasks, showing gains over RGB, RGB-D, point cloud, and multi-view inputs. A reader would care because precise manipulation in cluttered scenes often fails without better geometric awareness, and this method offers a practical bridge from existing 2D models to 3D understanding.

Core claim

StereoPolicy directly leverages synchronized stereo image pairs to strengthen geometric reasoning in robot policies. It employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues without requiring explicit 3D reconstruction or camera calibration. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action policies and delivers consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines.

What carries the argument

The Stereo Transformer, which fuses feature representations from independent 2D encoders applied to each image in a stereo pair to extract implicit spatial and disparity information.
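
As a concreteness aid, here is a minimal sketch of that fusion pattern in PyTorch, assuming a ViT-style patch tokenizer and a generic transformer encoder; module names, sizes, and the pooling are illustrative assumptions, not the authors' implementation:

```python
# Minimal stereo-fusion sketch: shared 2D encoder per view, then a
# transformer over the concatenated token sets. Illustrative only.
import torch
import torch.nn as nn

class StereoFusionSketch(nn.Module):
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        # Stand-in for a pretrained 2D backbone (e.g. a ViT); a toy patch
        # embedding keeps the sketch runnable without external weights.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.view_embed = nn.Parameter(torch.zeros(2, 1, dim))  # left/right tags
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, depth)  # "Stereo Transformer" stand-in

    def forward(self, left, right):
        # Each view is encoded independently by the shared 2D encoder.
        tok_l = self.patch_embed(left).flatten(2).transpose(1, 2)   # (B, N, dim)
        tok_r = self.patch_embed(right).flatten(2).transpose(1, 2)  # (B, N, dim)
        # Tag the views so attention can tell them apart, then fuse jointly.
        tokens = torch.cat([tok_l + self.view_embed[0],
                            tok_r + self.view_embed[1]], dim=1)     # (B, 2N, dim)
        fused = self.fusion(tokens)
        return fused.mean(dim=1)  # pooled feature for a downstream policy head

feat = StereoFusionSketch()(torch.randn(1, 3, 224, 224),
                            torch.randn(1, 3, 224, 224))  # -> (1, 768)
```

The load-bearing detail is what the sketch omits: there is no cost volume, epipolar constraint, or calibration input, so any correspondence or disparity signal has to emerge from attention over the joint token sequence.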

If this is right

  • Consistent gains over RGB, RGB-D, point cloud, and multi-view baselines hold across RoboMimic, RoboCasa, and OmniGibson simulation environments.
  • The same framework transfers to real-robot tabletop and bimanual mobile manipulation without additional calibration.
  • StereoPolicy combines directly with both diffusion policies and pretrained vision-language-action models.
  • Stereo vision functions as a scalable input modality that connects 2D pretrained representations to 3D geometric reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standard stereo camera rigs could replace depth sensors in many manipulation setups while retaining compatibility with existing 2D encoders.
  • The implicit fusion may generalize more readily than explicit 3D reconstruction when camera intrinsics vary across deployments.
  • Similar stereo fusion could be tested on navigation or grasping tasks where precise relative positioning matters.
  • An open extension is whether adding a lightweight explicit stereo-matching head on top of the transformer would yield further gains.

Load-bearing premise

That independent processing of each stereo view by pretrained 2D encoders followed by transformer-based fusion is sufficient to recover the needed spatial correspondence and depth cues.

What would settle it

An ablation or comparison experiment in which removing the stereo fusion step or switching to monocular input eliminates the reported performance gains on depth-critical manipulation tasks.
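
A minimal sketch of that protocol, with train_and_evaluate as a hypothetical stand-in for the full training-and-rollout loop (not code from the paper):

```python
# Settling-experiment sketch: train otherwise-identical policies under
# each input condition and compare success rates on a depth-critical task.
from statistics import mean

CONDITIONS = {
    "stereo + fusion":   dict(views=2, fuse=True),
    "stereo, no fusion": dict(views=2, fuse=False),  # e.g. feature concat only
    "monocular":         dict(views=1, fuse=False),
}

def train_and_evaluate(task, views, fuse, seed):
    """Placeholder: train a policy with this input configuration and
    return its rollout success rate in [0, 1]."""
    raise NotImplementedError

def run_ablation(task="ToolHang", seeds=(0, 1, 2)):
    return {name: mean(train_and_evaluate(task, seed=s, **cfg) for s in seeds)
            for name, cfg in CONDITIONS.items()}
```

The geometric-reasoning interpretation holds only if the first condition clearly beats the other two on depth-critical tasks; if the gap survives without fusion, the extra view alone is doing the work.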

Figures

Figures reproduced from arXiv: 2605.09989 by Evans Han, Haoyue Xiao, Huang Huang, Jiajun Wu, Jianwen Xie, Li Fei-Fei, Ruohan Zhang, Yingke Wang, Yunfan Jiang.

Figure 1. Compared to traditional visual modalities for robot learning, stereo input provides certain …
Figure 2. StereoPolicy Pipeline. Stereo inputs are encoded by a vision backbone, fused with a Stereo Transformer, and applied to both diffusion-policy training and finetuning VLA baselines.
Figure 3. Real-World Task Visualization. Top: tabletop tasks. Bottom: mobile manipulation tasks.
Figure 4. Simulation Task Visualization, from three benchmarks: OmniGibson (4 tasks), RoboCasa (24 tasks), and RoboMimic (3 tasks). The surrounding text lists 7 real-world tasks spanning tabletop manipulation (Banana PnP, Toast Insert, Cup Hang, Steel Cup Hang, Glass Cup Hang) and mobile manipulation (PnP Toast, Turn on Radio).
Figure 5. RGB-D and PCD are fragile in real: the glass cup is entirely missing. The surrounding text gives the evaluation protocol: real-robot results are averaged over 20 trials with randomized initial poses; simulation tasks use 50 rollouts every 50 training epochs (RoboMimic) or every 250 epochs (OmniGibson), reporting the highest success rate achieved across training.
Figure 7. StereoPolicy-VLA (Pi0.5) performance on bimanual mobile manipulation tasks in both real-world and simulation. The surrounding text (Q2) reports that StereoPolicy enhances pretrained VLA models even though those models are trained on monocular data, incorporating it into Pi0.5 and GR00T-N1.5 for fine-tuning.
Figure 8. Performance of StereoPolicy-DP across different camera angles. The surrounding text (Q3) reports that StereoPolicy-DP is most effective when the stereo baseline is approximately 10% of the target object distance: varying the baseline (2 cm, 6 cm, 10 cm) and camera–object distance (0.6–1.0 m) while holding other conditions fixed shows that performance is governed not by either factor alone but by their ratio r = stereo baseline / camera–object distance.
Figure 10. Vision Encoder and Component Ablation, on the ToolHang task with 100 demos. The surrounding text (Q4) reports that the choice of vision backbone significantly influences StereoPolicy-DP's performance, particularly in low-data regimes.
Figure 11. Trajectory of real-world tabletop tasks.
Figure 12. Stereo camera views across different baselines and distances. Baseline indicates the …
Figure 13. Camera angle view visualization.
Figure 14. Failure cases of baseline visual modalities on real-world tabletop tasks: RGB, RGB-D, …
Figure 15. Monocular RGB failures in bimanual mobile manipulation.
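
Figure 8's ratio finding is easy to sanity-check numerically; a quick pass over the baseline and distance values quoted above (the ~0.1 sweet spot is the paper's reported result, not derived here):

```python
# Ratio r = stereo baseline / camera-object distance for the reported
# configurations; flags pairs near the paper's ~0.1 sweet spot.
baselines_m = (0.02, 0.06, 0.10)   # 2 cm, 6 cm, 10 cm
distances_m = (0.6, 0.8, 1.0)
for b in baselines_m:
    for d in distances_m:
        r = b / d
        tag = "  <- near 0.1" if abs(r - 0.10) < 0.03 else ""
        print(f"baseline {b*100:.0f} cm, distance {d:.1f} m -> r = {r:.3f}{tag}")
```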
read the original abstract

Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces StereoPolicy, a visuomotor policy framework for robotic manipulation that processes synchronized stereo image pairs using independent pretrained 2D vision encoders whose outputs are fused by a Stereo Transformer. This design is claimed to implicitly capture spatial correspondence and disparity cues without explicit 3D reconstruction or camera calibration. The framework integrates with diffusion-based and vision-language-action policies and is reported to deliver consistent improvements over RGB, RGB-D, point-cloud, and multi-view baselines on the RoboMimic, RoboCasa, and OmniGibson simulation benchmarks, with additional validation on real-robot tabletop and bimanual mobile manipulation tasks.

Significance. If the empirical gains are robust and can be attributed to stereo-induced geometric reasoning rather than increased input capacity, the work would offer a practical, calibration-free route to strengthen spatial awareness in existing 2D-pretrained policy architectures. This could be valuable for scaling manipulation policies in cluttered or geometrically complex scenes where monocular depth cues are insufficient.

major comments (2)
  1. [Abstract] The central claim of 'consistent improvements' across three simulation benchmarks and real-robot settings is asserted without quantitative metrics, success rates, error bars, or statistical-significance tests in the provided abstract; the full results must be examined to verify that the gains are large enough to support the geometric-reasoning interpretation.
  2. [Method] The assertion that independent 2D-pretrained encoders plus generic transformer fusion 'implicitly capture spatial correspondence and disparity cues' lacks supporting evidence such as attention-map visualizations, disparity-estimation accuracy, or an ablation that applies the identical fusion module to non-stereo multi-view pairs; without these, the reported gains could be explained by the extra synchronized view alone rather than by 3D understanding.
minor comments (2)
  1. The paper should provide explicit details on training procedures, hyperparameters, and data augmentation for both simulation and real-robot experiments to support reproducibility.
  2. Figure captions for real-robot experiments would benefit from additional description of camera setup, baseline comparisons, and failure modes observed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve the presentation of quantitative results and to provide additional supporting evidence for the Stereo Transformer's role in capturing geometric cues.

read point-by-point responses
  1. Referee: [Abstract] The central claim of 'consistent improvements' across three simulation benchmarks and real-robot settings is asserted without quantitative metrics, success rates, error bars, or statistical-significance tests in the provided abstract; the full results must be examined to verify that the gains are large enough to support the geometric-reasoning interpretation.

    Authors: We agree that the abstract would be strengthened by including concrete metrics. In the revised manuscript we have added representative success-rate improvements and consistency notes drawn from the full experimental tables. The complete results (Sections 4–5) report means and standard deviations over multiple random seeds together with direct comparisons to all baselines; the observed gains remain larger than those obtained from multi-view or RGB-D inputs, supporting the geometric-reasoning interpretation. revision: yes

  2. Referee: [Method] The assertion that independent 2D-pretrained encoders plus generic transformer fusion 'implicitly capture spatial correspondence and disparity cues' lacks supporting evidence such as attention-map visualizations, disparity-estimation accuracy, or an ablation that applies the identical fusion module to non-stereo multi-view pairs; without these, the reported gains could be explained by the extra synchronized view alone rather than by 3D understanding.

    Authors: We acknowledge that additional evidence strengthens the claim. The existing multi-view baselines already apply comparable fusion to non-stereo image pairs and yield smaller gains than stereo pairs, indicating that the benefit is not explained by input count alone. In the revision we have added attention-map visualizations (new figure in the supplement) that illustrate cross-view correspondence only when stereo pairs are used. We do not report explicit disparity-estimation accuracy because the architecture is trained end-to-end for policy performance rather than depth prediction; the policy-level ablations and real-robot results in geometrically demanding tasks serve as the primary validation. revision: partial
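
A minimal sketch of the cross-view attention check the rebuttal describes, for readers who want to probe it on their own models; the tensors are toys and all names and shapes are assumptions, not the paper's code:

```python
# Measure how much attention mass left-view queries place on right-view
# keys in one attention layer over concatenated [left | right] tokens.
import torch
import torch.nn as nn

N, dim = 196, 768                      # tokens per view, feature size
tokens = torch.randn(1, 2 * N, dim)    # [left | right] token sequence
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)  # (1, 2N, 2N)

cross = weights[0, :N, N:].mean().item()   # left -> right attention mass
within = weights[0, :N, :N].mean().item()  # left -> left attention mass
print(f"left->right {cross:.4f} vs left->left {within:.4f}")
```

On a trained stereo model, a pronounced left-to-right band along matching positions is the qualitative signature the rebuttal's supplementary visualizations would need to show.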

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on benchmarks

full rationale

The paper introduces StereoPolicy as an empirical architecture: pretrained 2D encoders process stereo pairs independently, a Stereo Transformer fuses them, and the resulting policy is trained and tested on RoboMimic, RoboCasa, and OmniGibson, plus real-robot tasks. No equations, closed-form derivations, or predictions are presented that would make the claimed gains true by construction. The implicit-capture claim is a hypothesis tested via ablations and baseline comparisons rather than a tautological redefinition. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The work is validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the Stereo Transformer learns useful disparity cues from pretrained 2D features.

pith-pipeline@v0.9.0 · 5522 in / 1109 out tokens · 48951 ms · 2026-05-12T03:52:55.765366+00:00 · methodology

