pith. machine review for the scientific record.

arxiv: 2605.09989 · v1 · submitted 2026-05-11 · 💻 cs.RO · cs.CV

Recognition: no theorem link

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:52 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords stereo vision · robotic manipulation · visuomotor policies · imitation learning · diffusion policies · stereo transformer · geometric reasoning

The pith

StereoPolicy processes synchronized stereo image pairs with 2D encoders and a fusion transformer to improve robotic manipulation policies without explicit 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StereoPolicy as a visuomotor policy framework that takes stereo image pairs as input to address the lack of reliable depth and spatial cues in monocular observations. Pretrained 2D vision encoders handle each image separately before a Stereo Transformer fuses the representations to capture correspondence and disparity information implicitly. This design integrates directly with diffusion-based and vision-language-action policies. The approach is evaluated across simulation benchmarks and real-robot tabletop and bimanual tasks, showing gains over RGB, RGB-D, point cloud, and multi-view inputs. A reader would care because precise manipulation in cluttered scenes often fails without better geometric awareness, and this method offers a practical bridge from existing 2D models to 3D understanding.

Core claim

StereoPolicy directly leverages synchronized stereo image pairs to strengthen geometric reasoning in robot policies. It employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues without requiring explicit 3D reconstruction or camera calibration. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action policies and delivers consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines.

What carries the argument

The Stereo Transformer, which fuses feature representations from independent 2D encoders applied to each image in a stereo pair to extract implicit spatial and disparity information.
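
As a concreteness aid, here is a minimal sketch of that fusion pattern in PyTorch, assuming a ViT-style patch tokenizer and a generic transformer encoder; module names, sizes, and the pooling are illustrative assumptions, not the authors' implementation:

```python
# Minimal stereo-fusion sketch: shared 2D encoder per view, then a
# transformer over the concatenated token sets. Illustrative only.
import torch
import torch.nn as nn

class StereoFusionSketch(nn.Module):
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        # Stand-in for a pretrained 2D backbone (e.g. a ViT); a toy patch
        # embedding keeps the sketch runnable without external weights.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.view_embed = nn.Parameter(torch.zeros(2, 1, dim))  # left/right tags
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, depth)  # "Stereo Transformer" stand-in

    def forward(self, left, right):
        # Each view is encoded independently by the shared 2D encoder.
        tok_l = self.patch_embed(left).flatten(2).transpose(1, 2)   # (B, N, dim)
        tok_r = self.patch_embed(right).flatten(2).transpose(1, 2)  # (B, N, dim)
        # Tag the views so attention can tell them apart, then fuse jointly.
        tokens = torch.cat([tok_l + self.view_embed[0],
                            tok_r + self.view_embed[1]], dim=1)     # (B, 2N, dim)
        fused = self.fusion(tokens)
        return fused.mean(dim=1)  # pooled feature for a downstream policy head

feat = StereoFusionSketch()(torch.randn(1, 3, 224, 224),
                            torch.randn(1, 3, 224, 224))  # -> (1, 768)
```

The load-bearing detail is what the sketch omits: there is no cost volume, epipolar constraint, or calibration input, so any correspondence or disparity signal has to emerge from attention over the joint token sequence.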

If this is right

  • Consistent gains over RGB, RGB-D, point cloud, and multi-view baselines hold across RoboMimic, RoboCasa, and OmniGibson simulation environments.
  • The same framework transfers to real-robot tabletop and bimanual mobile manipulation without additional calibration.
  • StereoPolicy combines directly with both diffusion policies and pretrained vision-language-action models.
  • Stereo vision functions as a scalable input modality that connects 2D pretrained representations to 3D geometric reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standard stereo camera rigs could replace depth sensors in many manipulation setups while retaining compatibility with existing 2D encoders.
  • The implicit fusion may generalize more readily than explicit 3D reconstruction when camera intrinsics vary across deployments.
  • Similar stereo fusion could be tested on navigation or grasping tasks where precise relative positioning matters.
  • An open extension is whether adding a lightweight explicit stereo-matching head on top of the transformer would yield further gains.

Load-bearing premise

That independent processing of each stereo view by pretrained 2D encoders followed by transformer-based fusion is sufficient to recover the needed spatial correspondence and depth cues.

What would settle it

An ablation or comparison experiment in which removing the stereo fusion step or switching to monocular input eliminates the reported performance gains on depth-critical manipulation tasks.
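
A minimal sketch of that protocol, with train_and_evaluate as a hypothetical stand-in for the full training-and-rollout loop (not code from the paper):

```python
# Settling-experiment sketch: train otherwise-identical policies under
# each input condition and compare success rates on a depth-critical task.
from statistics import mean

CONDITIONS = {
    "stereo + fusion":   dict(views=2, fuse=True),
    "stereo, no fusion": dict(views=2, fuse=False),  # e.g. feature concat only
    "monocular":         dict(views=1, fuse=False),
}

def train_and_evaluate(task, views, fuse, seed):
    """Placeholder: train a policy with this input configuration and
    return its rollout success rate in [0, 1]."""
    raise NotImplementedError

def run_ablation(task="ToolHang", seeds=(0, 1, 2)):
    return {name: mean(train_and_evaluate(task, seed=s, **cfg) for s in seeds)
            for name, cfg in CONDITIONS.items()}
```

The geometric-reasoning interpretation holds only if the first condition clearly beats the other two on depth-critical tasks; if the gap survives without fusion, the extra view alone is doing the work.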

Figures

Figures reproduced from arXiv: 2605.09989 by Evans Han, Haoyue Xiao, Huang Huang, Jiajun Wu, Jianwen Xie, Li Fei-Fei, Ruohan Zhang, Yingke Wang, Yunfan Jiang.

Figure 1. Compared to traditional visual modalities for robot learning, stereo input provides certain …
Figure 2. StereoPolicy Pipeline. Stereo inputs are encoded by a vision backbone, fused with a Stereo Transformer, and applied to both diffusion-policy training and finetuning VLA baselines.
Figure 3. Real-World Task Visualization. Top: tabletop tasks. Bottom: mobile manipulation tasks.
Figure 4. Simulation Task Visualization, from three benchmarks: OmniGibson (4 tasks), RoboCasa (24 tasks), and RoboMimic (3 tasks). The surrounding text lists 7 real-world tasks spanning tabletop manipulation (Banana PnP, Toast Insert, Cup Hang, Steel Cup Hang, Glass Cup Hang) and mobile manipulation (PnP Toast, Turn on Radio).
Figure 5. RGB-D and PCD are fragile in real: the glass cup is entirely missing. The surrounding text gives the evaluation protocol: real-robot results are averaged over 20 trials with randomized initial poses; simulation tasks use 50 rollouts every 50 training epochs (RoboMimic) or every 250 epochs (OmniGibson), reporting the highest success rate achieved across training.
Figure 7. StereoPolicy-VLA (Pi0.5) performance on bimanual mobile manipulation tasks in both real-world and simulation. The surrounding text (Q2) reports that StereoPolicy enhances pretrained VLA models even though those models are trained on monocular data, incorporating it into Pi0.5 and GR00T-N1.5 for fine-tuning.
Figure 8. Performance of StereoPolicy-DP across different camera angles. The surrounding text (Q3) reports that StereoPolicy-DP is most effective when the stereo baseline is approximately 10% of the target object distance: varying the baseline (2 cm, 6 cm, 10 cm) and camera–object distance (0.6–1.0 m) while holding other conditions fixed shows that performance is governed not by either factor alone but by their ratio r = stereo baseline / camera–object distance.
Figure 10. Vision Encoder and Component Ablation, on the ToolHang task with 100 demos. The surrounding text (Q4) reports that the choice of vision backbone significantly influences StereoPolicy-DP's performance, particularly in low-data regimes.
Figure 11. Trajectory of real-world tabletop tasks.
Figure 12. Stereo camera views across different baselines and distances. Baseline indicates the …
Figure 13. Camera angle view visualization.
Figure 14. Failure cases of baseline visual modalities on real-world tabletop tasks: RGB, RGB-D, …
Figure 15. Monocular RGB failures in bimanual mobile manipulation.
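
Figure 8's ratio finding is easy to sanity-check numerically; a quick pass over the baseline and distance values quoted above (the ~0.1 sweet spot is the paper's reported result, not derived here):

```python
# Ratio r = stereo baseline / camera-object distance for the reported
# configurations; flags pairs near the paper's ~0.1 sweet spot.
baselines_m = (0.02, 0.06, 0.10)   # 2 cm, 6 cm, 10 cm
distances_m = (0.6, 0.8, 1.0)
for b in baselines_m:
    for d in distances_m:
        r = b / d
        tag = "  <- near 0.1" if abs(r - 0.10) < 0.03 else ""
        print(f"baseline {b*100:.0f} cm, distance {d:.1f} m -> r = {r:.3f}{tag}")
```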
read the original abstract

Recent advances in robot imitation learning have yielded powerful visuomotor policies capable of manipulating a wide variety of objects directly from monocular visual inputs. However, monocular observations inherently lack reliable depth cues and spatial awareness, which are critical for precise manipulation in cluttered or geometrically complex scenes. To address this limitation, we introduce StereoPolicy, a new visuomotor policy learning framework that directly leverages synchronized stereo image pairs to strengthen geometric reasoning, without requiring explicit 3D reconstruction or camera calibration. StereoPolicy employs pretrained 2D vision encoders to process each image independently and fuses the resulting representations through a Stereo Transformer. This design implicitly captures spatial correspondence and disparity cues. The framework integrates seamlessly with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks: RoboMimic, RoboCasa, and OmniGibson. We further validate StereoPolicy on real-robot experiments spanning both tabletop and bimanual mobile manipulation settings. Our results underscore stereo vision as a scalable and robust modality that bridges 2D pretrained representations with 3D geometric understanding for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces StereoPolicy, a visuomotor policy framework for robotic manipulation that processes synchronized stereo image pairs using independent pretrained 2D vision encoders whose outputs are fused by a Stereo Transformer. This design is claimed to implicitly capture spatial correspondence and disparity cues without explicit 3D reconstruction or camera calibration. The framework integrates with diffusion-based and vision-language-action policies and is reported to deliver consistent improvements over RGB, RGB-D, point-cloud, and multi-view baselines on the RoboMimic, RoboCasa, and OmniGibson simulation benchmarks, with additional validation on real-robot tabletop and bimanual mobile manipulation tasks.

Significance. If the empirical gains are robust and can be attributed to stereo-induced geometric reasoning rather than increased input capacity, the work would offer a practical, calibration-free route to strengthen spatial awareness in existing 2D-pretrained policy architectures. This could be valuable for scaling manipulation policies in cluttered or geometrically complex scenes where monocular depth cues are insufficient.

major comments (2)
  1. [Abstract] The central claim of 'consistent improvements' across three simulation benchmarks and real-robot settings is asserted without quantitative metrics, success rates, error bars, or statistical-significance tests in the provided abstract; the full results must be examined to verify that the gains are large enough to support the geometric-reasoning interpretation.
  2. [Method] The assertion that independent 2D-pretrained encoders plus generic transformer fusion 'implicitly capture spatial correspondence and disparity cues' lacks supporting evidence such as attention-map visualizations, disparity-estimation accuracy, or an ablation that applies the identical fusion module to non-stereo multi-view pairs; without these, the reported gains could be explained by the extra synchronized view alone rather than by 3D understanding.
minor comments (2)
  1. The paper should provide explicit details on training procedures, hyperparameters, and data augmentation for both simulation and real-robot experiments to support reproducibility.
  2. Figure captions for real-robot experiments would benefit from additional description of camera setup, baseline comparisons, and failure modes observed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve the presentation of quantitative results and to provide additional supporting evidence for the Stereo Transformer's role in capturing geometric cues.

read point-by-point responses
  1. Referee: [Abstract] The central claim of 'consistent improvements' across three simulation benchmarks and real-robot settings is asserted without quantitative metrics, success rates, error bars, or statistical-significance tests in the provided abstract; the full results must be examined to verify that the gains are large enough to support the geometric-reasoning interpretation.

    Authors: We agree that the abstract would be strengthened by including concrete metrics. In the revised manuscript we have added representative success-rate improvements and consistency notes drawn from the full experimental tables. The complete results (Sections 4–5) report means and standard deviations over multiple random seeds together with direct comparisons to all baselines; the observed gains remain larger than those obtained from multi-view or RGB-D inputs, supporting the geometric-reasoning interpretation. revision: yes

  2. Referee: [Method] The assertion that independent 2D-pretrained encoders plus generic transformer fusion 'implicitly capture spatial correspondence and disparity cues' lacks supporting evidence such as attention-map visualizations, disparity-estimation accuracy, or an ablation that applies the identical fusion module to non-stereo multi-view pairs; without these, the reported gains could be explained by the extra synchronized view alone rather than by 3D understanding.

    Authors: We acknowledge that additional evidence strengthens the claim. The existing multi-view baselines already apply comparable fusion to non-stereo image pairs and yield smaller gains than stereo pairs, indicating that the benefit is not explained by input count alone. In the revision we have added attention-map visualizations (new figure in the supplement) that illustrate cross-view correspondence only when stereo pairs are used. We do not report explicit disparity-estimation accuracy because the architecture is trained end-to-end for policy performance rather than depth prediction; the policy-level ablations and real-robot results in geometrically demanding tasks serve as the primary validation. revision: partial
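
A minimal sketch of the cross-view attention check the rebuttal describes, for readers who want to probe it on their own models; the tensors are toys and all names and shapes are assumptions, not the paper's code:

```python
# Measure how much attention mass left-view queries place on right-view
# keys in one attention layer over concatenated [left | right] tokens.
import torch
import torch.nn as nn

N, dim = 196, 768                      # tokens per view, feature size
tokens = torch.randn(1, 2 * N, dim)    # [left | right] token sequence
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)  # (1, 2N, 2N)

cross = weights[0, :N, N:].mean().item()   # left -> right attention mass
within = weights[0, :N, :N].mean().item()  # left -> left attention mass
print(f"left->right {cross:.4f} vs left->left {within:.4f}")
```

On a trained stereo model, a pronounced left-to-right band along matching positions is the qualitative signature the rebuttal's supplementary visualizations would need to show.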

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on benchmarks

full rationale

The paper introduces StereoPolicy as an empirical architecture: pretrained 2D encoders process stereo pairs independently, a Stereo Transformer fuses them, and the resulting policy is trained and tested on RoboMimic, RoboCasa, and OmniGibson, plus real-robot tasks. No equations, closed-form derivations, or predictions are presented that would make the claimed gains true by construction. The implicit-capture claim is a hypothesis tested via ablations and baseline comparisons rather than a tautological redefinition. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The work is validated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that the Stereo Transformer learns useful disparity cues from pretrained 2D features.

pith-pipeline@v0.9.0 · 5522 in / 1109 out tokens · 48951 ms · 2026-05-12T03:52:55.765366+00:00 · methodology

