Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing; Krystian Mikolajczyk; Weixun Luo; Ye Mao

arxiv: 2511.16555 · v3 · submitted 2025-11-20 · 💻 cs.CV

Pith reviewed 2026-05-17 20:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords stereo matching · zero-shot learning · efficient models · depth estimation · computer vision · generalization

The pith

An ultra-light stereo matching model matches or exceeds the accuracy of much larger methods on real-world benchmarks while using under 1% of their computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lite Any Stereo as a stereo depth estimation system that delivers strong performance on unseen real data despite its small size. It combines a compact expressive backbone with a hybrid cost aggregation module and uses a three-stage training process on million-scale datasets to overcome the usual limitations of lightweight models in generalizing from simulation to reality. This challenges the view that efficient models lack zero-shot capability, showing they can rank first on multiple benchmarks.

Core claim

By designing a compact backbone and hybrid cost aggregation module, and applying a three-stage training strategy on large-scale data, an ultra-light architecture can achieve top accuracy in zero-shot stereo matching across real-world benchmarks while consuming less than 1% of the computational resources of state-of-the-art accurate methods.

What carries the argument

The Lite Any Stereo framework, centered on its compact yet expressive backbone and hybrid cost aggregation module, which together enable efficient processing and effective generalization.
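
To make the composition concrete, here is a toy sketch in pure Python of the ordering C_agg = G_2D(G_3D(C)) that this page quotes from the paper further down. Everything beyond that ordering is an assumption made for illustration: the equality-based matching score, the fixed [1, 2, 1] smoothing taps standing in for the learned G_3D and G_2D, the winner-take-all readout, and all function names. It is not the authors' implementation.

```python
def matching_volume(left, right, max_disp):
    """C[d][y][x] = 1.0 where left[y][x] equals right[y][x - d], else 0.0 (toy score)."""
    h, w = len(left), len(left[0])
    return [[[1.0 if x - d >= 0 and left[y][x] == right[y][x - d] else 0.0
              for x in range(w)] for y in range(h)] for d in range(max_disp)]


def smooth_axis(vol, axis):
    """Normalised [1, 2, 1] smoothing of vol[d][y][x] along one axis (0=d, 1=y, 2=x)."""
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    out = [[[0.0] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    for d in range(dims[0]):
        for y in range(dims[1]):
            for x in range(dims[2]):
                idx = [d, y, x]
                total, weight = 0.0, 0.0
                for offset, tap in ((-1, 1.0), (0, 2.0), (1, 1.0)):
                    pos = idx[axis] + offset
                    if 0 <= pos < dims[axis]:  # skip taps that fall outside the volume
                        j = list(idx)
                        j[axis] = pos
                        total += tap * vol[j[0]][j[1]][j[2]]
                        weight += tap
                out[d][y][x] = total / weight
    return out


def g3d(vol):
    """Stand-in for G_3D: smooth over disparity and both spatial axes."""
    for axis in (0, 1, 2):
        vol = smooth_axis(vol, axis)
    return vol


def g2d(vol):
    """Stand-in for G_2D: treat disparity as channels and smooth spatially only."""
    for axis in (1, 2):
        vol = smooth_axis(vol, axis)
    return vol


def winner_take_all(vol):
    """Per pixel, pick the disparity with the highest aggregated score."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    return [[max(range(D), key=lambda d: vol[d][y][x]) for x in range(W)]
            for y in range(H)]


# Toy pair: right-image pixels are unique within each row; left is right shifted by 2 px.
right = [[10 * y + x for x in range(6)] for y in range(3)]
left = [[right[y][x - 2] if x >= 2 else -1 for x in range(6)] for y in range(3)]

cost = matching_volume(left, right, max_disp=5)
disparity = winner_take_all(g2d(g3d(cost)))  # C_agg = G_2D(G_3D(C))
```

On this toy pair every pixel resolves to a disparity of 2, including the two left-border columns that have no match of their own; they pick up evidence from their neighbours through the spatial smoothing.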

If this is right

Lightweight models become viable for accurate zero-shot stereo depth estimation in practical settings.
Computational costs for high-performance stereo matching drop dramatically, enabling deployment on resource-limited devices.
The three-stage training approach demonstrates a scalable way to bridge simulation-to-real gaps in depth estimation.
Non-prior-based methods can now prioritize efficiency without major accuracy trade-offs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that similar efficiency-focused designs could apply to related tasks such as optical flow estimation.
Future work might test if even smaller models or different training scales yield comparable results.
Adoption could shift industry standards toward lightweight architectures for real-time 3D perception.

Load-bearing premise

The three-stage training strategy on million-scale data bridges the sim-to-real gap for the ultra-light model without relying on hidden tuning specific to the evaluation benchmarks.

What would settle it

Evaluating the model on a new, unseen real-world stereo dataset where it fails to rank highly or loses its accuracy advantage over heavier models would challenge the claim.
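
A check of this kind could be scored roughly as follows. The sketch assumes two conventional disparity metrics, end-point error (mean absolute error in pixels) and a bad-pixel rate at a 2 px threshold; the paper's exact evaluation protocol is not reproduced on this page, and the method names and numbers below are made up purely to exercise the code.

```python
def valid_errors(pred, gt):
    """Absolute disparity errors (pixels) at positions that have ground truth."""
    return [abs(p - g) for p, g in zip(pred, gt) if g is not None]


def end_point_error(pred, gt):
    """Mean absolute disparity error over valid pixels."""
    errs = valid_errors(pred, gt)
    return sum(errs) / len(errs)


def bad_pixel_rate(pred, gt, threshold=2.0):
    """Percentage of valid pixels whose error exceeds `threshold` pixels."""
    errs = valid_errors(pred, gt)
    return 100.0 * sum(e > threshold for e in errs) / len(errs)


def rank_methods(predictions, gt, metric):
    """Method names sorted from lowest (best) to highest error."""
    return sorted(predictions, key=lambda name: metric(predictions[name], gt))


# Hypothetical flattened disparity maps; None marks pixels without ground truth.
gt = [10.0, 20.0, None, 30.0, 40.0]
predictions = {
    "light_model": [10.5, 19.0, 5.0, 30.0, 43.0],
    "heavy_model": [11.0, 22.5, 0.0, 30.5, 40.0],
}
ranking = rank_methods(predictions, gt, end_point_error)
```

With these invented numbers the heavier model ranks first on end-point error while both share the same bad-pixel rate, which is the shape of outcome that, on a genuinely unseen real dataset, would count against the claim.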

Figures

Figures reproduced from arXiv: 2511.16555 by Junpeng Jing, Krystian Mikolajczyk, Weixun Luo, Ye Mao.

**Figure 1.** Zero-shot prediction on in-the-wild stereo images. The proposed method achieves accurate disparity estimation across diverse …

**Figure 2.** Zero-shot performance. Our method achieves SOTA by …

**Figure 3.** Overview of the proposed Lite Any Stereo. Given an input stereo image pair, features are first extracted using a shared-weight …

**Figure 4.** Overview of the proposed three-stage training strategy.

**Figure 5.** Design choices for hybrid cost aggregation module.

**Figure 6.** Effects of the proposed three stages training strategy.

**Figure 7.** Qualitative comparison of zero-shot inference on in-the-wild images. Each column shows disparity predictions from different …

Original abstract

Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Lite Any Stereo, an ultra-light stereo depth estimation framework featuring a compact yet expressive backbone and a hybrid cost aggregation module. It employs a three-stage training strategy on million-scale data to bridge the sim-to-real gap, claiming to rank first on four widely used real-world benchmarks while achieving accuracy comparable to or exceeding state-of-the-art non-prior-based methods at less than 1% computational cost.

Significance. If the results hold under rigorous validation, the work would demonstrate that deliberately low-capacity architectures can deliver strong zero-shot generalization in stereo matching through architecture design and large-scale training, potentially redefining efficiency-accuracy trade-offs and enabling deployment on resource-limited devices. The explicit focus on million-scale synthetic pretraining for sim-to-real transfer is a notable strength if supported by ablations.

major comments (2)

[Abstract and §4] Abstract and experimental section: the central claim of 1st-place zero-shot rankings and comparable accuracy to heavier SOTA methods is presented without error bars, ablation details on training-data composition, or explicit comparison tables; this makes it impossible to verify whether benchmark choices or data exclusions affect the reported advantage for the ultra-light model.
[§3] §3 (three-stage training): the strategy is load-bearing for the sim-to-real claim, yet the description provides no quantitative evidence such as held-out real validation sets or ablations ruling out per-benchmark hyperparameter search or post-training selection; for a low-capacity backbone this is required to establish that the <1% compute advantage is reproducible on truly unseen real data.
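
The error bars requested in the first major comment could be produced along these lines. This is a minimal sketch using Python's `statistics` module; the per-seed scores are hypothetical placeholders, and the one-sigma separation test is just one simple convention, not something the paper or the report prescribes.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric across training seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def separated(better, worse):
    """True if the one-sigma intervals do not overlap (lower score is better)."""
    mean_b, std_b = summarize_runs(better)
    mean_w, std_w = summarize_runs(worse)
    return mean_b + std_b < mean_w - std_w


# Hypothetical end-point errors from three seeds per method.
runs = {
    "light_model": [1.10, 1.20, 1.30],
    "heavy_model": [1.15, 1.25, 1.35],
}
for name, scores in runs.items():
    mean, std = summarize_runs(scores)
    print(f"{name}: {mean:.2f} ± {std:.2f} (n={len(scores)})")
```

With these placeholder numbers the two intervals overlap, so a first-place ranking based on the means alone would not be distinguishable from seed noise.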

minor comments (2)

[§2] Clarify notation for the hybrid cost aggregation module and ensure all equations are numbered consistently with references in the text.
[Table in §4] Add a table summarizing compute (FLOPs or runtime) alongside accuracy metrics for all compared methods to support the <1% claim.
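
A summary table of the kind requested in the last minor comment might be assembled as below. The sketch assumes compute is reported in giga-MACs and accuracy as end-point error; every number and method name is a hypothetical placeholder rather than a figure from the paper.

```python
def compute_share_percent(candidate_cost, reference_cost):
    """Candidate compute as a percentage of the reference method's compute."""
    return 100.0 * candidate_cost / reference_cost


def summary_rows(methods, reference):
    """One (name, gmacs, share_percent, epe) row per method, relative to `reference`."""
    ref_cost = methods[reference]["gmacs"]
    return [
        (name, m["gmacs"], compute_share_percent(m["gmacs"], ref_cost), m["epe"])
        for name, m in methods.items()
    ]


# Hypothetical costs and accuracies, for illustration only.
methods = {
    "light_model": {"gmacs": 30.0, "epe": 1.10},
    "heavy_model": {"gmacs": 4000.0, "epe": 1.05},
}
rows = summary_rows(methods, reference="heavy_model")
for name, gmacs, share, epe in rows:
    print(f"{name:<12} {gmacs:>8.1f} GMACs {share:>7.2f}%  EPE {epe:.2f}")
```

Reporting the share column next to the accuracy column lets a reader verify a "less than 1%" statement row by row instead of taking it from the abstract.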

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the experimental validation and clarity of our claims. We address each major point below and have incorporated revisions to improve transparency and rigor.

Point-by-point responses

Referee: [Abstract and §4] Abstract and experimental section: the central claim of 1st-place zero-shot rankings and comparable accuracy to heavier SOTA methods is presented without error bars, ablation details on training-data composition, or explicit comparison tables; this makes it impossible to verify whether benchmark choices or data exclusions affect the reported advantage for the ultra-light model.

Authors: We agree that additional statistical and compositional details would aid verification. In the revised manuscript we have added error bars computed across multiple independent training runs with different random seeds to all reported rankings and accuracy figures. We have also inserted a dedicated ablation subsection on training-data composition, quantifying the contribution of each synthetic source to zero-shot performance. Section 4 now contains expanded comparison tables that list all competing methods alongside their compute costs, accuracies on the four benchmarks, and explicit notes on any data exclusions or benchmark usage.

revision: yes

Referee: [§3] §3 (three-stage training): the strategy is load-bearing for the sim-to-real claim, yet the description provides no quantitative evidence such as held-out real validation sets or ablations ruling out per-benchmark hyperparameter search or post-training selection; for a low-capacity backbone this is required to establish that the <1% compute advantage is reproducible on truly unseen real data.

Authors: We acknowledge that quantitative support for the training strategy strengthens the sim-to-real claims. The revised Section 3 now reports results on held-out real validation subsets drawn from the benchmarks, confirming consistent transfer without per-benchmark adaptation. We have added ablations that vary the training stages while keeping hyperparameters fixed across all evaluations, demonstrating that gains arise from the staged curriculum rather than post-training selection or benchmark-specific tuning. These changes support reproducibility of the efficiency advantage on truly unseen real data.

revision: yes

Circularity Check

0 steps flagged

Empirical results on external benchmarks; no load-bearing reduction to self-defined quantities or self-citations

full rationale

The paper describes an ultra-light backbone, hybrid cost aggregation, and three-stage training on million-scale data, then reports rankings on four real-world benchmarks. No equations or derivations are presented that reduce by construction to fitted parameters or prior self-citations. The zero-shot generalization claim is supported by external benchmark evaluation rather than internal redefinition or renaming of known results. This matches the low circularity expected for an empirical architecture paper whose central claims remain falsifiable on held-out real data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of deep stereo matching plus empirical training choices whose details are not visible in the abstract.

free parameters (1)

three-stage training hyperparameters
Learning rates, stage durations, and data mixing ratios chosen to bridge sim-to-real gap on million-scale data.
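
For orientation, the staged schedule can be written down as a small data structure. The stage names and the two dataset sizes (1.8 M synthetic pairs, 0.5 M real pairs) come from the paper passage quoted elsewhere on this page; the size of the self-distillation stage is not stated here and is left as `None`. The descriptions of where targets come from follow the usual meaning of the terms, and the L1 loss is an assumed stand-in, since the actual loss is not given on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    target_source: str        # where the training targets come from
    num_pairs: Optional[int]  # None where this page states no size


STAGES = [
    Stage("stage1_supervised", "synthetic ground truth", 1_800_000),
    Stage("stage2_self_distillation", "the model's own predictions", None),
    Stage("stage3_knowledge_distillation", "a teacher's predictions on real pairs", 500_000),
]


def known_pairs(stages):
    """Total stereo pairs across stages whose size is stated."""
    return sum(s.num_pairs for s in stages if s.num_pairs is not None)


def l1_loss(pred, target):
    """Assumed stand-in loss: mean absolute disparity difference."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


for stage in STAGES:
    size = f"{stage.num_pairs:,}" if stage.num_pairs is not None else "not stated"
    print(f"{stage.name}: targets from {stage.target_source}; pairs: {size}")
print("stated total:", f"{known_pairs(STAGES):,}")
```

The stated sizes alone add up to 2.3 M pairs, which is consistent with the "million-scale" wording; learning rates, stage durations, and mixing ratios remain the free parameters this ledger entry refers to.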

axioms (1)

domain assumption: Stereo matching can be solved via learned cost-volume aggregation in a CNN backbone
Invoked implicitly by the design of the hybrid cost aggregation module.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: hybrid cost aggregation module that jointly leverages 2D and 3D representations... C_agg = G_2D(G_3D(C))

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: three-stage training strategy on million-scale data... Stage① supervised on 1.8 M synthetic, Stage② self-distillation, Stage③ knowledge distillation on 0.5 M real pairs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 2 internal anchors

[1]

Correlate-and- excite: Real-time stereo matching via guided cost volume excitation

Antyanta Bangunharcana, Jae Won Cho, Seokju Lee, In So Kweon, Kyung-Soo Kim, and Soohyun Kim. Correlate-and- excite: Real-time stereo matching via guided cost volume excitation. In2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3542–3548. IEEE, 2021. 2, 6, 8

work page 2021
[2]

Instereo2k: a large real dataset for stereo matching in indoor scenes.Science China Information Sci- ences, 63(11):1–11, 2020

Wei Bao, Wei Wang, Yuhua Xu, Yulan Guo, Siyu Hong, and Xiaohu Zhang. Instereo2k: a large real dataset for stereo matching in indoor scenes.Science China Information Sci- ences, 63(11):1–11, 2020. 4

work page 2020
[3]

Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail.arXiv preprint arXiv:2412.04472, 2024

Luca Bartolomei, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail.arXiv preprint arXiv:2412.04472, 2024. 2

work page arXiv 2024
[4]

Uasol, a large- scale high-resolution outdoor stereo dataset.Scientific data, 6(1):162, 2019

Zuria Bauer, Francisco Gomez-Donoso, Edmanuel Cruz, Sergio Orts-Escolano, and Miguel Cazorla. Uasol, a large- scale high-resolution outdoor stereo dataset.Scientific data, 6(1):162, 2019. 4

work page 2019
[5]

D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. InECCV, pages 611–625, 2012. 4

work page 2012
[6]

Vir- tual kitti 2, 2020

Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- tual kitti 2, 2020. 4, 5, 7

work page 2020
[7]

Pyramid stereo matching network

Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. InCVPR, pages 5410–5418, 2018. 2

work page 2018
[8]

Domain generalized stereo matching via hierarchical visual transformation

Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang. Domain generalized stereo matching via hierarchical visual transformation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9559– 9568, 2023. 2

work page 2023
[9]

Monster: Marry monodepth to stereo unleashes power, 2025

Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, and Xin Yang. Monster: Marry monodepth to stereo unleashes power, 2025. 1, 2, 3

work page 2025
[10]

Hierarchical neural architecture search for deep stereo matching.arXiv preprint arXiv:2010.13501,

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Tom Drummond, Hongdong Li, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching.arXiv preprint arXiv:2010.13501,

work page arXiv 2010
[11]

Itsa: An information-theoretic approach to automatic shortcut avoid- ance and domain generalization in stereo matching networks

WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Alireza Bab-Hadiashar, and David Suter. Itsa: An information-theoretic approach to automatic shortcut avoid- ance and domain generalization in stereo matching networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13022–13032, 2022. 2

work page 2022
[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 3

work page 2009
[13]

Deeppruner: Learning efficient stereo matching via differentiable patchmatch

Shivam Duggal, Shenlong Wang, Wei-Chiu Ma, Rui Hu, and Raquel Urtasun. Deeppruner: Learning efficient stereo matching via differentiable patchmatch. InProceedings of the IEEE/CVF international conference on computer vision, pages 4384–4393, 2019. 2, 8

work page 2019
[14]

Are we ready for autonomous driving? the kitti vision benchmark suite

Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. InCVPR, pages 3354–3361, 2012. 1, 2, 3, 5, 6, 7, 8

work page 2012
[15]

Cascade cost volume for high-resolution multi-view stereo and stereo matching

Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495–2504, 2020. 2

work page 2020
[16]

Context-enhanced stereo transformer

Weiyu Guo, Zhaoshuo Li, Yongkui Yang, Zheng Wang, Rus- sell H Taylor, Mathias Unberath, Alan Yuille, and Yingwei Li. Context-enhanced stereo transformer. InEuropean Con- ference on Computer Vision, pages 263–279. Springer, 2022. 2

work page 2022
[17]

Group-wise correlation stereo network

Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In CVPR, pages 3273–3282, 2019. 2

work page 2019
[18]

Openstereo: A comprehensive benchmark for stereo matching and strong baseline,

Xianda Guo, Chenming Zhang, Juntao Lu, Yiqi Wang, Yiqun Duan, Tian Yang, Zheng Zhu, and Long Chen. Openstereo: A comprehensive benchmark for stereo matching and strong baseline.arXiv preprint arXiv:2312.00343, 2023. 2

work page arXiv 2023
[19]

Stereo anything: Unifying stereo matching with large-scale mixed data,

Xianda Guo, Chenming Zhang, Youmin Zhang, Dujun Nie, Ruilin Wang, Wenzhao Zheng, Matteo Poggi, and Long Chen. Stereo anything: Unifying stereo matching with large- scale mixed data.arXiv preprint arXiv:2411.14053, 2024. 2, 6

work page arXiv 2024
[20]

Light- stereo: Channel boost is all your need for efficient 2d cost aggregation, 2024

Xianda Guo, Chenming Zhang, Youmin Zhang, Wenzhao Zheng, Dujun Nie, Matteo Poggi, and Long Chen. Light- stereo: Channel boost is all your need for efficient 2d cost aggregation, 2024. 1, 2, 3, 4, 6, 8

work page 2024
[21]

Holopix50k: A large-scale in-the-wild stereo image dataset

Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco, and Edward Li. Holopix50k: A large-scale in-the-wild stereo image dataset. InCVPR Workshop on Computer Vision for Augmented and Virtual Reality, Seattle, WA, 2020., 2020. 4

work page 2020
[22]

Defom-stereo: Depth foundation model based stereo matching, 2025

Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. Defom-stereo: Depth foundation model based stereo matching, 2025. 1, 2, 3

work page 2025
[23]

Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint,

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint,

work page
[24]

Uncertainty guided adaptive warping for robust and efficient stereo matching

Junpeng Jing, Jiankun Li, Pengfei Xiong, Jiangyu Liu, Shuaicheng Liu, Yichen Guo, Xin Deng, Mai Xu, Lai Jiang, and Leonid Sigal. Uncertainty guided adaptive warping for robust and efficient stereo matching. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3318–3327, 2023. 2, 3, 6, 8

work page 2023
[25]

Match- stereo-videos: Bidirectional alignment for consistent dy- namic stereo matching

Junpeng Jing, Ye Mao, and Krystian Mikolajczyk. Match- stereo-videos: Bidirectional alignment for consistent dy- namic stereo matching. InEuropean Conference on Com- puter Vision, pages 415–432. Springer, 2024. 2

work page 2024
[26]

Match stereo videos via bidirectional alignment,

Junpeng Jing, Ye Mao, Anlan Qiu, and Krystian Miko- lajczyk. Match stereo videos via bidirectional alignment,

work page
[27]

Stereo any video: Temporally consistent stereo match- ing, 2025

Junpeng Jing, Weixun Luo, Ye Mao, and Krystian Mikola- jczyk. Stereo any video: Temporally consistent stereo match- ing, 2025. 9

work page 2025
[28]

Dy- namicstereo: Consistent dynamic depth from stereo videos

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- namicstereo: Consistent dynamic depth from stereo videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023. 2, 4

work page 2023
[29]

Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction

Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. InECCV, pages 573–590, 2018. 2

work page 2018
[30]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Practical stereo matching via cascaded recurrent net- work with adaptive correlation

Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. Practical stereo matching via cascaded recurrent net- work with adaptive correlation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16263–16272, 2022. 1, 2, 4, 5, 7

work page 2022
[32]

Revisiting stereo depth estimation from a sequence- to-sequence perspective with transformers

Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, Francis X Creighton, Russell H Taylor, and Mathias Un- berath. Revisiting stereo depth estimation from a sequence- to-sequence perspective with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 6197–6206, 2021. 2

work page 2021
[33]

Learning for disparity estimation through feature constancy

Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. InCVPR, pages 2811–2820, 2018. 2

work page 2018
[34]

Raft-stereo: Multilevel recurrent field transforms for stereo matching

Lahav Lipson, Zachary Teed, and Jia Deng. Raft-stereo: Multilevel recurrent field transforms for stereo matching. arXiv preprint arXiv:2109.07547, 2021. 1, 2

work page arXiv 2021
[35]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,

work page
[36]

Cooperative computation of stereo disparity

D Marr and T Poggio. Cooperative computation of stereo disparity. InNeurocomputing: foundations of research, pages 259–267. 1988. 1

work page 1988
[37]

A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. InCVPR, pages 4040–4048, 2016. 4, 5, 6, 7, 8

work page 2016
[38]

Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nali- vayko, and Andr ´es Bruhn. Spring: A high-resolution high- detail dataset and benchmark for scene flow, optical flow and stereo. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 4

work page 2023
[39]

Object scene flow for au- tonomous vehicles

Moritz Menze and Andreas Geiger. Object scene flow for au- tonomous vehicles. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 1, 2, 3, 5, 6, 7, 8

work page 2015
[40]

Confidence aware stereo matching for realistic cluttered scenario

Junhong Min and Youngpil Jeon. Confidence aware stereo matching for realistic cluttered scenario. In2024 IEEE In- ternational Conference on Image Processing (ICIP), pages 3491–3497. IEEE, 2024. 5

work page 2024
[41]

Cascade residual learning: A two-stage con- volutional neural network for stereo matching

Jiahao Pang, Wenxiu Sun, Jimmy SJ Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage con- volutional neural network for stereo matching. InCVPRW, pages 887–895, 2017. 2

work page 2017
[42]

Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging.SIAM journal on control and optimization, 30(4):838–855, 1992. 5

work page 1992
[43]

Masked representation learn- ing for domain generalized stereo matching

Zhibo Rao, Bangshu Xiong, Mingyi He, Yuchao Dai, Renjie He, Zhelun Shen, and Xing Li. Masked representation learn- ing for domain generalized stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 5435–5444, 2023. 2

work page 2023
[44]

Mobilenetv2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. InProceedings of the IEEE conference on computer vision and pattern recogni- tion, pages 4510–4520, 2018. 3

work page 2018
[45]

A taxonomy and evaluation of dense two-frame stereo correspondence algo- rithms.IJCV, 47(1):7–42, 2002

Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algo- rithms.IJCV, 47(1):7–42, 2002. 1, 2, 5, 6, 7

work page 2002
[46]

A multi-view stereo benchmark with high- resolution images and multi-camera videos

Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and An- dreas Geiger. A multi-view stereo benchmark with high- resolution images and multi-camera videos. InCVPR, pages 3260–3269, 2017. 1, 2, 5, 6, 7

work page 2017
[47]

Mobilestereonet: Towards lightweight deep net- works for stereo matching

Faranak Shamsafar, Samuel Woerz, Rafia Rahim, and An- dreas Zell. Mobilestereonet: Towards lightweight deep net- works for stereo matching. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2417–2426, 2022. 1, 3, 4, 6, 8

work page 2022
[48]

Cfnet: Cascade and fused cost volume for robust stereo matching

Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade and fused cost volume for robust stereo matching. InCVPR, pages 13906–13915, 2021. 2

work page 2021
[49]

Chitransformer: Towards reliable stereo from cues

Qing Su and Shihao Ji. Chitransformer: Towards reliable stereo from cues. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1939–1949, 2022. 2

work page 1939
[50]

Continuous 3d label stereo matching us- ing local expansion moves.IEEE TPAMI, 40(11):2725– 2739, 2017

Tatsunori Taniai, Yasuyuki Matsushita, Yoichi Sato, and Takeshi Naemura. Continuous 3d label stereo matching us- ing local expansion moves.IEEE TPAMI, 40(11):2725– 2739, 2017. 1

work page 2017
[51]

Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching

Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierar- chical iterative tile refinement network for real-time stereo matching. InCVPR, pages 14362–14372, 2021. 2, 3, 8

work page 2021
[52]

Falling things: A synthetic dataset for 3d object detection and pose estimation

Jonathan Tremblay, Thang To, and Stan Birchfield. Falling things: A synthetic dataset for 3d object detection and pose estimation. InCVPRW, pages 2038–2041, 2018. 4, 5, 7

work page 2038
[53]

Sparsity invariant cnns

Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In2017 international conference on 3D Vision (3DV), pages 11–20. IEEE, 2017. 8

work page 2017
[54]

Irs: A large naturalistic indoor robotics stereo dataset to train deep models for dis- parity and surface normal estimation,

Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. Irs: A large naturalis- tic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation.arXiv preprint arXiv:1912.09678, 2019. 4 10

work page arXiv 1912
[55]

FADNet: A fast and accurate network for disparity estimation

Qiang Wang, Shaohuai Shi, Shizhen Zheng, Kaiyong Zhao, and Xiaowen Chu. FADNet: A fast and accurate network for disparity estimation. In2020 IEEE International Conference on Robotics and Automation (ICRA 2020), pages 101–107,

work page 2020
[56]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se- bastian Scherer. Tartanair: A dataset to push the limits of visual slam. 2020. 4

work page 2020
[57]

Selective-stereo: Adaptive frequency information selection for stereo matching

Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 19701–19710, 2024. 1, 2, 6, 7

work page 2024
[58]

Flickr1024: A large-scale dataset for stereo image super-resolution

Yingqian Wang, Longguang Wang, Jungang Yang, Wei An, and Yulan Guo. Flickr1024: A large-scale dataset for stereo image super-resolution. InInternational Conference on Computer Vision Workshops, pages 3852–3857, 2019. 4

work page 2019
[59]

Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Br ´egier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and J ´erˆome Revaud. Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 17969–17...

work page 2023
[60]

Foundationstereo: Zero- shot stereo matching, 2025

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero- shot stereo matching, 2025. 1, 2, 3, 4, 5, 6, 7

work page 2025
[61]

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16133–16142, 2023. 3
[62]

Ke Xian, Jianming Zhang, Oliver Wang, Long Mai, Zhe Lin, and Zhiguo Cao. Structure-guided ranking loss for single image depth prediction. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. 5
[63]

Bin Xu, Yuhua Xu, Xiaoli Yang, Wei Jia, and Yulan Guo. Bilateral grid learning for stereo matching networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12497–12506, 2021. 2, 8
[64]

Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12981–12990, 2022. 2, 6, 8
[65]

Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21919–21928, 2023. 2
[66]

Gangwei Xu, Yun Wang, Junda Cheng, Jinhui Tang, and Xin Yang. Accurate and efficient stereo matching via attention concatenation volume. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 1, 2, 6, 8
[67]

Gangwei Xu, Jiaxin Liu, Xianqi Wang, Junda Cheng, Yong Deng, Jinliang Zang, Yurui Chen, and Xin Yang. Banet: Bilateral aggregation network for mobile stereo matching. arXiv preprint arXiv:2503.03259, 2025. 1, 3, 4, 5, 6, 8
[68]

Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In CVPR, pages 1959–1968, 2020. 2, 3, 8
[69]

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. 2
[70]

Gengshan Yang, Joshua Manela, Michael Happold, and Deva Ramanan. Hierarchical deep stereo matching on high-resolution images. In CVPR, pages 5515–5524, 2019. 2
[71]

Guorun Yang, Xiao Song, Chaoqin Huang, Zhidong Deng, Jianping Shi, and Bolei Zhou. Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 899–908, 2019. 4, 5, 6
[72]

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024. 1, 2, 3
[73]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv preprint arXiv:2406.09414, 2024. 1, 2, 3
[74]

Chengtang Yao, Yunde Jia, Huijun Di, Pengxiang Li, and Yuwei Wu. A decomposition model for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6091–6100, 2021. 8
[75]

Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In CVPR, pages 185–194, 2019. 2
[76]

Jiawei Zhang, Xiang Wang, Xiao Bai, Chen Wang, Lei Huang, Yimin Chen, Lin Gu, Jun Zhou, Tatsuya Harada, and Edwin R Hancock. Revisiting domain generalized stereo matching networks from a feature consistency perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13001–13011, 2022. 2
[77]

Yongjian Zhang, Longguang Wang, Kunhong Li, Yun Wang, and Yulan Guo. Learning representations from foundation models for domain generalized stereo matching. In European Conference on Computer Vision, pages 146–162. Springer,
[78]

Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, and Yangyang Zhang. All-in-one: Transferring vision foundation models into stereo matching. arXiv preprint arXiv:2412.09912, 2024. 2
