pith. machine review for the scientific record.

arxiv: 2604.10218 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised stereo matching · vision foundation models · data augmentation · stereo matching · disparity estimation · feature pyramid network · consistency regularization · depth estimation

The pith

SMFormer uses vision foundation models and data augmentation to let self-supervised stereo matching compete with supervised methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the accuracy shortfall in self-supervised stereo matching, where the usual assumption that corresponding left and right image points share identical appearance often fails under real-world lighting changes and other disturbances. It does this by feeding features from a vision foundation model through a feature pyramid network for more stable representations, then adding a data-augmentation step that forces the network to produce consistent features and disparity maps between standard and transformed views. A reader would care because labeled training data for stereo depth is expensive to collect, so a method that works nearly as well without labels could scale to more applications. If the approach holds, self-supervised models become viable alternatives for depth estimation in varied environments where supervised training is impractical.

Core claim

SMFormer incorporates a vision foundation model together with a feature pyramid network to supply discriminative features that resist disturbances. It adds a data-augmentation procedure that explicitly enforces consistency between features extracted from standard and illumination-altered samples and that regularizes disparity outputs between strongly augmented inputs and their unaugmented counterparts. On standard benchmarks this yields state-of-the-art accuracy among self-supervised stereo methods and performance on par with supervised approaches, including better results than some supervised baselines on the Booster dataset.
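
Read as training objectives, the two regularizers have a compact form. Below is a minimal PyTorch sketch under assumed interfaces — `net`, `photo_aug`, and `strong_aug` are illustrative stand-ins, not the paper's API — with the standard branch detached as a pseudo-teacher, one common way to realize such consistency terms.

```python
# Minimal sketch of the two consistency regularizers, under assumptions:
# `net(left, right)` returns (features, disparity); `photo_aug` perturbs
# illumination only; `strong_aug` is a heavier photometric transform.
import torch.nn.functional as F

def consistency_losses(net, left, right, photo_aug, strong_aug):
    """left, right: (B, 3, H, W) rectified stereo pair."""
    feat_std, disp_std = net(left, right)

    # Feature consistency: representations of illumination-altered inputs
    # should match those of the standard branch.
    feat_aug, _ = net(photo_aug(left), photo_aug(right))
    loss_feat = F.l1_loss(feat_aug, feat_std.detach())

    # Output consistency: disparity from strongly augmented inputs should
    # match the standard-branch prediction (detached as a pseudo-teacher).
    _, disp_aug = net(strong_aug(left), strong_aug(right))
    loss_disp = F.smooth_l1_loss(disp_aug, disp_std.detach())

    return loss_feat, loss_disp
```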

What carries the argument

Vision foundation model integrated with a feature pyramid network for robust feature extraction, plus a data-augmentation pipeline that enforces feature and disparity consistency across transformations.
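
For concreteness, the sketch below wires a frozen pretrained encoder into torchvision's FeaturePyramidNetwork. A ResNet-50 stands in for the foundation model purely for illustration; the paper pairs its FPN with SAM/DINOv2-class VFMs, so none of this should be read as the authors' architecture.

```python
# Hedged sketch: a frozen pretrained encoder feeding a feature pyramid
# network. The ResNet-50 backbone is an illustrative stand-in for a VFM.
import torch
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class FrozenBackboneFPN(nn.Module):
    def __init__(self, out_channels=128):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.body = create_feature_extractor(
            backbone, return_nodes={"layer1": "p2", "layer2": "p3", "layer3": "p4"})
        for p in self.body.parameters():   # keep the pretrained prior intact
            p.requires_grad = False
        self.fpn = FeaturePyramidNetwork([256, 512, 1024], out_channels)

    def forward(self, x):
        with torch.no_grad():
            feats = self.body(x)
        return self.fpn(feats)             # dict of multi-scale features

pyr = FrozenBackboneFPN()(torch.randn(1, 3, 256, 512))
print({k: tuple(v.shape) for k, v in pyr.items()})
```

Freezing the encoder preserves the pretrained prior while only the pyramid (and whatever cost aggregation sits downstream) trains.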

If this is right

  • Self-supervised stereo matching can reach accuracy levels previously attainable only with ground-truth disparity labels.
  • The approach improves handling of illumination changes and other real-world variations without extra supervision.
  • Gains appear across multiple benchmarks, with particular strength on difficult cases such as Booster.
  • Consistency regularization between augmented and standard samples becomes a reliable training signal for disparity networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same foundation-model-plus-consistency pattern could transfer to other geometric tasks that currently rely on photometric assumptions, such as optical flow estimation.
  • Reducing dependence on labeled stereo pairs would lower the barrier to deploying depth systems in new domains where annotation is costly.
  • Testing the framework with different foundation-model backbones or additional geometric augmentations would reveal how much of the gain is tied to the specific model chosen.
  • If the method scales to video sequences, it could support self-supervised video depth estimation with temporal consistency added as another regularization term; a sketch of such a term follows this list.
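
A speculative sketch of that temporal term, assuming per-frame disparities and an optical-flow field are available; nothing below appears in the paper.

```python
# Speculative temporal-consistency term: warp the previous frame's disparity
# into the current frame with optical flow and penalize disagreement.
# `flow` convention (pixel offsets from frame t to t-1) is an assumption.
import torch
import torch.nn.functional as F

def temporal_consistency(disp_t, disp_prev, flow):
    """disp_*: (B, 1, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = disp_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                # where each pixel lands in t-1
    gx = 2 * coords[:, 0] / (w - 1) - 1                 # normalize for grid_sample
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                # (B, H, W, 2)
    warped = F.grid_sample(disp_prev, grid, align_corners=True)
    return F.l1_loss(warped, disp_t)
```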

Load-bearing premise

The vision foundation model combined with the feature pyramid network produces features robust enough to withstand real-world disturbances, and the data-augmentation step enforces consistency without introducing biases that lower disparity accuracy.
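
One way to honor that premise is to keep the strong-augmentation branch purely photometric, so pixel geometry (and hence the implicit disparity target) is never displaced. A minimal torchvision policy, with parameter values that are assumptions rather than the paper's settings:

```python
# Illustrative geometry-preserving augmentation: photometric transforms only,
# so pixel-to-pixel correspondence is untouched. Parameters are assumptions.
import torch
from torchvision.transforms import v2

photometric_strong = v2.Compose([
    v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    v2.RandomGrayscale(p=0.2),
    v2.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

img = torch.rand(3, 256, 512)       # a fake left image in [0, 1]
aug = photometric_strong(img)       # same shape, same geometry
assert aug.shape == img.shape
```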

What would settle it

Run SMFormer on a new stereo dataset with extreme, unmodeled disturbances: if its error rates remain substantially higher than those of leading supervised methods, the central claim does not hold.
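
Operationally, that test is a bad-pixel-rate comparison. The sketch below scores a D1-style metric (error above 3 px and above 5% of ground truth); `model` and `loader` are assumed interfaces, not artifacts of the paper.

```python
# D1-style bad-pixel rate for the stress test described above.
import torch

def d1_error(pred, gt, valid):
    """pred, gt: (H, W) disparities; valid: boolean mask of labeled pixels."""
    err = (pred - gt).abs()
    bad = (err > 3.0) & (err > 0.05 * gt.abs())
    return bad[valid].float().mean().item()

@torch.no_grad()
def benchmark(model, loader):
    rates = [d1_error(model(l, r), gt, gt > 0) for l, r, gt in loader]
    return sum(rates) / len(rates)

# The claim fails if benchmark(smformer, stress_loader) stays far above
# benchmark(supervised_baseline, stress_loader) on such data.
```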

Figures

Figures reproduced from arXiv: 2604.10218 by Dapeng Oliver Wu, Jiahao Zheng, Yulan Guo, Yun Wang, Zhanjie Zhang, Zhengjie Yang.

Figure 1. (a), our method, equipped with the Vision Foundation Model …
Figure 2. Hard cases in Booster [9] with reflective and texture-less regions. (c) and (e) show Winner-Take-All (WTA) disparity from the feature correlation (1/4 scale), obtained by dot products between left and right features before the 3D CNN cost aggregation. WTA disparity enhanced with the pre-trained VFM (SAM [1]) contains less noise than the original without VFM priors. Compared with (d), (f) achieves better …
Figure 3. An overview of SMFormer. During training, the framework adopts a data augmentation branch with an augmented pair (the upper part of subfigure …
Figure 4. The visualization of the reconstructed left image and the left image on …
Figure 5. Self- and cross-view attentions are used to learn left and right …
Figure 7. A pipeline of validity checks. T denotes the horizontal flip operation. Positive and negative pairs: contrastive learning constructs positive and negative samples, in which the representations in the positive samples stay close to each other while the negative ones are far apart. We first define the query pair features Q^{L,R}_p in the standard branch as the anchor points (red points in …
Figure 8. Visualization results achieved by our method and other self-supervised stereo matching methods on KITTI benchmarks. Our …
Figure 9. Visualization results achieved by our method and other supervised methods. Disparity maps are generated from the Middlebury and ETH3D benchmarks.
Figure 10. Visualization results achieved by our method and the baseline method. Note that the baseline method uses the CFNet backbone and employs the …
Figure 11. Visual comparison of SMFormer on the Middlebury PianoL pair. Top row: the predicted disparity maps with different combinations of the proposed …
Figure 12. Comparison with different pre-trained weights on KITTI 2012, …
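
The winner-take-all readout described in the Figure 2 caption has a simple skeleton: correlate left and right features by dot products at each candidate shift, then take the argmax over shifts. A hedged sketch with assumed feature shapes:

```python
# Winner-take-all (WTA) disparity from feature correlation, as the Figure 2
# caption describes: per-shift dot products between left and right features,
# then an argmax. Feature shapes and max_disp are assumptions.
import torch

def wta_disparity(feat_l, feat_r, max_disp=48):
    """feat_l, feat_r: (B, C, H, W) quarter-scale features."""
    b, c, h, w = feat_l.shape
    vol = feat_l.new_full((b, max_disp, h, w), float("-inf"))
    vol[:, 0] = (feat_l * feat_r).sum(1)                  # zero shift
    for d in range(1, max_disp):
        # left pixel x matches right pixel x - d
        vol[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).sum(1)
    return vol.argmax(dim=1)                              # (B, H, W)

disp = wta_disparity(torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128))
print(disp.shape)
```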
Original abstract

Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SMFormer, a self-supervised stereo matching framework that integrates a Vision Foundation Model (VFM) with a Feature Pyramid Network (FPN) to generate discriminative and robust features against real-world disturbances, replacing reliance on photometric consistency. It further introduces a data augmentation mechanism that enforces feature consistency under illumination variations and disparity prediction consistency between strongly augmented and standard samples. Experiments are claimed to show SOTA performance among self-supervised methods on mainstream benchmarks, competitive results with supervised methods, and outperformance of some supervised SOTA methods such as CFNet on the challenging Booster benchmark.

Significance. If the empirical results and underlying assumptions hold, the work would indicate that monocular-pretrained VFMs can supply cross-view geometric features for stereo when combined with FPN and augmentation-based regularization, substantially narrowing the accuracy gap with supervised stereo matching by mitigating photometric consistency failures. This would represent a meaningful step toward more reliable self-supervised geometric vision pipelines.

major comments (2)
  1. [Abstract] Abstract: The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency; because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate (via targeted ablations or geometric invariance tests) that these features encode the necessary cross-view information under real-world disturbances, or the SOTA gains risk being dataset-specific.
  2. [Abstract] Abstract: The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'; the paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple mainstream benchmarks' without naming them beyond Booster; adding the full list of evaluated datasets would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the claims with additional targeted evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency; because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate (via targeted ablations or geometric invariance tests) that these features encode the necessary cross-view information under real-world disturbances, or the SOTA gains risk being dataset-specific.

    Authors: We appreciate this observation on the load-bearing nature of the VFM+FPN claim. Our experiments across multiple benchmarks, including strong results on Booster, show that the integrated features yield robust stereo performance where photometric methods fail. To directly address the request for evidence of cross-view geometric encoding, we will add in the revision: (1) feature similarity metrics across stereo views under controlled disturbances (e.g., illumination and noise), and (2) ablation studies isolating the FPN's adaptation of monocular VFM features for disparity estimation. These will help confirm the gains are not dataset-specific. revision: yes

  2. Referee: [Abstract] Abstract: The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'; the paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.

    Authors: We agree that explicit verification is needed to ensure the augmentation regularizers preserve geometry and do not degrade unaugmented accuracy. Our reported results already indicate that models trained with the full augmentation pipeline achieve strong performance on standard (unaugmented) test sets, which indirectly supports lack of harmful bias. In the revised manuscript we will add direct analyses: side-by-side disparity map comparisons and endpoint-error metrics on unaugmented validation splits with/without the consistency losses, plus checks for geometric distortion (e.g., smoothness and edge preservation). revision: yes
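
Both promised analyses have compact skeletons. Below is a hedged sketch, with shapes and interfaces assumed for illustration: a cross-view feature-similarity probe for response 1, and an edge-aware smoothness score of the kind response 2 calls a geometric-distortion check.

```python
# Hedged sketches of the two promised analyses. Shapes and interfaces are
# assumptions for illustration, not the authors' code.
import torch
import torch.nn.functional as F

def crossview_similarity(feat_l, feat_r, disp):
    """Response 1: warp right-view features to the left view with ground-truth
    disparity and score cosine similarity; re-run under an illumination
    perturbation to probe robustness. feat_*: (B, C, H, W); disp: (B, 1, H, W)."""
    b, c, h, w = feat_l.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    x_src = xs.float().unsqueeze(0) - disp[:, 0]          # matching column in right view
    gx = 2 * x_src / (w - 1) - 1
    gy = (2 * ys.float() / (h - 1) - 1).unsqueeze(0).expand(b, h, w)
    grid = torch.stack((gx, gy), dim=-1)                  # (B, H, W, 2)
    warped = F.grid_sample(feat_r, grid, align_corners=True)
    return F.cosine_similarity(feat_l, warped, dim=1).mean()

def edge_aware_smoothness(disp, img):
    """Response 2: disparity gradients down-weighted at image edges; a low
    score means predicted geometry tracks image structure.
    disp: (B, 1, H, W); img: (B, 3, H, W)."""
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_i = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```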

Circularity Check

0 steps flagged

No circularity: empirical pipeline relies on external VFM and benchmark evaluation

Full rationale

The paper presents an empirical framework that integrates a pre-trained Vision Foundation Model with FPN for feature extraction and introduces data augmentation for consistency enforcement in self-supervised stereo matching. Performance claims rest on experimental results across benchmarks rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are shown reducing to inputs by construction. The central claims are falsifiable via external benchmarks and do not invoke load-bearing self-citations or self-definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach appears to rest on standard deep-learning components and the unstated assumption that foundation-model features transfer robustly to stereo tasks.

pith-pipeline@v0.9.0 · 5520 in / 1186 out tokens · 58298 ms · 2026-05-10T16:44:15.830517+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4015–4026, 2023

  2. [2]

    CFNet: Cascade and fused cost volume for robust stereo matching,

    Z. Shen, Y. Dai, and Z. Rao, “CFNet: Cascade and fused cost volume for robust stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13906–13915, 2021

  3. [3]

    Iterative geometry encoding volume for stereo matching,

    G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21919–21928, 2023

  4. [4]

    Practical stereo matching via cascaded recurrent network with adaptive correlation,

    J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16263–16272, 2022

  5. [5]

    RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,

    L. Lipson, Z. Teed, and J. Deng, “RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,” 2021 International Conference on 3D Vision (3DV), pp. 218–227, 2021

  6. [6]

    SPNet: Learning stereo matching with slanted plane aggregation,

    Y. Wang, L. Wang, H. Wang, and Y. Guo, “SPNet: Learning stereo matching with slanted plane aggregation,” IEEE Robotics and Automation Letters, 2022

  7. [7]

    Exploring fine-grained sparsity in convolutional neural networks for efficient inference,

    L. Wang, Y. Guo, X. Dong, Y. Wang, X. Ying, Z. Lin, and W. An, “Exploring fine-grained sparsity in convolutional neural networks for efficient inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 4, pp. 4474–4493, 2022

  8. [8]

    Stereo processing by semiglobal matching and mutual information,

    H. Hirschmüller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 328–341, 2007

  9. [9]

    Open challenges in deep stereo: the booster dataset,

    P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Open challenges in deep stereo: the booster dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21168–21178, 2022

  10. [10]

    Parallax attention for unsupervised stereo correspondence learning,

    L. Wang, Y. Guo, Y. Wang, Z. Liang, Z. Lin, J. Yang, and W. An, “Parallax attention for unsupervised stereo correspondence learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020

  11. [11]

    Flow2stereo: Effective self-supervised learning of optical flow and stereo matching,

    P. Liu, I. King, M. R. Lyu, and J. Xu, “Flow2stereo: Effective self-supervised learning of optical flow and stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6648–6657, 2020

  12. [12]

    Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,

    J. Zhang, K. A. Skinner, R. Vasudevan, and M. Johnson-Roberson, “Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1162–1169, 2019

  13. [13]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  14. [14]

    Depth Anything V2

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv preprint arXiv:2406.09414, 2024

  15. [15]

    Eva-02: A visual representation for neon genesis

    Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva-02: A visual representation for neon genesis,” arXiv preprint arXiv:2303.11331, 2023

  16. [16]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  17. [17]

    Playing to vision foundation model’s strengths in stereo matching,

    C.-W. Liu, Q. Chen, and R. Fan, “Playing to vision foundation model’s strengths in stereo matching,” arXiv preprint arXiv:2404.06261, 2024

  18. [18]

    Learning representations from foundation models for domain generalized stereo matching,

    Y. Zhang, L. Wang, K. Li, Y. Wang, and Y. Guo, “Learning representations from foundation models for domain generalized stereo matching,” in European Conference on Computer Vision (ECCV), pp. 146–162, Springer, 2025

  19. [19]

    Finetune like you pretrain: Improved finetuning of zero-shot vision models,

    S. Goyal, A. Kumar, S. Garg, Z. Kolter, and A. Raghunathan, “Finetune like you pretrain: Improved finetuning of zero-shot vision models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19338–19347, 2023

  20. [20]

    Parameter-efficient fine-tuning for medical image analysis: The missed opportunity,

    R. Dutt, L. Ericsson, P. Sanchez, S. A. Tsaftaris, and T. Hospedales, “Parameter-efficient fine-tuning for medical image analysis: The missed opportunity,” arXiv preprint arXiv:2305.08252, 2023

  21. [21]

    Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow,

    P. Weinzaepfel, T. Lucas, V. Leroy, Y. Cabon, V. Arora, R. Brégier, G. Csurka, L. Antsfeld, B. Chidlovskii, and J. Revaud, “Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 17969–17980, 2023

  22. [22]

    Cost volume aggregation in stereo matching revisited: A disparity classification perspective,

    Y. Wang, L. Wang, K. Li, Y. Zhang, D. O. Wu, and Y. Guo, “Cost volume aggregation in stereo matching revisited: A disparity classification perspective,” IEEE Transactions on Image Processing (TIP), 2024

  23. [23]

    Deep stereo matching with hysteresis attention and supervised cost volume construction,

    K. Zeng, Y. Wang, J. Mao, C. Liu, W. Peng, and Y. Yang, “Deep stereo matching with hysteresis attention and supervised cost volume construction,” IEEE Transactions on Image Processing (TIP), vol. 31, pp. 812–822, 2021

  24. [24]

    Active disparity sampling for stereo matching with adjoint network,

    C. Zhang, G. Meng, K. Tian, B. Ni, and S. Xiang, “Active disparity sampling for stereo matching with adjoint network,” IEEE Transactions on Image Processing (TIP), 2023

  25. [25]

    Selective-stereo: Adaptive frequency information selection for stereo matching,

    X. Wang, G. Xu, H. Jia, and X. Yang, “Selective-stereo: Adaptive frequency information selection for stereo matching,” arXiv preprint arXiv:2403.00486, 2024

  26. [26]

    Defom-stereo: Depth foundation model based stereo matching,

    H. Jiang, Z. Lou, L. Ding, R. Xu, M. Tan, W. Jiang, and R. Huang, “Defom-stereo: Depth foundation model based stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21857–21867, 2025

  27. [27]

    Foundationstereo: Zero-shot stereo matching,

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, “Foundationstereo: Zero-shot stereo matching,” 2025

  28. [28]

    All-in-one: Transferring vision foundation models into stereo matching,

    J. Zhou, H. Zhang, J. Yuan, P. Ye, T. Chen, H. Jiang, M. Chen, and Y. Zhang, “All-in-one: Transferring vision foundation models into stereo matching,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 10797–10805, 2025

  29. [29]

    Learning robust stereo matching in the wild with selective mixture-of-experts,

    Y. Wang, L. Wang, C. Zhang, Y. Zhang, Z. Zhang, A. Ma, C. Fan, T. L. Lam, and J. Hu, “Learning robust stereo matching in the wild with selective mixture-of-experts,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 21276–21287, 2025

  30. [30]

    Self-supervised learning for stereo matching with self-improving ability,

    Y. Zhong, Y. Dai, and H. Li, “Self-supervised learning for stereo matching with self-improving ability,” CoRR, vol. abs/1709.00930, 2017

  31. [31]

    Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,

    Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu, “Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8063–8073, 2019

  32. [32]

    Unsupervised occlusion-aware stereo matching with directed disparity smoothing,

    A. Li, Z. Yuan, Y. Ling, W. Chi, S. Zhang, and C. Zhang, “Unsupervised occlusion-aware stereo matching with directed disparity smoothing,” IEEE Transactions on Intelligent Transportation Systems (TITS), vol. 23, no. 7, pp. 7457–7468, 2021

  33. [33]

    Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation,

    Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang, “Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 15529–15538, 2021

  34. [34]

    Chitransformer: Towards reliable stereo from cues,

    Q. Su and S. Ji, “Chitransformer: Towards reliable stereo from cues,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1939–1949, 2022

  35. [35]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  36. [36]

    Nerf-supervised deep stereo,

    F. Tosi, A. Tonioni, D. De Gregorio, and M. Poggi, “Nerf-supervised deep stereo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 855–866, 2023

  37. [37]

    Self-supervised multi-view stereo via effective co-segmentation and data-augmentation,

    H. Xu, Z. Zhou, Y. Qiao, W. Kang, and Q. Wu, “Self-supervised multi-view stereo via effective co-segmentation and data-augmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 35, pp. 3030–3038, 2021

  38. [38]

    Rc-mvsnet: Unsupervised multi-view stereo with neural rendering,

    D. Chang, A. Božič, T. Zhang, Q. Yan, Y. Chen, S. Süsstrunk, and M. Nießner, “Rc-mvsnet: Unsupervised multi-view stereo with neural rendering,” in European Conference on Computer Vision (ECCV), pp. 665–680, Springer, 2022

  39. [39]

    Dualnet: Robust self-supervised stereo matching with pseudo-label supervision,

    Y. Wang, J. Zheng, C. Zhang, Z. Zhang, K. Li, Y. Zhang, and J. Hu, “Dualnet: Robust self-supervised stereo matching with pseudo-label supervision,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 8178–8186, 2025

  40. [40]

    Rose: Robust self-supervised stereo matching under adverse weather conditions,

    Y. Wang, J. Hu, J. Hou, C. Zhang, R. Yang, and D. O. Wu, “Rose: Robust self-supervised stereo matching under adverse weather conditions,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025

  41. [41]

    Pyramid stereo matching network,

    J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418, 2018

  42. [42]

    Pcw-net: Pyramid combination and warping cost volume for stereo matching,

    Z. Shen, Y. Dai, X. Song, Z. Rao, D. Zhou, and L. Zhang, “Pcw-net: Pyramid combination and warping cost volume for stereo matching,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 280–297, Springer, 2022

  43. [43]

    Cvcnet: Learning cost volume compression for efficient stereo matching,

    Y. Guo, Y. Wang, L. Wang, Z. Wang, and C. Cheng, “Cvcnet: Learning cost volume compression for efficient stereo matching,” IEEE Transactions on Multimedia (TMM), vol. 25, pp. 7786–7799, 2022

  44. [44]

    Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment,

    Y. Wang, K. Li, L. Wang, J. Hu, D. O. Wu, and Y. Guo, “Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment,” IEEE Transactions on Image Processing (TIP), 2025

  45. [45]

    AANet: Adaptive aggregation network for efficient stereo matching,

    H. Xu and J. Zhang, “AANet: Adaptive aggregation network for efficient stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1959–1968, 2020

  46. [46]

    Hda-net: Horizontal deformable attention network for stereo matching,

    Q. Zhang, X. Zhang, B. Li, Y. Chen, and A. Ming, “Hda-net: Horizontal deformable attention network for stereo matching,” in Proceedings of the 29th ACM International Conference on Multimedia (ACMMM), pp. 32–40, 2021

  47. [47]

    High-frequency stereo matching network,

    H. Zhao, H. Zhou, Y. Zhang, J. Chen, Y. Yang, and Y. Zhao, “High-frequency stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1327–1336, 2023

  48. [48]

    Depth anything: Unleashing the power of large-scale unlabeled data,

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  49. [49]

    Momentum contrast for unsupervised visual representation learning,

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020

  50. [50]

    Exploring simple siamese representation learning,

    X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750–15758, 2021

  51. [51]

    Contrastive learning with stronger augmentations,

    X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 5, pp. 5549–5560, 2022

  52. [52]

    Improved Baselines with Momentum Contrastive Learning

    X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020

  53. [53]

    Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,

    Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16684–16693, 2021

  54. [54]

    Revisiting domain generalized stereo matching networks from a feature consistency perspective,

    J. Zhang, X. Wang, X. Bai, C. Wang, L. Huang, Y. Chen, L. Gu, J. Zhou, T. Harada, and E. R. Hancock, “Revisiting domain generalized stereo matching networks from a feature consistency perspective,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13001–13011, 2022

  55. [55]

    Image quality assessment: from error visibility to structural similarity,

    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing (TIP), vol. 13, no. 4, pp. 600–612, 2004

  56. [56]

    Sense: Self-evolving learning for self-supervised monocular depth estimation,

    G. Li, R. Huang, H. Li, Z. You, and W. Chen, “Sense: Self-evolving learning for self-supervised monocular depth estimation,” IEEE Transactions on Image Processing (TIP), 2023

  57. [57]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009, 2022

  58. [58]

    Masked representation learning for domain generalized stereo matching,

    Z. Rao, B. Xiong, M. He, Y. Dai, R. He, Z. Shen, and X. Li, “Masked representation learning for domain generalized stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435–5444, 2023

  59. [59]

    Faster r-cnn: Towards real- time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39, no. 6, pp. 1137–1149, 2016

  60. [60]

    Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo,

    Y. Ding, Q. Zhu, X. Liu, W. Yuan, H. Zhang, and C. Zhang, “Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo,” in European Conference on Computer Vision (ECCV), pp. 630–646, Springer, 2022

  61. [61]

    Flownet: Learning optical flow with convolutional networks,

    A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766, 2015

  62. [62]

    Are we ready for autonomous driving? the KITTI vision benchmark suite,

    A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361, 2012

  63. [63]

    Object scene flow for autonomous vehicles,

    M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070, 2015

  64. [64]

    End-to-end learning of geometry and context for deep stereo regression,

    A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 66–75, 2017

  65. [65]

    Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module,

    J. Xing, Z. Qi, J. Dong, J. Cai, and H. Liu, “Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module,” in Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2020

  66. [66]

    Sgm-nets: Semi-global matching with neural networks,

    A. Seki and M. Pollefeys, “Sgm-nets: Semi-global matching with neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 231–240, 2017

  67. [67]

    Occlusion aware stereo matching via cooperative unsupervised learning,

    A. Li and Z. Yuan, “Occlusion aware stereo matching via cooperative unsupervised learning,” in Asian Conference on Computer Vision (ACCV), pp. 197–213, Springer, 2018

  68. [68]

    Digging into uncertainty-based pseudo-label for robust stereo matching,

    Z. Shen, X. Song, Y. Dai, D. Zhou, Z. Rao, and L. Zhang, “Digging into uncertainty-based pseudo-label for robust stereo matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 1–18, 2023

  69. [69]

    Los: Local structure-guided stereo matching,

    K. Li, L. Wang, Y. Zhang, K. Xue, S. Zhou, and Y. Guo, “Los: Local structure-guided stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19746–19756, 2024

  70. [70]

    Mocha-stereo: Motif channel attention network for stereo matching,

    Z. Chen, W. Long, H. Yao, Y. Zhang, B. Wang, Y. Qin, and J. Wu, “Mocha-stereo: Motif channel attention network for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27768–27777, 2024

  71. [71]

    Neural markov random field for stereo matching,

    T. Guan, C. Wang, and Y.-H. Liu, “Neural markov random field for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5459–5469, 2024

  72. [72]

    High-resolution stereo datasets with subpixel-accurate ground truth,

    D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition (GCPR), pp. 31–42, Springer, 2014

  73. [73]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos,

    T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2538–2547, 2017

  74. [74]

    Attention concatenation volume for accurate and efficient stereo matching,

    G. Xu, J. Cheng, P. Guo, and X. Yang, “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12981–12990, 2022

  75. [75]

    Unambiguous pyramid cost volumes fusion for stereo matching,

    Q. Chen, B. Ge, and J. Quan, “Unambiguous pyramid cost volumes fusion for stereo matching,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023

  76. [76]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), pp. 8748–8763, PMLR, 2021

  77. [77]

    DeepDriving: Learning affordance for direct perception in autonomous driving,

    C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730, 2015

  78. [78]

    Open challenges in deep stereo: the booster dataset,

    P. Zama Ramirez, F. Tosi, M. Poggi, S. Salti, L. Di Stefano, and S. Mattoccia, “Open challenges in deep stereo: the booster dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  79. [79]

    Virtual KITTI 2

    Y. Cabon, N. Murray, and M. Humenberger, “Virtual KITTI 2,” arXiv preprint arXiv:2001.10773, 2020