pith. machine review for the scientific record.

arxiv: 2604.10218 · v1 · submitted 2026-04-11 · 💻 cs.CV

Recognition: unknown

SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised stereo matching · vision foundation models · data augmentation · stereo matching · disparity estimation · feature pyramid network · consistency regularization · depth estimation

The pith

SMFormer uses vision foundation models and data augmentation to let self-supervised stereo matching compete with supervised methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the accuracy shortfall in self-supervised stereo matching, where the usual assumption that corresponding left and right image points share identical appearance often fails under real-world lighting changes and other disturbances. It does this by feeding features from a vision foundation model through a feature pyramid network for more stable representations, then adding a data-augmentation step that forces the network to produce consistent features and disparity maps between standard and transformed views. A reader would care because labeled training data for stereo depth is expensive to collect, so a method that works nearly as well without labels could scale to more applications. If the approach holds, self-supervised models become viable alternatives for depth estimation in varied environments where supervised training is impractical.

Core claim

SMFormer incorporates a vision foundation model together with a feature pyramid network to supply discriminative features that resist disturbances. It adds a data-augmentation procedure that explicitly enforces consistency between features extracted from standard and illumination-altered samples and that regularizes disparity outputs between strongly augmented inputs and their unaugmented counterparts. On standard benchmarks this yields state-of-the-art accuracy among self-supervised stereo methods and performance on par with supervised approaches, including better results than some supervised baselines on the Booster dataset.
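
Read as training objectives, the two regularizers have a compact form. Below is a minimal PyTorch sketch under assumed interfaces — `net`, `photo_aug`, and `strong_aug` are illustrative stand-ins, not the paper's API — with the standard branch detached as a pseudo-teacher, one common way to realize such consistency terms.

```python
# Minimal sketch of the two consistency regularizers, under assumptions:
# `net(left, right)` returns (features, disparity); `photo_aug` perturbs
# illumination only; `strong_aug` is a heavier photometric transform.
import torch.nn.functional as F

def consistency_losses(net, left, right, photo_aug, strong_aug):
    """left, right: (B, 3, H, W) rectified stereo pair."""
    feat_std, disp_std = net(left, right)

    # Feature consistency: representations of illumination-altered inputs
    # should match those of the standard branch.
    feat_aug, _ = net(photo_aug(left), photo_aug(right))
    loss_feat = F.l1_loss(feat_aug, feat_std.detach())

    # Output consistency: disparity from strongly augmented inputs should
    # match the standard-branch prediction (detached as a pseudo-teacher).
    _, disp_aug = net(strong_aug(left), strong_aug(right))
    loss_disp = F.smooth_l1_loss(disp_aug, disp_std.detach())

    return loss_feat, loss_disp
```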

What carries the argument

Vision foundation model integrated with a feature pyramid network for robust feature extraction, plus a data-augmentation pipeline that enforces feature and disparity consistency across transformations.
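
For concreteness, the sketch below wires a frozen pretrained encoder into torchvision's FeaturePyramidNetwork. A ResNet-50 stands in for the foundation model purely for illustration; the paper pairs its FPN with SAM/DINOv2-class VFMs, so none of this should be read as the authors' architecture.

```python
# Hedged sketch: a frozen pretrained encoder feeding a feature pyramid
# network. The ResNet-50 backbone is an illustrative stand-in for a VFM.
import torch
from torch import nn
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class FrozenBackboneFPN(nn.Module):
    def __init__(self, out_channels=128):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.body = create_feature_extractor(
            backbone, return_nodes={"layer1": "p2", "layer2": "p3", "layer3": "p4"})
        for p in self.body.parameters():   # keep the pretrained prior intact
            p.requires_grad = False
        self.fpn = FeaturePyramidNetwork([256, 512, 1024], out_channels)

    def forward(self, x):
        with torch.no_grad():
            feats = self.body(x)
        return self.fpn(feats)             # dict of multi-scale features

pyr = FrozenBackboneFPN()(torch.randn(1, 3, 256, 512))
print({k: tuple(v.shape) for k, v in pyr.items()})
```

Freezing the encoder preserves the pretrained prior while only the pyramid (and whatever cost aggregation sits downstream) trains.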

If this is right

  • Self-supervised stereo matching can reach accuracy levels previously attainable only with ground-truth disparity labels.
  • The approach improves handling of illumination changes and other real-world variations without extra supervision.
  • Gains appear across multiple benchmarks, with particular strength on difficult cases such as Booster.
  • Consistency regularization between augmented and standard samples becomes a reliable training signal for disparity networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same foundation-model-plus-consistency pattern could transfer to other geometric tasks that currently rely on photometric assumptions, such as optical flow estimation.
  • Reducing dependence on labeled stereo pairs would lower the barrier to deploying depth systems in new domains where annotation is costly.
  • Testing the framework with different foundation-model backbones or additional geometric augmentations would reveal how much of the gain is tied to the specific model chosen.
  • If the method scales to video sequences, it could support self-supervised video depth estimation with temporal consistency added as another regularization term; a sketch of such a term follows this list.
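
A speculative sketch of that temporal term, assuming per-frame disparities and an optical-flow field are available; nothing below appears in the paper.

```python
# Speculative temporal-consistency term: warp the previous frame's disparity
# into the current frame with optical flow and penalize disagreement.
# `flow` convention (pixel offsets from frame t to t-1) is an assumption.
import torch
import torch.nn.functional as F

def temporal_consistency(disp_t, disp_prev, flow):
    """disp_*: (B, 1, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = disp_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                # where each pixel lands in t-1
    gx = 2 * coords[:, 0] / (w - 1) - 1                 # normalize for grid_sample
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                # (B, H, W, 2)
    warped = F.grid_sample(disp_prev, grid, align_corners=True)
    return F.l1_loss(warped, disp_t)
```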

Load-bearing premise

The vision foundation model combined with the feature pyramid network produces features robust enough to withstand real-world disturbances, and the data-augmentation step enforces consistency without introducing biases that lower disparity accuracy.
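
One way to honor that premise is to keep the strong-augmentation branch purely photometric, so pixel geometry (and hence the implicit disparity target) is never displaced. A minimal torchvision policy, with parameter values that are assumptions rather than the paper's settings:

```python
# Illustrative geometry-preserving augmentation: photometric transforms only,
# so pixel-to-pixel correspondence is untouched. Parameters are assumptions.
import torch
from torchvision.transforms import v2

photometric_strong = v2.Compose([
    v2.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.1),
    v2.RandomGrayscale(p=0.2),
    v2.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

img = torch.rand(3, 256, 512)       # a fake left image in [0, 1]
aug = photometric_strong(img)       # same shape, same geometry
assert aug.shape == img.shape
```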

What would settle it

Run SMFormer on a new stereo dataset with extreme, unmodeled disturbances: if its error rates remain substantially higher than those of leading supervised methods, the central claim does not hold.
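
Operationally, that test is a bad-pixel-rate comparison. The sketch below scores a D1-style metric (error above 3 px and above 5% of ground truth); `model` and `loader` are assumed interfaces, not artifacts of the paper.

```python
# D1-style bad-pixel rate for the stress test described above.
import torch

def d1_error(pred, gt, valid):
    """pred, gt: (H, W) disparities; valid: boolean mask of labeled pixels."""
    err = (pred - gt).abs()
    bad = (err > 3.0) & (err > 0.05 * gt.abs())
    return bad[valid].float().mean().item()

@torch.no_grad()
def benchmark(model, loader):
    rates = [d1_error(model(l, r), gt, gt > 0) for l, r, gt in loader]
    return sum(rates) / len(rates)

# The claim fails if benchmark(smformer, stress_loader) stays far above
# benchmark(supervised_baseline, stress_loader) on such data.
```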

Figures

Figures reproduced from arXiv: 2604.10218 by Dapeng Oliver Wu, Jiahao Zheng, Yulan Guo, Yun Wang, Zhanjie Zhang, Zhengjie Yang.

Figure 1. (a), our method, equipped with the Vision Foundation Model …
Figure 2. Hard cases in Booster [9] with reflective and texture-less regions. (c) and (e) show Winner-Take-All (WTA) disparity from the feature correlation (1/4 scale), obtained by dot products between left and right features before the 3D CNN cost aggregation. WTA disparity enhanced with the pre-trained VFM (SAM [1]) contains less noise than the original without VFM priors. Compared with (d), (f) achieves better …
Figure 3. An overview of SMFormer. During training, the framework adopts a data augmentation branch with an augmented pair (the upper part of subfigure …
Figure 4. The visualization of the reconstructed left image and the left image on …
Figure 5. Self- and cross-view attentions are used to learn left and right …
Figure 7. A pipeline of validity checks. T denotes the horizontal flip operation. Positive and negative pairs: contrastive learning constructs positive and negative samples, in which the representations in the positive samples stay close to each other while the negative ones are far apart. We first define the query pair features Q^{L,R}_p in the standard branch as the anchor points (red points in …
Figure 8. Visualization results achieved by our method and other self-supervised stereo matching methods on KITTI benchmarks. Our …
Figure 9. Visualization results achieved by our method and other supervised methods. Disparity maps are generated from the Middlebury and ETH3D benchmarks.
Figure 10. Visualization results achieved by our method and the baseline method. Note that the baseline method uses the CFNet backbone and employs the …
Figure 11. Visual comparison of SMFormer on the Middlebury PianoL pair. Top row: the predicted disparity maps with different combinations of the proposed …
Figure 12. Comparison with different pre-trained weights on KITTI 2012, …
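
The winner-take-all readout described in the Figure 2 caption has a simple skeleton: correlate left and right features by dot products at each candidate shift, then take the argmax over shifts. A hedged sketch with assumed feature shapes:

```python
# Winner-take-all (WTA) disparity from feature correlation, as the Figure 2
# caption describes: per-shift dot products between left and right features,
# then an argmax. Feature shapes and max_disp are assumptions.
import torch

def wta_disparity(feat_l, feat_r, max_disp=48):
    """feat_l, feat_r: (B, C, H, W) quarter-scale features."""
    b, c, h, w = feat_l.shape
    vol = feat_l.new_full((b, max_disp, h, w), float("-inf"))
    vol[:, 0] = (feat_l * feat_r).sum(1)                  # zero shift
    for d in range(1, max_disp):
        # left pixel x matches right pixel x - d
        vol[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).sum(1)
    return vol.argmax(dim=1)                              # (B, H, W)

disp = wta_disparity(torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128))
print(disp.shape)
```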
Original abstract

Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SMFormer, a self-supervised stereo matching framework that integrates a Vision Foundation Model (VFM) with a Feature Pyramid Network (FPN) to generate discriminative and robust features against real-world disturbances, replacing reliance on photometric consistency. It further introduces a data augmentation mechanism that enforces feature consistency under illumination variations and disparity prediction consistency between strongly augmented and standard samples. Experiments are claimed to show SOTA performance among self-supervised methods on mainstream benchmarks, competitive results with supervised methods, and outperformance of some supervised SOTA methods such as CFNet on the challenging Booster benchmark.

Significance. If the empirical results and underlying assumptions hold, the work would indicate that monocular-pretrained VFMs can supply cross-view geometric features for stereo when combined with FPN and augmentation-based regularization, substantially narrowing the accuracy gap with supervised stereo matching by mitigating photometric consistency failures. This would represent a meaningful step toward more reliable self-supervised geometric vision pipelines.

major comments (2)
  1. [Abstract] Abstract: The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency; because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate (via targeted ablations or geometric invariance tests) that these features encode the necessary cross-view information under real-world disturbances, or the SOTA gains risk being dataset-specific.
  2. [Abstract] Abstract: The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'; the paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple mainstream benchmarks' without naming them beyond Booster; adding the full list of evaluated datasets would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the claims with additional targeted evidence.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency; because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate (via targeted ablations or geometric invariance tests) that these features encode the necessary cross-view information under real-world disturbances, or the SOTA gains risk being dataset-specific.

    Authors: We appreciate this observation on the load-bearing nature of the VFM+FPN claim. Our experiments across multiple benchmarks, including strong results on Booster, show that the integrated features yield robust stereo performance where photometric methods fail. To directly address the request for evidence of cross-view geometric encoding, we will add in the revision: (1) feature similarity metrics across stereo views under controlled disturbances (e.g., illumination and noise), and (2) ablation studies isolating the FPN's adaptation of monocular VFM features for disparity estimation. These will help confirm the gains are not dataset-specific. revision: yes

  2. Referee: [Abstract] Abstract: The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'; the paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.

    Authors: We agree that explicit verification is needed to ensure the augmentation regularizers preserve geometry and do not degrade unaugmented accuracy. Our reported results already indicate that models trained with the full augmentation pipeline achieve strong performance on standard (unaugmented) test sets, which indirectly supports lack of harmful bias. In the revised manuscript we will add direct analyses: side-by-side disparity map comparisons and endpoint-error metrics on unaugmented validation splits with/without the consistency losses, plus checks for geometric distortion (e.g., smoothness and edge preservation). revision: yes
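
Both promised analyses have compact skeletons. Below is a hedged sketch, with shapes and interfaces assumed for illustration: a cross-view feature-similarity probe for response 1, and an edge-aware smoothness score of the kind response 2 calls a geometric-distortion check.

```python
# Hedged sketches of the two promised analyses. Shapes and interfaces are
# assumptions for illustration, not the authors' code.
import torch
import torch.nn.functional as F

def crossview_similarity(feat_l, feat_r, disp):
    """Response 1: warp right-view features to the left view with ground-truth
    disparity and score cosine similarity; re-run under an illumination
    perturbation to probe robustness. feat_*: (B, C, H, W); disp: (B, 1, H, W)."""
    b, c, h, w = feat_l.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    x_src = xs.float().unsqueeze(0) - disp[:, 0]          # matching column in right view
    gx = 2 * x_src / (w - 1) - 1
    gy = (2 * ys.float() / (h - 1) - 1).unsqueeze(0).expand(b, h, w)
    grid = torch.stack((gx, gy), dim=-1)                  # (B, H, W, 2)
    warped = F.grid_sample(feat_r, grid, align_corners=True)
    return F.cosine_similarity(feat_l, warped, dim=1).mean()

def edge_aware_smoothness(disp, img):
    """Response 2: disparity gradients down-weighted at image edges; a low
    score means predicted geometry tracks image structure.
    disp: (B, 1, H, W); img: (B, 3, H, W)."""
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_i = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```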

Circularity Check

0 steps flagged

No circularity: empirical pipeline relies on external VFM and benchmark evaluation

Full rationale

The paper presents an empirical framework that integrates a pre-trained Vision Foundation Model with FPN for feature extraction and introduces data augmentation for consistency enforcement in self-supervised stereo matching. Performance claims rest on experimental results across benchmarks rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are shown reducing to inputs by construction. The central claims are falsifiable via external benchmarks and do not invoke load-bearing self-citations or self-definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach appears to rest on standard deep-learning components and the unstated assumption that foundation-model features transfer robustly to stereo tasks.

pith-pipeline@v0.9.0 · 5520 in / 1186 out tokens · 58298 ms · 2026-05-10T16:44:15.830517+00:00 · methodology


Reference graph

Works this paper leans on

79 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4015–4026, 2023

  2. [2]

    CFNet: Cascade and fused cost volume for robust stereo matching,

    Z. Shen, Y. Dai, and Z. Rao, “CFNet: Cascade and fused cost volume for robust stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13906–13915, 2021

  3. [3]

    Iterative geometry encoding volume for stereo matching,

    G. Xu, X. Wang, X. Ding, and X. Yang, “Iterative geometry encoding volume for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21919–21928, 2023

  4. [4]

    Practical stereo matching via cascaded recurrent network with adaptive correlation,

    J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16263–16272, 2022

  5. [5]

    RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,

    L. Lipson, Z. Teed, and J. Deng, “RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,” 2021 International Conference on 3D Vision (3DV), pp. 218–227, 2021

  6. [6]

    SPNet: Learning stereo matching with slanted plane aggregation,

    Y. Wang, L. Wang, H. Wang, and Y. Guo, “SPNet: Learning stereo matching with slanted plane aggregation,” IEEE Robotics and Automation Letters, 2022

  7. [7]

    Exploring fine-grained sparsity in convolutional neural networks for efficient inference,

    L. Wang, Y. Guo, X. Dong, Y. Wang, X. Ying, Z. Lin, and W. An, “Exploring fine-grained sparsity in convolutional neural networks for efficient inference,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 4, pp. 4474–4493, 2022

  8. [8]

    Stereo processing by semiglobal matching and mutual information,

    H. Hirschmüller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 328–341, 2007

  9. [9]

    Open challenges in deep stereo: the booster dataset,

    P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Open challenges in deep stereo: the booster dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21168–21178, 2022

  10. [10]

    Parallax attention for unsupervised stereo correspondence learning,

    L. Wang, Y. Guo, Y. Wang, Z. Liang, Z. Lin, J. Yang, and W. An, “Parallax attention for unsupervised stereo correspondence learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020

  11. [11]

    Flow2stereo: Effective self-supervised learning of optical flow and stereo matching,

    P. Liu, I. King, M. R. Lyu, and J. Xu, “Flow2stereo: Effective self-supervised learning of optical flow and stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6648–6657, 2020

  12. [12]

    Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,

    J. Zhang, K. A. Skinner, R. Vasudevan, and M. Johnson-Roberson, “Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1162–1169, 2019

  13. [13]

    DINOv2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  14. [14]

    Depth Anything V2

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,” arXiv preprint arXiv:2406.09414, 2024

  15. [15]

    Eva-02: A visual representation for neon genesis

    Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva-02: A visual representation for neon genesis,” arXiv preprint arXiv:2303.11331, 2023

  16. [16]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  17. [17]

    Playing to vision foundation model’s strengths in stereo matching,

    C.-W. Liu, Q. Chen, and R. Fan, “Playing to vision foundation model’s strengths in stereo matching,” arXiv preprint arXiv:2404.06261, 2024

  18. [18]

    Learning representations from foundation models for domain generalized stereo matching,

    Y. Zhang, L. Wang, K. Li, Y. Wang, and Y. Guo, “Learning representations from foundation models for domain generalized stereo matching,” in European Conference on Computer Vision (ECCV), pp. 146–162, Springer, 2025

  19. [19]

    Finetune like you pretrain: Improved finetuning of zero-shot vision models,

    S. Goyal, A. Kumar, S. Garg, Z. Kolter, and A. Raghunathan, “Finetune like you pretrain: Improved finetuning of zero-shot vision models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19338–19347, 2023

  20. [20]

    Parameter-efficient fine-tuning for medical image analysis: The missed opportunity,

    R. Dutt, L. Ericsson, P. Sanchez, S. A. Tsaftaris, and T. Hospedales, “Parameter-efficient fine-tuning for medical image analysis: The missed opportunity,” arXiv preprint arXiv:2305.08252, 2023

  21. [21]

    Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow,

    P. Weinzaepfel, T. Lucas, V. Leroy, Y. Cabon, V. Arora, R. Brégier, G. Csurka, L. Antsfeld, B. Chidlovskii, and J. Revaud, “Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 17969–17980, 2023

  22. [22]

    Cost volume aggregation in stereo matching revisited: A disparity classification perspective,

    Y. Wang, L. Wang, K. Li, Y. Zhang, D. O. Wu, and Y. Guo, “Cost volume aggregation in stereo matching revisited: A disparity classification perspective,” IEEE Transactions on Image Processing (TIP), 2024

  23. [23]

    Deep stereo matching with hysteresis attention and supervised cost volume construction,

    K. Zeng, Y. Wang, J. Mao, C. Liu, W. Peng, and Y. Yang, “Deep stereo matching with hysteresis attention and supervised cost volume construction,” IEEE Transactions on Image Processing (TIP), vol. 31, pp. 812–822, 2021

  24. [24]

    Active disparity sampling for stereo matching with adjoint network,

    C. Zhang, G. Meng, K. Tian, B. Ni, and S. Xiang, “Active disparity sampling for stereo matching with adjoint network,” IEEE Transactions on Image Processing (TIP), 2023

  25. [25]

    Selective-stereo: Adaptive frequency information selection for stereo matching,

    X. Wang, G. Xu, H. Jia, and X. Yang, “Selective-stereo: Adaptive frequency information selection for stereo matching,” arXiv preprint arXiv:2403.00486, 2024

  26. [26]

    Defom-stereo: Depth foundation model based stereo matching,

    H. Jiang, Z. Lou, L. Ding, R. Xu, M. Tan, W. Jiang, and R. Huang, “Defom-stereo: Depth foundation model based stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21857–21867, 2025

  27. [27]

    Foundationstereo: Zero-shot stereo matching,

    B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, “Foundationstereo: Zero-shot stereo matching,” 2025

  28. [28]

    All-in-one: Transferring vision foundation models into stereo matching,

    J. Zhou, H. Zhang, J. Yuan, P. Ye, T. Chen, H. Jiang, M. Chen, and Y. Zhang, “All-in-one: Transferring vision foundation models into stereo matching,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 10797–10805, 2025

  29. [29]

    Learning robust stereo matching in the wild with selective mixture-of-experts,

    Y. Wang, L. Wang, C. Zhang, Y. Zhang, Z. Zhang, A. Ma, C. Fan, T. L. Lam, and J. Hu, “Learning robust stereo matching in the wild with selective mixture-of-experts,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 21276–21287, 2025

  30. [30]

    Self-supervised learning for stereo matching with self-improving ability,

    Y. Zhong, Y. Dai, and H. Li, “Self-supervised learning for stereo matching with self-improving ability,” CoRR, vol. abs/1709.00930, 2017

  31. [31]

    Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,

    Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu, “Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8063–8073, 2019

  32. [32]

    Unsupervised occlusion-aware stereo matching with directed disparity smoothing,

    A. Li, Z. Yuan, Y. Ling, W. Chi, S. Zhang, and C. Zhang, “Unsupervised occlusion-aware stereo matching with directed disparity smoothing,” IEEE Transactions on Intelligent Transportation Systems (TITS), vol. 23, no. 7, pp. 7457–7468, 2021

  33. [33]

    Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation,

    Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang, “Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 15529–15538, 2021

  34. [34]

    Chitransformer: Towards reliable stereo from cues,

    Q. Su and S. Ji, “Chitransformer: Towards reliable stereo from cues,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1939–1949, 2022

  35. [35]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  36. [36]

    Nerf-supervised deep stereo,

    F. Tosi, A. Tonioni, D. De Gregorio, and M. Poggi, “Nerf-supervised deep stereo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 855–866, 2023

  37. [37]

    Self-supervised multi-view stereo via effective co-segmentation and data-augmentation,

    H. Xu, Z. Zhou, Y. Qiao, W. Kang, and Q. Wu, “Self-supervised multi-view stereo via effective co-segmentation and data-augmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 35, pp. 3030–3038, 2021

  38. [38]

    Rc-mvsnet: Unsupervised multi-view stereo with neural rendering,

    D. Chang, A. Božič, T. Zhang, Q. Yan, Y. Chen, S. Süsstrunk, and M. Nießner, “Rc-mvsnet: Unsupervised multi-view stereo with neural rendering,” in European Conference on Computer Vision (ECCV), pp. 665–680, Springer, 2022

  39. [39]

    Dualnet: Robust self-supervised stereo matching with pseudo-label supervision,

    Y. Wang, J. Zheng, C. Zhang, Z. Zhang, K. Li, Y. Zhang, and J. Hu, “Dualnet: Robust self-supervised stereo matching with pseudo-label supervision,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 8178–8186, 2025

  40. [40]

    Rose: Robust self-supervised stereo matching under adverse weather conditions,

    Y. Wang, J. Hu, J. Hou, C. Zhang, R. Yang, and D. O. Wu, “Rose: Robust self-supervised stereo matching under adverse weather conditions,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025

  41. [41]

    Pyramid stereo matching network,

    J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418, 2018

  42. [42]

    Pcw-net: Pyramid combination and warping cost volume for stereo matching,

    Z. Shen, Y. Dai, X. Song, Z. Rao, D. Zhou, and L. Zhang, “Pcw-net: Pyramid combination and warping cost volume for stereo matching,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 280–297, Springer, 2022

  43. [43]

    Cvcnet: Learning cost volume compression for efficient stereo matching,

    Y. Guo, Y. Wang, L. Wang, Z. Wang, and C. Cheng, “Cvcnet: Learning cost volume compression for efficient stereo matching,” IEEE Transactions on Multimedia (TMM), vol. 25, pp. 7786–7799, 2022

  44. [44]

    Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment,

    Y. Wang, K. Li, L. Wang, J. Hu, D. O. Wu, and Y. Guo, “Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment,” IEEE Transactions on Image Processing (TIP), 2025

  45. [45]

    AANet: Adaptive aggregation network for efficient stereo matching,

    H. Xu and J. Zhang, “AANet: Adaptive aggregation network for efficient stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1959–1968, 2020

  46. [46]

    Hda-net: Horizontal deformable attention network for stereo matching,

    Q. Zhang, X. Zhang, B. Li, Y. Chen, and A. Ming, “Hda-net: Horizontal deformable attention network for stereo matching,” in Proceedings of the 29th ACM International Conference on Multimedia (ACMMM), pp. 32–40, 2021

  47. [47]

    High-frequency stereo matching network,

    H. Zhao, H. Zhou, Y. Zhang, J. Chen, Y. Yang, and Y. Zhao, “High-frequency stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1327–1336, 2023

  48. [48]

    Depth anything: Unleashing the power of large-scale unlabeled data,

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  49. [49]

    Momentum contrast for unsupervised visual representation learning,

    K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020

  50. [50]

    Exploring simple siamese representation learning,

    X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750–15758, 2021

  51. [51]

    Contrastive learning with stronger augmentations,

    X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 5, pp. 5549–5560, 2022

  52. [52]

    Improved Baselines with Momentum Contrastive Learning

    X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020

  53. [53]

    Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,

    Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, “Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16684–16693, 2021

  54. [54]

    Revisiting domain generalized stereo matching networks from a feature consistency perspective,

    J. Zhang, X. Wang, X. Bai, C. Wang, L. Huang, Y. Chen, L. Gu, J. Zhou, T. Harada, and E. R. Hancock, “Revisiting domain generalized stereo matching networks from a feature consistency perspective,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13001–13011, 2022

  55. [55]

    Image quality assessment: from error visibility to structural similarity,

    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing (TIP), vol. 13, no. 4, pp. 600–612, 2004

  56. [56]

    Sense: Self-evolving learning for self-supervised monocular depth estimation,

    G. Li, R. Huang, H. Li, Z. You, and W. Chen, “Sense: Self-evolving learning for self-supervised monocular depth estimation,” IEEE Transactions on Image Processing (TIP), 2023

  57. [57]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009, 2022

  58. [58]

    Masked representation learning for domain generalized stereo matching,

    Z. Rao, B. Xiong, M. He, Y. Dai, R. He, Z. Shen, and X. Li, “Masked representation learning for domain generalized stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435–5444, 2023

  59. [59]

    Faster r-cnn: Towards real- time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39, no. 6, pp. 1137–1149, 2016

  60. [60]

    Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo,

    Y. Ding, Q. Zhu, X. Liu, W. Yuan, H. Zhang, and C. Zhang, “Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo,” in European Conference on Computer Vision (ECCV), pp. 630–646, Springer, 2022

  61. [61]

    Flownet: Learning optical flow with convolutional networks,

    A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766, 2015

  62. [62]

    Are we ready for autonomous driving? the KITTI vision benchmark suite,

    A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361, 2012

  63. [63]

    Object scene flow for autonomous vehicles,

    M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070, 2015

  64. [64]

    End-to-end learning of geometry and context for deep stereo regression,

    A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 66–75, 2017

  65. [65]

    Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module,

    J. Xing, Z. Qi, J. Dong, J. Cai, and H. Liu, “Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module,” in Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2020

  66. [66]

    Sgm-nets: Semi-global matching with neural networks,

    A. Seki and M. Pollefeys, “Sgm-nets: Semi-global matching with neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 231–240, 2017

  67. [67]

    Occlusion aware stereo matching via cooperative unsupervised learning,

    A. Li and Z. Yuan, “Occlusion aware stereo matching via cooperative unsupervised learning,” in Asian Conference on Computer Vision (ACCV), pp. 197–213, Springer, 2018

  68. [68]

    Digging into uncertainty-based pseudo-label for robust stereo matching,

    Z. Shen, X. Song, Y. Dai, D. Zhou, Z. Rao, and L. Zhang, “Digging into uncertainty-based pseudo-label for robust stereo matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 1–18, 2023

  69. [69]

    Los: Local structure-guided stereo matching,

    K. Li, L. Wang, Y. Zhang, K. Xue, S. Zhou, and Y. Guo, “Los: Local structure-guided stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19746–19756, 2024

  70. [70]

    Mocha-stereo: Motif channel attention network for stereo matching,

    Z. Chen, W. Long, H. Yao, Y. Zhang, B. Wang, Y. Qin, and J. Wu, “Mocha-stereo: Motif channel attention network for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27768–27777, 2024

  71. [71]

    Neural markov random field for stereo matching,

    T. Guan, C. Wang, and Y.-H. Liu, “Neural markov random field for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5459–5469, 2024

  72. [72]

    High-resolution stereo datasets with subpixel-accurate ground truth,

    D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, “High-resolution stereo datasets with subpixel-accurate ground truth,” in German Conference on Pattern Recognition (GCPR), pp. 31–42, Springer, 2014

  73. [73]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos,

    T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2538–2547, 2017

  74. [74]

    Attention concatenation volume for accurate and efficient stereo matching,

    G. Xu, J. Cheng, P. Guo, and X. Yang, “Attention concatenation volume for accurate and efficient stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12981–12990, 2022

  75. [75]

    Unambiguous pyramid cost volumes fusion for stereo matching,

    Q. Chen, B. Ge, and J. Quan, “Unambiguous pyramid cost volumes fusion for stereo matching,” IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023

  76. [76]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), pp. 8748–8763, PMLR, 2021

  77. [77]

    DeepDriving: Learning affordance for direct perception in autonomous driving,

    C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: Learning affordance for direct perception in autonomous driving,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730, 2015

  78. [78]

    Open challenges in deep stereo: the booster dataset,

    P. Zama Ramirez, F. Tosi, M. Poggi, S. Salti, L. Di Stefano, and S. Mattoccia, “Open challenges in deep stereo: the booster dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  79. [79]

    Virtual KITTI 2

    Y. Cabon, N. Murray, and M. Humenberger, “Virtual KITTI 2,” arXiv preprint arXiv:2001.10773, 2020