SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3
The pith
SMFormer uses vision foundation models and data augmentation to let self-supervised stereo matching compete with supervised methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SMFormer incorporates a vision foundation model together with a feature pyramid network to supply discriminative features that resist disturbances. It adds a data-augmentation procedure that explicitly enforces consistency between features extracted from standard and illumination-altered samples and that regularizes disparity outputs between strongly augmented inputs and their unaugmented counterparts. On standard benchmarks this yields state-of-the-art accuracy among self-supervised stereo methods and performance on par with supervised approaches, including better results than some supervised baselines on the Booster dataset.
What carries the argument
Vision foundation model integrated with a feature pyramid network for robust feature extraction, plus a data-augmentation pipeline that enforces feature and disparity consistency across transformations.
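The two consistency terms described above can be sketched as a single training loss: one penalty ties features of a standard sample to features of its illumination-altered copy, and another ties the disparity predicted from a strongly augmented input to the disparity predicted from the standard input, which acts as a pseudo-label. A minimal pure-Python sketch, with illustrative names and weights that are not from the paper:

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def consistency_loss(feat_std, feat_aug, disp_std, disp_aug,
                     w_feat=1.0, w_disp=1.0):
    """Hypothetical combined self-supervision term.

    feat_std / feat_aug: features from a standard sample and its
    illumination-altered copy (feature consistency).
    disp_std / disp_aug: disparities predicted from the standard and
    strongly augmented inputs; the standard prediction serves as a
    pseudo-label, so no ground-truth disparity is needed.
    """
    loss_feat = l1(feat_std, feat_aug)   # features should survive lighting changes
    loss_disp = l1(disp_std, disp_aug)   # outputs should survive strong augmentation
    return w_feat * loss_feat + w_disp * loss_disp
```

In a real pipeline both terms would be computed per pixel on tensors and the disparity pseudo-label would be detached from the gradient graph; this sketch only shows the structure of the signal.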
If this is right
- Self-supervised stereo matching can reach accuracy levels previously attainable only with ground-truth disparity labels.
- The approach improves handling of illumination changes and other real-world variations without extra supervision.
- Gains appear across multiple benchmarks, with particular strength on difficult cases such as Booster.
- Consistency regularization between augmented and standard samples becomes a reliable training signal for disparity networks.
Where Pith is reading between the lines
- The same foundation-model-plus-consistency pattern could transfer to other geometric tasks that currently rely on photometric assumptions, such as optical flow estimation.
- Reducing dependence on labeled stereo pairs would lower the barrier to deploying depth systems in new domains where annotation is costly.
- Testing the framework with different foundation-model backbones or additional geometric augmentations would reveal how much of the gain is tied to the specific model chosen.
- If the method scales to video sequences, it could support self-supervised video depth estimation with temporal consistency added as another regularization term.
Load-bearing premise
The vision foundation model combined with the feature pyramid network produces features sufficiently robust to real-world disturbances, and the data-augmentation step enforces consistency without introducing biases that lower disparity accuracy.
What would settle it
Running SMFormer on a new stereo dataset containing extreme unmodeled disturbances and observing that its error rates remain substantially higher than those of leading supervised methods would show the central claim does not hold.
read the original abstract
Recent self-supervised stereo matching methods have made significant progress. They typically rely on the photometric consistency assumption, which presumes corresponding points across views share the same appearance. However, this assumption could be compromised by real-world disturbances, resulting in invalid supervisory signals and a significant accuracy gap compared to supervised methods. To address this issue, we propose SMFormer, a framework integrating more reliable self-supervision guided by the Vision Foundation Model (VFM) and data augmentation. We first incorporate the VFM with the Feature Pyramid Network (FPN), providing a discriminative and robust feature representation against disturbance in various scenarios. We then devise an effective data augmentation mechanism that ensures robustness to various transformations. The data augmentation mechanism explicitly enforces consistency between learned features and those influenced by illumination variations. Additionally, it regularizes the output consistency between disparity predictions of strong augmented samples and those generated from standard samples. Experiments on multiple mainstream benchmarks demonstrate that our SMFormer achieves state-of-the-art (SOTA) performance among self-supervised methods and even competes on par with supervised ones. Remarkably, in the challenging Booster benchmark, SMFormer even outperforms some SOTA supervised methods, such as CFNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SMFormer, a self-supervised stereo matching framework that integrates a Vision Foundation Model (VFM) with a Feature Pyramid Network (FPN) to generate discriminative and robust features against real-world disturbances, replacing reliance on photometric consistency. It further introduces a data augmentation mechanism that enforces feature consistency under illumination variations and disparity prediction consistency between strongly augmented and standard samples. Experiments are claimed to show SOTA performance among self-supervised methods on mainstream benchmarks, competitive results with supervised methods, and outperformance of some supervised SOTA methods such as CFNet on the challenging Booster benchmark.
Significance. If the empirical results and underlying assumptions hold, the work would indicate that monocular-pretrained VFMs can supply cross-view geometric features for stereo when combined with FPN and augmentation-based regularization, substantially narrowing the accuracy gap with supervised stereo matching by mitigating photometric consistency failures. This would represent a meaningful step toward more reliable self-supervised geometric vision pipelines.
major comments (2)
- [Abstract] The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency. Because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate, via targeted ablations or geometric invariance tests, that these features encode the necessary cross-view information under real-world disturbances; otherwise the SOTA gains risk being dataset-specific.
- [Abstract] The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'. The paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.
minor comments (1)
- [Abstract] The abstract refers to 'multiple mainstream benchmarks' without naming them beyond Booster; adding the full list of evaluated datasets would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the claims with additional targeted evidence.
read point-by-point responses
-
Referee: [Abstract] The central claim that VFM+FPN supplies 'discriminative and robust feature representation against disturbance in various scenarios' is load-bearing for replacing photometric consistency; because VFMs are pretrained on monocular 2D tasks, the manuscript must demonstrate (via targeted ablations or geometric invariance tests) that these features encode the necessary cross-view information under real-world disturbances, or the SOTA gains risk being dataset-specific.
Authors: We appreciate this observation on the load-bearing nature of the VFM+FPN claim. Our experiments across multiple benchmarks, including strong results on Booster, show that the integrated features yield robust stereo performance where photometric methods fail. To directly address the request for evidence of cross-view geometric encoding, we will add in the revision: (1) feature similarity metrics across stereo views under controlled disturbances (e.g., illumination and noise), and (2) ablation studies isolating the FPN's adaptation of monocular VFM features for disparity estimation. These will help confirm the gains are not dataset-specific. revision: yes
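The "feature similarity metrics across stereo views" promised in this response could take a simple form: warp right-view features into the left view with the known (or estimated) disparities, then measure per-pixel cosine similarity, with and without a synthetic disturbance applied to one view. A toy 1-D pure-Python sketch, with illustrative names that are assumptions rather than the paper's protocol:

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def warp_right_to_left(right_feats, disparities):
    """Warp per-pixel right-view features into the left view using
    integer disparities (a 1-D toy version of stereo warping)."""
    return [right_feats[i - d] for i, d in enumerate(disparities)]

def cross_view_similarity(left_feats, right_feats, disparities):
    """Mean cosine similarity between left features and disparity-warped
    right features; higher values suggest the features carry
    cross-view-stable information."""
    warped = warp_right_to_left(right_feats, disparities)
    sims = [cosine_sim(l, w) for l, w in zip(left_feats, warped)]
    return sum(sims) / len(sims)
```

One useful property of this probe: a pure per-view gain change (a crude stand-in for illumination variation) leaves cosine similarity unchanged, so any drop measured under realistic disturbances isolates effects beyond global brightness scaling.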
-
Referee: [Abstract] The data-augmentation mechanism is described as explicitly enforcing 'consistency between learned features and those influenced by illumination variations' plus 'output consistency between disparity predictions of strong augmented samples and those generated from standard samples'; the paper must show that these regularizers do not alter disparity geometry or introduce harmful biases that degrade unaugmented accuracy, as this directly underpins the self-supervised performance claims.
Authors: We agree that explicit verification is needed to ensure the augmentation regularizers preserve geometry and do not degrade unaugmented accuracy. Our reported results already indicate that models trained with the full augmentation pipeline achieve strong performance on standard (unaugmented) test sets, which indirectly supports lack of harmful bias. In the revised manuscript we will add direct analyses: side-by-side disparity map comparisons and endpoint-error metrics on unaugmented validation splits with/without the consistency losses, plus checks for geometric distortion (e.g., smoothness and edge preservation). revision: yes
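The "endpoint-error metrics on unaugmented validation splits" mentioned in this response are standard in stereo evaluation; a minimal sketch of the two usual quantities (names and thresholds illustrative, the 3-pixel threshold follows KITTI convention):

```python
def epe(pred, gt):
    """End-point error: mean absolute disparity error over valid pixels."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def bad_pixel_rate(pred, gt, thresh=3.0):
    """Fraction of pixels whose disparity error exceeds `thresh`
    (the KITTI-style 'bad-3' rate)."""
    return sum(abs(p - g) > thresh for p, g in zip(pred, gt)) / len(gt)
```

Running both metrics on the same unaugmented split for models trained with and without the consistency losses is the comparison the referee asks for: if the regularizers bias the geometry, it would show up as a gap here.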
Circularity Check
No circularity: empirical pipeline relies on external VFM and benchmark evaluation
full rationale
The paper presents an empirical framework that integrates a pre-trained Vision Foundation Model with FPN for feature extraction and introduces data augmentation for consistency enforcement in self-supervised stereo matching. Performance claims rest on experimental results across benchmarks rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are shown reducing to inputs by construction. The central claims are falsifiable via external benchmarks and do not invoke load-bearing self-citations or self-definitional loops.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Segment anything,
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., "Segment anything," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4015–4026, 2023.
2023
-
[2]
CFNet: Cascade and fused cost volume for robust stereo matching,
Z. Shen, Y. Dai, and Z. Rao, "CFNet: Cascade and fused cost volume for robust stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13906–13915, 2021.
2021
-
[3]
Iterative geometry encoding volume for stereo matching,
G. Xu, X. Wang, X. Ding, and X. Yang, "Iterative geometry encoding volume for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21919–21928, 2023.
2023
-
[4]
Practical stereo matching via cascaded recurrent network with adaptive correlation,
J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, "Practical stereo matching via cascaded recurrent network with adaptive correlation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16263–16272, 2022.
2022
-
[5]
RAFT-Stereo: Multilevel recurrent field transforms for stereo matching,
L. Lipson, Z. Teed, and J. Deng, "RAFT-Stereo: Multilevel recurrent field transforms for stereo matching," 2021 International Conference on 3D Vision (3DV), pp. 218–227, 2021.
2021
-
[6]
SPNet: Learning stereo matching with slanted plane aggregation,
Y. Wang, L. Wang, H. Wang, and Y. Guo, "SPNet: Learning stereo matching with slanted plane aggregation," IEEE Robotics and Automation Letters, 2022.
2022
-
[7]
Exploring fine-grained sparsity in convolutional neural networks for efficient inference,
L. Wang, Y. Guo, X. Dong, Y. Wang, X. Ying, Z. Lin, and W. An, "Exploring fine-grained sparsity in convolutional neural networks for efficient inference," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 4, pp. 4474–4493, 2022.
2022
-
[8]
Stereo processing by semiglobal matching and mutual information,
H. Hirschmuller, "Stereo processing by semiglobal matching and mutual information," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 328–341, 2007.
2007
-
[9]
Open challenges in deep stereo: the booster dataset,
P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, "Open challenges in deep stereo: the booster dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21168–21178, 2022.
2022
-
[10]
Parallax attention for unsupervised stereo correspondence learning,
L. Wang, Y. Guo, Y. Wang, Z. Liang, Z. Lin, J. Yang, and W. An, "Parallax attention for unsupervised stereo correspondence learning," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
2020
-
[11]
Flow2stereo: Effective self-supervised learning of optical flow and stereo matching,
P. Liu, I. King, M. R. Lyu, and J. Xu, "Flow2stereo: Effective self-supervised learning of optical flow and stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6648–6657, 2020.
2020
-
[12]
Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery,
J. Zhang, K. A. Skinner, R. Vasudevan, and M. Johnson-Roberson, "Dispsegnet: Leveraging semantics for end-to-end learning of disparity estimation from stereo imagery," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1162–1169, 2019.
2019
-
[13]
DINOv2: Learning robust visual features without supervision,
M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, "DINOv2: Learning robust visual features without supervision," 2024.
2024
-
[14]
Depth anything v2,
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, "Depth anything v2," arXiv preprint arXiv:2406.09414, 2024.
2024
-
[15]
Eva-02: A visual representation for neon genesis,
Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao, "Eva-02: A visual representation for neon genesis," arXiv preprint arXiv:2303.11331, 2023.
2023
-
[16]
Dust3r: Geometric 3d vision made easy,
S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud, "Dust3r: Geometric 3d vision made easy," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
2024
-
[17]
Playing to vision foundation model's strengths in stereo matching,
C.-W. Liu, Q. Chen, and R. Fan, "Playing to vision foundation model's strengths in stereo matching," arXiv preprint arXiv:2404.06261, 2024.
2024
-
[18]
Learning representations from foundation models for domain generalized stereo matching,
Y. Zhang, L. Wang, K. Li, Y. Wang, and Y. Guo, "Learning representations from foundation models for domain generalized stereo matching," in European Conference on Computer Vision (ECCV), pp. 146–162, Springer, 2025.
2025
-
[19]
Finetune like you pretrain: Improved finetuning of zero-shot vision models,
S. Goyal, A. Kumar, S. Garg, Z. Kolter, and A. Raghunathan, "Finetune like you pretrain: Improved finetuning of zero-shot vision models," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19338–19347, 2023.
2023
-
[20]
Parameter-efficient fine-tuning for medical image analysis: The missed opportunity,
R. Dutt, L. Ericsson, P. Sanchez, S. A. Tsaftaris, and T. Hospedales, "Parameter-efficient fine-tuning for medical image analysis: The missed opportunity," arXiv preprint arXiv:2305.08252, 2023.
2023
-
[21]
Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow,
P. Weinzaepfel, T. Lucas, V. Leroy, Y. Cabon, V. Arora, R. Brégier, G. Csurka, L. Antsfeld, B. Chidlovskii, and J. Revaud, "Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 17969–17980, 2023.
2023
-
[22]
Cost volume aggregation in stereo matching revisited: A disparity classification perspective,
Y. Wang, L. Wang, K. Li, Y. Zhang, D. O. Wu, and Y. Guo, "Cost volume aggregation in stereo matching revisited: A disparity classification perspective," IEEE Transactions on Image Processing (TIP), 2024.
2024
-
[23]
Deep stereo matching with hysteresis attention and supervised cost volume construction,
K. Zeng, Y. Wang, J. Mao, C. Liu, W. Peng, and Y. Yang, "Deep stereo matching with hysteresis attention and supervised cost volume construction," IEEE Transactions on Image Processing (TIP), vol. 31, pp. 812–822, 2021.
2021
-
[24]
Active disparity sampling for stereo matching with adjoint network,
C. Zhang, G. Meng, K. Tian, B. Ni, and S. Xiang, "Active disparity sampling for stereo matching with adjoint network," IEEE Transactions on Image Processing (TIP), 2023.
2023
-
[25]
Selective-stereo: Adaptive frequency information selection for stereo matching,
X. Wang, G. Xu, H. Jia, and X. Yang, "Selective-stereo: Adaptive frequency information selection for stereo matching," arXiv preprint arXiv:2403.00486, 2024.
2024
-
[26]
Defom-stereo: Depth foundation model based stereo matching,
H. Jiang, Z. Lou, L. Ding, R. Xu, M. Tan, W. Jiang, and R. Huang, "Defom-stereo: Depth foundation model based stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21857–21867, 2025.
2025
-
[27]
Foundationstereo: Zero-shot stereo matching,
B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, "Foundationstereo: Zero-shot stereo matching," 2025.
2025
-
[28]
All-in-one: Transferring vision foundation models into stereo matching,
J. Zhou, H. Zhang, J. Yuan, P. Ye, T. Chen, H. Jiang, M. Chen, and Y. Zhang, "All-in-one: Transferring vision foundation models into stereo matching," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 10797–10805, 2025.
2025
-
[29]
Learning robust stereo matching in the wild with selective mixture-of-experts,
Y. Wang, L. Wang, C. Zhang, Y. Zhang, Z. Zhang, A. Ma, C. Fan, T. L. Lam, and J. Hu, "Learning robust stereo matching in the wild with selective mixture-of-experts," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 21276–21287, 2025.
2025
-
[30]
Self-supervised learning for stereo matching with self-improving ability,
Y. Zhong, Y. Dai, and H. Li, "Self-supervised learning for stereo matching with self-improving ability," CoRR, vol. abs/1709.00930, 2017.
2017
-
[31]
Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,
Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu, "Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8063–8073, 2019.
2019
-
[32]
Unsupervised occlusion-aware stereo matching with directed disparity smoothing,
A. Li, Z. Yuan, Y. Ling, W. Chi, S. Zhang, and C. Zhang, "Unsupervised occlusion-aware stereo matching with directed disparity smoothing," IEEE Transactions on Intelligent Transportation Systems (TITS), vol. 23, no. 7, pp. 7457–7468, 2021.
2021
-
[33]
Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation,
Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang, "Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 15529–15538, 2021.
2021
-
[34]
Chitransformer: Towards reliable stereo from cues,
Q. Su and S. Ji, "Chitransformer: Towards reliable stereo from cues," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1939–1949, 2022.
2022
-
[35]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
2020
-
[36]
Nerf-supervised deep stereo,
F. Tosi, A. Tonioni, D. De Gregorio, and M. Poggi, "Nerf-supervised deep stereo," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 855–866, 2023.
2023
-
[37]
Self-supervised multi-view stereo via effective co-segmentation and data-augmentation,
H. Xu, Z. Zhou, Y. Qiao, W. Kang, and Q. Wu, "Self-supervised multi-view stereo via effective co-segmentation and data-augmentation," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 35, pp. 3030–3038, 2021.
2021
-
[38]
Rc-mvsnet: Unsupervised multi-view stereo with neural rendering,
D. Chang, A. Božič, T. Zhang, Q. Yan, Y. Chen, S. Süsstrunk, and M. Nießner, "Rc-mvsnet: Unsupervised multi-view stereo with neural rendering," in European Conference on Computer Vision (ECCV), pp. 665–680, Springer, 2022.
2022
-
[39]
Dualnet: Robust self-supervised stereo matching with pseudo-label supervision,
Y. Wang, J. Zheng, C. Zhang, Z. Zhang, K. Li, Y. Zhang, and J. Hu, "Dualnet: Robust self-supervised stereo matching with pseudo-label supervision," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), vol. 39, pp. 8178–8186, 2025.
2025
-
[40]
Rose: Robust self-supervised stereo matching under adverse weather conditions,
Y. Wang, J. Hu, J. Hou, C. Zhang, R. Yang, and D. O. Wu, "Rose: Robust self-supervised stereo matching under adverse weather conditions," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025.
2025
-
[41]
Pyramid stereo matching network,
J.-R. Chang and Y.-S. Chen, "Pyramid stereo matching network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418, 2018.
2018
-
[42]
Pcw-net: Pyramid combination and warping cost volume for stereo matching,
Z. Shen, Y. Dai, X. Song, Z. Rao, D. Zhou, and L. Zhang, "Pcw-net: Pyramid combination and warping cost volume for stereo matching," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 280–297, Springer, 2022.
2022
-
[43]
Cvcnet: Learning cost volume compression for efficient stereo matching,
Y. Guo, Y. Wang, L. Wang, Z. Wang, and C. Cheng, "Cvcnet: Learning cost volume compression for efficient stereo matching," IEEE Transactions on Multimedia (TMM), vol. 25, pp. 7786–7799, 2022.
2022
-
[44]
Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment,
Y. Wang, K. Li, L. Wang, J. Hu, D. O. Wu, and Y. Guo, "Adstereo: Efficient stereo matching with adaptive downsampling and disparity alignment," IEEE Transactions on Image Processing (TIP), 2025.
2025
-
[45]
AANet: Adaptive aggregation network for efficient stereo matching,
H. Xu and J. Zhang, "AANet: Adaptive aggregation network for efficient stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1959–1968, 2020.
2020
-
[46]
Hda-net: Horizontal deformable attention network for stereo matching,
Q. Zhang, X. Zhang, B. Li, Y. Chen, and A. Ming, "Hda-net: Horizontal deformable attention network for stereo matching," in Proceedings of the 29th ACM International Conference on Multimedia (ACMMM), pp. 32–40, 2021.
2021
-
[47]
High-frequency stereo matching network,
H. Zhao, H. Zhou, Y. Zhang, J. Chen, Y. Yang, and Y. Zhao, "High-frequency stereo matching network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1327–1336, 2023.
2023
-
[48]
Depth anything: Unleashing the power of large-scale unlabeled data,
L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth anything: Unleashing the power of large-scale unlabeled data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
2024
-
[49]
Momentum contrast for unsupervised visual representation learning,
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9729–9738, 2020.
2020
-
[50]
Exploring simple siamese representation learning,
X. Chen and K. He, "Exploring simple siamese representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15750–15758, 2021.
2021
-
[51]
Contrastive learning with stronger augmentations,
X. Wang and G.-J. Qi, "Contrastive learning with stronger augmentations," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 5, pp. 5549–5560, 2022.
2022
-
[52]
Improved Baselines with Momentum Contrastive Learning,
X. Chen, H. Fan, R. Girshick, and K. He, "Improved baselines with momentum contrastive learning," arXiv preprint arXiv:2003.04297, 2020.
2020
-
[53]
Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning,
Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu, "Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16684–16693, 2021.
2021
-
[54]
Revisiting domain generalized stereo matching networks from a feature consistency perspective,
J. Zhang, X. Wang, X. Bai, C. Wang, L. Huang, Y. Chen, L. Gu, J. Zhou, T. Harada, and E. R. Hancock, "Revisiting domain generalized stereo matching networks from a feature consistency perspective," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13001–13011, 2022.
2022
-
[55]
Image quality assessment: from error visibility to structural similarity,
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing (TIP), vol. 13, no. 4, pp. 600–612, 2004.
2004
-
[56]
Sense: Self-evolving learning for self-supervised monocular depth estimation,
G. Li, R. Huang, H. Li, Z. You, and W. Chen, "Sense: Self-evolving learning for self-supervised monocular depth estimation," IEEE Transactions on Image Processing (TIP), 2023.
2023
-
[57]
Masked autoencoders are scalable vision learners,
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16000–16009, 2022.
2022
-
[58]
Masked representation learning for domain generalized stereo matching,
Z. Rao, B. Xiong, M. He, Y. Dai, R. He, Z. Shen, and X. Li, "Masked representation learning for domain generalized stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5435–5444, 2023.
2023
-
[59]
Faster r-cnn: Towards real-time object detection with region proposal networks,
S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 39, no. 6, pp. 1137–1149, 2016.
2016
-
[60]
Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo,
Y. Ding, Q. Zhu, X. Liu, W. Yuan, H. Zhang, and C. Zhang, "Kd-mvs: Knowledge distillation based self-supervised learning for multi-view stereo," in European Conference on Computer Vision (ECCV), pp. 630–646, Springer, 2022.
2022
-
[61]
Flownet: Learning optical flow with convolutional networks,
A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, "Flownet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766, 2015.
2015
-
[62]
Are we ready for autonomous driving? the KITTI vision benchmark suite,
A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the KITTI vision benchmark suite," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361, 2012.
2012
-
[63]
Object scene flow for autonomous vehicles,
M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070, 2015.
2015
-
[64]
End-to-end learning of geometry and context for deep stereo regression,
A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, "End-to-end learning of geometry and context for deep stereo regression," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 66–75, 2017.
2017
-
[65]
Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module,
J. Xing, Z. Qi, J. Dong, J. Cai, and H. Liu, "Mabnet: a lightweight stereo network based on multibranch adjustable bottleneck module," in Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2020.
2020
-
[66]
Sgm-nets: Semi-global matching with neural networks,
A. Seki and M. Pollefeys, "Sgm-nets: Semi-global matching with neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 231–240, 2017.
2017
-
[67]
Occlusion aware stereo matching via cooperative unsupervised learning,
A. Li and Z. Yuan, "Occlusion aware stereo matching via cooperative unsupervised learning," in Asian Conference on Computer Vision (ACCV), pp. 197–213, Springer, 2018.
2018
-
[68]
Digging into uncertainty-based pseudo-label for robust stereo matching,
Z. Shen, X. Song, Y. Dai, D. Zhou, Z. Rao, and L. Zhang, "Digging into uncertainty-based pseudo-label for robust stereo matching," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 30, no. 2, pp. 1–18, 2023.
2023
-
[69]
Los: Local structure-guided stereo matching,
K. Li, L. Wang, Y. Zhang, K. Xue, S. Zhou, and Y. Guo, "Los: Local structure-guided stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19746–19756, 2024.
2024
-
[70]
Mocha-stereo: Motif channel attention network for stereo matching,
Z. Chen, W. Long, H. Yao, Y. Zhang, B. Wang, Y. Qin, and J. Wu, "Mocha-stereo: Motif channel attention network for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27768–27777, 2024.
2024
-
[71]
Neural markov random field for stereo matching,
T. Guan, C. Wang, and Y.-H. Liu, "Neural markov random field for stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5459–5469, 2024.
2024
-
[72]
High-resolution stereo datasets with subpixel-accurate ground truth,
D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, "High-resolution stereo datasets with subpixel-accurate ground truth," in German Conference on Pattern Recognition (GCPR), pp. 31–42, Springer, 2014.
2014
-
[73]
A multi-view stereo benchmark with high-resolution images and multi-camera videos,
T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A multi-view stereo benchmark with high-resolution images and multi-camera videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2538–2547, 2017.
2017
-
[74]
Attention concatenation volume for accurate and efficient stereo matching,
G. Xu, J. Cheng, P. Guo, and X. Yang, "Attention concatenation volume for accurate and efficient stereo matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12981–12990, 2022.
2022
-
[75]
Unambiguous pyramid cost volumes fusion for stereo matching,
Q. Chen, B. Ge, and J. Quan, "Unambiguous pyramid cost volumes fusion for stereo matching," IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2023.
2023
-
[76]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning (ICML), pp. 8748–8763, PMLR, 2021.
2021
-
[77]
DeepDriving: Learning affordance for direct perception in autonomous driving,
C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2722–2730, 2015.
2015
-
[78]
Open challenges in deep stereo: the booster dataset,
P. Zama Ramirez, F. Tosi, M. Poggi, S. Salti, L. Di Stefano, and S. Mattoccia, "Open challenges in deep stereo: the booster dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
2022
-
[79]
Virtual kitti 2,
Y. Cabon, N. Murray, and M. Humenberger, "Virtual kitti 2," arXiv preprint arXiv:2001.10773, 2020.
2020