Self-Supervised Learning for Stereo Matching with Self-Improving Ability

Yiran Zhong, Yuchao Dai, Hongdong Li

classification cs.CV

keywords stereo, matching, disparity, many, maps, deep-learning, dense, different

Existing deep-learning based dense stereo matching methods often rely on ground-truth disparity maps as training signals, which are, however, not always available. In this paper, we design a simple convolutional neural network architecture that learns to compute dense disparity maps directly from the stereo inputs. Training is performed end-to-end without the need for ground-truth disparity maps. The idea is to use image warping error (instead of disparity-map residuals) as the loss function to drive the learning process, aiming to find a depth map that minimizes the warping error. While this is a simple concept well known in stereo matching, making it work in a deep-learning framework requires overcoming many non-trivial challenges, and in this work we provide effective solutions. Our network is self-adaptive to unseen imagery as well as to different camera settings. Experiments on the KITTI and Middlebury stereo benchmark datasets show that our method outperforms many state-of-the-art stereo matching methods by a margin while also being significantly faster.
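The warping-error idea in the abstract can be illustrated with a small sketch. This is not the authors' network or loss, only a minimal NumPy illustration under stated assumptions: rectified grayscale images, a left-view disparity map, linear interpolation along each row, and a mean absolute difference as the error. The function names `warp_right_to_left` and `warping_error` are hypothetical.

```python
import numpy as np


def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x, y).

    Assumes rectified grayscale images of shape (H, W) and a left-view
    disparity map of the same shape (an assumption, not from the paper).
    """
    h, w = disparity.shape
    # Source column in the right image for every left-image pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)  # clamp samples that fall outside the image
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation between the two neighbouring columns.
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]


def warping_error(left, right, disparity):
    """Mean absolute difference between the left image and its reconstruction."""
    recon = warp_right_to_left(right, disparity)
    return float(np.mean(np.abs(left - recon)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left = rng.random((8, 32))
    # Synthetic right view: every pixel is shifted by a constant disparity of 3.
    right = np.roll(left, -3, axis=1)
    good = warping_error(left, right, np.full(left.shape, 3.0))
    bad = warping_error(left, right, np.zeros(left.shape))
    print(f"error at true disparity: {good:.4f}, at zero disparity: {bad:.4f}")
```

In a training setting, the same quantity would be computed with differentiable operations so that its gradient with respect to the predicted disparity can drive learning. The correct disparity gives the lower error in this toy example, which is the signal that replaces ground-truth supervision.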

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SMFormer: Empowering Self-supervised Stereo Matching via Foundation Models and Data Augmentation
cs.CV 2026-04 unverdicted novelty 5.0

SMFormer achieves state-of-the-art self-supervised stereo matching by using vision foundation models for disturbance-resistant features and data augmentation to enforce output consistency, rivaling or exceeding some s...
Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
cs.CV 2026-04 unverdicted novelty 5.0

GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.