End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching

Haihua Lu; Li Zhang; Quanhong Wang; Yong Zhao

arxiv: 1906.10399 · v1 · pith:SQPMH3HQnew · submitted 2019-06-25 · 💻 cs.CV

Pith reviewed 2026-05-25 17:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords stereo matching · disparity estimation · multi-scale features · convolutional neural network · guidance mechanism · consistency check · Scene Flow · KITTI

The pith

A multi-scale features network fuses semantic context with fine details to improve stereo disparity estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MSFNet as an end-to-end CNN for stereo matching that addresses the limited fusion of contextual semantic information and fine details in prior networks. It encodes multi-scale features through a structure that merges semantic and detail information via both element-wise addition and concatenation, adds a guidance mechanism to emphasize unreliable regions, and formulates consistency checking as an error map derived from low-stage features to refine the initial disparity. A sympathetic reader would care because stereo matching supplies depth for 3D vision systems, and better automatic handling of scale and reliability could reduce reliance on manual post-processing. If the claims hold, the approach would deliver higher accuracy on standard benchmarks by letting the network focus computation where it matters most.

Core claim

The central claim is that the Multi-scale Features Network (MSFNet) achieves state-of-the-art stereo matching by combining four ingredients: (1) a structure that fuses multi-scale features to encode both rich semantic information and fine-grained details, merging the two through element-wise addition and concatenation together; (2) a guidance mechanism that focuses the network on unreliable regions; (3) a consistency check formulated as an error map computed from low-stage features; and (4) refinement of the initial disparity through consistency checking between the left feature and the synthetic left feature.
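To make the addition-plus-concatenation idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour upsampling, and the choice to stack the sum together with both inputs are all assumptions made for illustration. The abstract only says that the two operations are combined.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_add_concat(detail, semantic, factor):
    """Fuse a fine (C, H, W) map with a coarse (C, H/factor, W/factor) map.

    Hypothetical layout: the element-wise sum and the two inputs are
    stacked along the channel axis, so the output has 3*C channels.
    """
    semantic_up = upsample_nearest(semantic, factor)
    if semantic_up.shape != detail.shape:
        raise ValueError("shapes must match after upsampling")
    added = detail + semantic_up  # element-wise addition
    # Concatenation keeps the unmixed inputs next to their sum.
    return np.concatenate([added, detail, semantic_up], axis=0)

# Toy inputs: a high-resolution "detail" map and a low-resolution "semantic" map.
detail = np.ones((4, 8, 8), dtype=np.float32)
semantic = np.full((4, 4, 4), 2.0, dtype=np.float32)
fused = fuse_add_concat(detail, semantic, factor=2)
print(fused.shape)
```

In a trained network a convolution would normally follow the concatenation to mix the channels. That step is omitted here because the page does not describe it.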

What carries the argument

The argument rests on the Multi-scale Features Network (MSFNet) itself, which fuses multi-scale features to merge semantic information with details and adds a guidance mechanism and consistency-based refinement.

If this is right

The network achieves state-of-the-art performance on the Scene Flow and KITTI 2015 benchmarks.
The guidance mechanism directs the network to allocate more capacity to unreliable disparity regions automatically.
Consistency checking between left and synthetic left features refines the initial disparity estimate using fine-grained low-stage details.
Combining element-wise addition with concatenation improves the integration of semantic context and local details over either operation alone.
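The consistency item in the list above can be sketched as follows, under stated assumptions. The right features are warped to the left view with the current disparity to obtain a "synthetic left feature", and the per-pixel difference to the real left feature serves as the error map. The L1 difference, the linear interpolation, the border clamping, and all function names are illustrative choices, not details taken from the paper. In MSFNet the residual is predicted by a learned module, whereas here it is simply passed in.

```python
import numpy as np

def warp_right_to_left(right_feat, disparity):
    """Synthesize left features by sampling right features at x - d.

    right_feat: (C, H, W); disparity: (H, W) in pixels.
    Linear interpolation along x; out-of-image samples clamp to the border.
    """
    c, h, w = right_feat.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right_feat[:, rows, x0] + frac * right_feat[:, rows, x1]

def consistency_error_map(left_feat, right_feat, disparity):
    """Mean absolute difference between left and synthetic left features, (H, W)."""
    synth_left = warp_right_to_left(right_feat, disparity)
    return np.abs(left_feat - synth_left).mean(axis=0)

def refine_disparity(initial, residual):
    """Final disparity as the sum of the initial estimate and a residual."""
    return initial + residual

# Toy scene: right features are a ramp in x, true disparity is 2 px everywhere.
right_feat = np.tile(np.arange(8, dtype=np.float64), (1, 2, 1))  # (1, 2, 8)
true_disp = np.full((2, 8), 2.0)
left_feat = warp_right_to_left(right_feat, true_disp)

initial = np.zeros((2, 8))  # deliberately wrong initial estimate
err_initial = consistency_error_map(left_feat, right_feat, initial)
refined = refine_disparity(initial, residual=true_disp - initial)
err_refined = consistency_error_map(left_feat, right_feat, refined)
print(err_initial.mean(), err_refined.mean())
```

The error map is large where the initial disparity is wrong and vanishes once the residual corrects it, which is the signal a guidance mechanism could use to locate unreliable regions.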

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion and guidance pattern could be adapted to other dense prediction problems such as optical flow or semantic segmentation where scale balance is critical.
If the error-map consistency step proves stable, it might serve as a lightweight alternative to traditional left-right consistency checks in deployed stereo systems.
Further tests on datasets with domain shift, such as varying illumination or sensor noise, would clarify how much the automatic focus mechanism reduces the need for manual tuning.
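For comparison with the second extension above, a traditional left-right consistency check can be sketched like this. It is the classic post-processing step, not part of MSFNet. The function name and the 1-pixel default threshold are assumptions for illustration.

```python
import numpy as np

def left_right_consistency_mask(disp_left, disp_right, threshold=1.0):
    """Return True where left and right disparity maps agree within threshold.

    Each left pixel x maps to x - d_left(x) in the right image; the right
    disparity found there should be close to d_left(x).
    """
    h, w = disp_left.shape
    xs = np.arange(w)[None, :]
    xr = np.rint(xs - disp_left).astype(int)
    in_bounds = (xr >= 0) & (xr < w)
    xr_safe = np.clip(xr, 0, w - 1)
    rows = np.arange(h)[:, None]
    disp_right_at_match = disp_right[rows, xr_safe]
    agree = np.abs(disp_left - disp_right_at_match) <= threshold
    return agree & in_bounds

disp_left = np.full((1, 6), 2.0)
disp_right = np.full((1, 6), 2.0)
disp_left[0, 4] = 0.0  # inject one inconsistent estimate
mask = left_right_consistency_mask(disp_left, disp_right)
print(mask)
```

Unlike the learned error map, this check needs a second disparity map for the right view and produces a hard mask rather than a differentiable signal.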

Load-bearing premise

The multi-scale fusion, guidance mechanism, and consistency check will produce reliable gains on real-world data without extensive per-dataset hyperparameter search or post-hoc exclusions.

What would settle it

A controlled test on a new stereo dataset where the full MSFNet pipeline, trained end-to-end with the described components, fails to exceed the accuracy of prior single-scale or non-guided networks under identical evaluation protocols.
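An "identical evaluation protocol" in practice means computing the same metrics over the same valid pixels for every method. The sketch below shows two common stereo metrics, end-point error and a thresholded bad-pixel rate. The function names and the toy values are illustrative. The official KITTI 2015 outlier metric also applies a relative-error criterion, which this simplified version leaves out.

```python
import numpy as np

def end_point_error(pred, gt, valid=None):
    """Mean absolute disparity error over valid pixels."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float(np.abs(pred - gt)[valid].mean())

def bad_pixel_rate(pred, gt, threshold=3.0, valid=None):
    """Fraction of valid pixels whose absolute error exceeds threshold (px)."""
    if valid is None:
        valid = np.ones(gt.shape, dtype=bool)
    return float((np.abs(pred - gt)[valid] > threshold).mean())

# Toy disparity maps, not results from the paper.
gt = np.array([[10.0, 20.0], [30.0, 40.0]])
pred = np.array([[10.0, 24.0], [31.0, 40.0]])
epe = end_point_error(pred, gt)
bad3 = bad_pixel_rate(pred, gt, threshold=3.0)
print(epe, bad3)
```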

Figures

Figures reproduced from arXiv: 1906.10399 by Haihua Lu, Li Zhang, Quanhong Wang, Yong Zhao.

**Figure 1.** Architecture overview of proposed MSFNet. All of the four steps for stereo matching are incorporated into a single network. Given a stereo pair, the disparity map is the output. The outputs of the first module MSFM are Local Prior Feature and Local Details. Detailed structures of SGRM is displayed in …

**Figure 2.** Architecture overview of our submodule SGRM, which stacks three Guidance Residual Module. It takes the Local Details and the initial disparity as inputs to produce the residual for the initial disparity. The final disparity map is represented by the summation of the initial disparity and residual. Inspired by FlowNet (Dosovitskiy et al. (2015)), we use the warping operation as a transfer to obt…

**Figure 3.** Comparisons of different structures for stereo matching on Scene Flow dataset.

**Figure 4.** Qualitative results on KITTI 2015 test set. From left: left stereo input image, disparity prediction, error map. The reason why the EdgeStereo method is more accurate than our model is that the edge information can provide more details of objects. In other words, the detection task serves as an auxiliary task in the EdgeStereo method. At the same time, however, with the addition of edge detection network, the complexity and com…

Original abstract

Deep neural networks have shown excellent performance in stereo matching task. Recently CNN-based methods have shown that stereo matching can be formulated as a supervised learning task. However, less attention is paid on the fusion of contextual semantic information and details. To tackle this problem, we propose a network for disparity estimation based on abundant contextual details and semantic information, called Multi-scale Features Network (MSFNet). First, we design a new structure to encode rich semantic information and fine-grained details by fusing multi-scale features. And we combine the advantages of element-wise addition and concatenation, which is conducive to merge semantic information with details. Second, a guidance mechanism is introduced to guide the network to automatically focus more on the unreliable regions. Third, we formulate the consistency check as an error map, obtained by the low stage features with fine-grained details. Finally, we adopt the consistency checking between the left feature and the synthetic left feature to refine the initial disparity. Experiments on Scene Flow and KITTI 2015 benchmark demonstrated that the proposed method can achieve the state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MSFNet combines multi-scale fusion via add+concat, a guidance module, and consistency-as-error-map for stereo matching, but the SOTA claim on Scene Flow and KITTI rests on unshown ablations and fixed-hyperparameter results.

The paper's core idea is MSFNet, an end-to-end network that encodes multi-scale features by mixing element-wise addition with concatenation, routes a guidance signal to emphasize unreliable regions, and derives a consistency error map from low-stage features to refine the initial disparity estimate. This is a straightforward engineering assembly of existing stereo-CNN building blocks rather than a new theoretical framing. The description is clear enough that someone could reimplement the architecture from the text. What the work does reasonably well is keep everything differentiable and avoid separate post-processing stages at inference time.

The soft spots sit in the validation. The abstract asserts state-of-the-art numbers on Scene Flow and KITTI 2015, yet supplies no tables, no per-component ablations, no error bars, and no statement that hyperparameters stayed fixed across the two datasets. The stress-test concern about unverified generalization therefore lands: without those controls it is hard to tell whether the reported gains come from the proposed modules or from benchmark-specific tuning. If the full manuscript contains the missing comparisons and they survive scrutiny, the contribution becomes more solid; on the current evidence the results look under-supported.

This paper is for CV groups that maintain stereo pipelines and need another baseline to compare against. A reader already working on KITTI submissions might extract the architecture description and try it, but the work is unlikely to shift practice on its own. I would send it to peer review because the task is central and the architecture is reproducible in principle, though the referees will need to see the quantitative evidence before any stronger claim can be accepted.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MSFNet, an end-to-end CNN for stereo matching. It encodes rich semantic information and fine-grained details via multi-scale feature fusion that combines element-wise addition with concatenation, introduces a guidance mechanism to focus on unreliable regions, and formulates a consistency check as an error map derived from low-stage features to refine the initial disparity. Experiments are reported to achieve state-of-the-art performance on the Scene Flow and KITTI 2015 benchmarks.

Significance. If the quantitative results hold under fixed hyperparameters and the proposed modules are shown to contribute via controlled experiments, the work would provide a practical advance in fusing contextual and detail information for disparity estimation. The empirical design does not rely on parameter-free derivations or machine-checked proofs, so significance rests entirely on the robustness and reproducibility of the benchmark gains.

major comments (2)

[§4 (Experiments)] The central SOTA claim on both Scene Flow and KITTI 2015 is load-bearing for the contribution, yet no ablation studies are described that isolate the multi-scale fusion (addition + concatenation), guidance mechanism, or consistency-check error map. Without these, gains cannot be distinguished from per-dataset hyperparameter search or post-processing choices.
[§3 (Method) and §4] The manuscript does not state whether hyperparameters (learning rate, fusion weights, consistency threshold) were held fixed across the synthetic Scene Flow and real KITTI 2015 datasets or whether selective post-hoc exclusions were applied; this directly affects the generalization assumption underlying the SOTA result.
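One way to set up the controlled comparison that both major comments ask for is to enumerate every on/off combination of the three modules while sharing a single hyperparameter set. The sketch below is only a planning aid. The module keys, the `ablation_grid` helper, and the hyperparameter values are hypothetical placeholders, not settings reported by the paper.

```python
from itertools import product

# Hypothetical switch names for the three proposed components.
MODULES = ("multi_scale_fusion", "guidance", "consistency_error_map")

def ablation_grid(shared_hparams):
    """Build one run config per on/off combination of the modules.

    Every config reuses the same shared hyperparameters, so differences
    between runs can be attributed to the toggled modules alone.
    """
    runs = []
    for flags in product((False, True), repeat=len(MODULES)):
        config = dict(shared_hparams)
        config.update(zip(MODULES, flags))
        runs.append(config)
    return runs

# Placeholder values, not taken from the paper.
shared = {"learning_rate": 1e-3, "seed": 0}
runs = ablation_grid(shared)
for run in runs:
    print(run)
```

Running the same grid on Scene Flow and KITTI 2015 without changing `shared` would also answer the fixed-hyperparameter question directly.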

minor comments (2)

The abstract would be strengthened by reporting at least the primary error metrics (e.g., >3 px or EPE) achieved on each benchmark to allow immediate assessment of the SOTA claim.
[§3 (Method)] Notation for the guidance mechanism and error-map formulation should be made explicit (e.g., define the mathematical operation that produces the error map from low-stage features) to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for clearer experimental validation. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Referee: [§4 (Experiments)] The central SOTA claim on both Scene Flow and KITTI 2015 is load-bearing for the contribution, yet no ablation studies are described that isolate the multi-scale fusion (addition + concatenation), guidance mechanism, or consistency-check error map. Without these, gains cannot be distinguished from per-dataset hyperparameter search or post-processing choices.

Authors: We agree that explicit ablation studies isolating each proposed component would strengthen the paper and better support the SOTA claims. The current manuscript reports overall performance but does not include controlled ablations for the multi-scale fusion strategy, guidance mechanism, or consistency-check error map. We will add these experiments in the revision, using the same training protocol to quantify the contribution of each module.
revision: yes

Referee: [§3 (Method) and §4] The manuscript does not state whether hyperparameters (learning rate, fusion weights, consistency threshold) were held fixed across the synthetic Scene Flow and real KITTI 2015 datasets or whether selective post-hoc exclusions were applied; this directly affects the generalization assumption underlying the SOTA result.

Authors: We will revise §4 to explicitly state that all hyperparameters, including learning rate, fusion weights, and consistency threshold, were held fixed across both datasets with no dataset-specific tuning or selective post-hoc exclusions applied. This setup was used to support the generalization claim; the revision will make the protocol fully transparent.
revision: yes

Circularity Check

0 steps flagged

Empirical CNN architecture exhibits no circular derivation

The paper proposes MSFNet as an end-to-end learned multi-scale CNN for stereo matching, describing feature fusion via addition+concatenation, a guidance mechanism, and consistency-check error maps. No equations, parameters fitted to a target quantity, or self-citation chains are invoked to derive performance; results are reported from direct training and evaluation on Scene Flow and KITTI 2015. The central claim is therefore an empirical observation rather than a reduction of any predicted quantity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are named.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1] Jane Bromley, Isabelle Guyon, Yann Lecun, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In International Conference on Neural Information Processing Systems, pages 737--744, 1993

[2] Pierre Buyssens, Abderrahim Elmoataz, and Olivier Lézoray. Multiscale convolutional neural networks for vision-based classification of cells. 7725:342--352, 2012

[3] Liang Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. Computer Science, (4):357--361, 2014

[4] Liang Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. pages 3640--3649, 2015

[5] Liang Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. 2018

[6] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pages 2758--2766, 2015

[7] Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. pages 7187--7196, 2016

[8] Fatma Guney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In Computer Vision and Pattern Recognition, pages 4165--4175, 2015

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell, 37(9):1904--1916, 2014

[10] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, pages 807--814, 2005

[11] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, and Peter Henry. End-to-end learning of geometry and context for deep stereo regression. pages 66--75, 2017

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Computer Science, 2014

[13] Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5695--5703, 2016

[14] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Computer Vision and Pattern Recognition, pages 4040--4048, 2016

[15] Jiahao Pang, Wenxiu Sun, Jimmy Sj Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. 2017

[16] Haesol Park and Kyoung Mu Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, PP(99):1--1, 2017

[17] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. 2018

[18] Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. pages 6901--6910, 2016

[19] Xiao Song, Xu Zhao, Hanwen Hu, and Liangji Fang. Edgestereo: A context integrated residual pyramid network for stereo matching. 2018

[20] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1-3):3--18, 2017

[21] Jia Xu, Rene Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. pages 5807--5815, 2017

[22] Koichiro Yamaguchi, David Mcallester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pages 756--771, 2014

[23] Lidong Yu, Yucheng Wang, Yuwei Wu, and Yunde Jia. Deep stereo matching with explicit cost aggregation sub-architecture. 2018

[24] Jure Žbontar and Yann Lecun. Stereo matching by training a convolutional neural network to compare image patches. 17(1):2287--2318, 2015
