arxiv: 1906.10399 · v1 · pith:SQPMH3HQnew · submitted 2019-06-25 · 💻 cs.CV

End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching

Pith reviewed 2026-05-25 17:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo matching · disparity estimation · multi-scale features · convolutional neural network · guidance mechanism · consistency check · Scene Flow · KITTI

The pith

A multi-scale features network fuses semantic context with fine details to improve stereo disparity estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MSFNet as an end-to-end CNN for stereo matching that addresses the limited fusion of contextual semantic information and fine details in prior networks. It encodes multi-scale features through a structure that merges semantic and detail information via both element-wise addition and concatenation, adds a guidance mechanism to emphasize unreliable regions, and formulates consistency checking as an error map derived from low-stage features to refine the initial disparity. A sympathetic reader would care because stereo matching supplies depth for 3D vision systems, and better automatic handling of scale and reliability could reduce reliance on manual post-processing. If the claims hold, the approach would deliver higher accuracy on standard benchmarks by letting the network focus computation where it matters most.

Core claim

The central claim is that the Multi-scale Features Network (MSFNet) achieves state-of-the-art stereo matching performance through the following design choices. It encodes rich semantic information and fine-grained details by fusing multi-scale features, combining element-wise addition with concatenation to merge semantics with details. It introduces a guidance mechanism that focuses the network on unreliable regions. It formulates the consistency check as an error map computed from low-stage features. Finally, it refines the initial disparity by checking consistency between the left feature and the synthetic left feature.
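The consistency step named in the core claim can be pictured with a toy example. The sketch below is not the authors' implementation. It assumes single-channel features on one scanline, integer disparities, the common convention that a left pixel at column x matches the right pixel at column x - d, and an absolute difference as the error measure. The names `warp_right_to_left` and `error_map` are hypothetical.

```python
def warp_right_to_left(right_row, disparity_row, fill=0.0):
    """Synthesize a left-view row by sampling the right row at x - d."""
    width = len(right_row)
    synthetic_left = []
    for x, d in enumerate(disparity_row):
        src = x - d  # assumed convention: left x corresponds to right x - d
        synthetic_left.append(right_row[src] if 0 <= src < width else fill)
    return synthetic_left


def error_map(left_row, synthetic_left_row):
    """Per-pixel absolute difference; large values flag suspect disparities."""
    return [abs(a - b) for a, b in zip(left_row, synthetic_left_row)]


# Toy scanline: the left row is the right row shifted right by 2 pixels.
right = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
left = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0]
good_disparity = [2, 2, 2, 2, 2, 2]
bad_disparity = [2, 2, 2, 0, 2, 2]  # one wrong estimate at x = 3

good_errors = error_map(left, warp_right_to_left(right, good_disparity))
bad_errors = error_map(left, warp_right_to_left(right, bad_disparity))
```

With the correct disparity the error map is zero everywhere, while the single wrong estimate produces a nonzero entry at its own position. That localized signal is what a refinement stage could use to decide where to correct the initial disparity.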

What carries the argument

The Multi-scale Features Network (MSFNet), which fuses multi-scale features to merge semantic information with details and relies on a guidance mechanism and consistency-based refinement.

If this is right

  • The network achieves state-of-the-art performance on the Scene Flow and KITTI 2015 benchmarks.
  • The guidance mechanism directs the network to allocate more capacity to unreliable disparity regions automatically.
  • Consistency checking between left and synthetic left features refines the initial disparity estimate using fine-grained low-stage details.
  • Combining element-wise addition with concatenation improves the integration of semantic context and local details over either operation alone.
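The last point contrasts two ways of merging feature maps. As a rough illustration only, the sketch below represents a feature map as a list of channels, each a flat list of values, and assumes both inputs already share the same spatial size and channel count. How MSFNet actually wires the two operations together is not described here, so the `fuse` function (concatenating the summed map with the detail map) is a hypothetical arrangement.

```python
def add_features(a, b):
    """Element-wise addition: blends the inputs, channel count unchanged."""
    assert len(a) == len(b), "addition needs equal channel counts"
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def concat_features(a, b):
    """Channel-wise concatenation: keeps both inputs intact side by side."""
    return a + b


def fuse(semantic, detail):
    """Hypothetical fusion: concatenate the summed map with the detail map."""
    return concat_features(add_features(semantic, detail), detail)


# Two feature maps with 2 channels each over 3 spatial positions.
semantic = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
detail = [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
fused = fuse(semantic, detail)
```

Addition is compact but irreversibly mixes the two sources. Concatenation preserves each source at the cost of more channels. Using both gives later layers a blended view and an untouched copy of the details.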

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion and guidance pattern could be adapted to other dense prediction problems such as optical flow or semantic segmentation where scale balance is critical.
  • If the error-map consistency step proves stable, it might serve as a lightweight alternative to traditional left-right consistency checks in deployed stereo systems.
  • Further tests on datasets with domain shift, such as varying illumination or sensor noise, would clarify how much the automatic focus mechanism reduces the need for manual tuning.
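For reference, the traditional left-right consistency check mentioned above is usually a fixed post-processing rule rather than a learned component. The sketch below shows one common form on a single scanline, assuming integer disparities, the convention that left column x matches right column x - d, and a tolerance of one pixel. The function name and threshold are illustrative choices, not taken from the paper.

```python
def left_right_consistent(disp_left, disp_right, threshold=1):
    """Mark left pixels whose disparity agrees with the right-view disparity."""
    width = len(disp_left)
    mask = []
    for x, d in enumerate(disp_left):
        xr = x - d  # matching column in the right image (assumed convention)
        # Pixels that map outside the right image cannot be verified.
        ok = 0 <= xr < width and abs(d - disp_right[xr]) <= threshold
        mask.append(ok)
    return mask


disp_left = [2, 2, 2, 2, 2, 2, 4, 2]   # one outlier at x = 6
disp_right = [2, 2, 2, 2, 2, 2, 2, 2]
mask = left_right_consistent(disp_left, disp_right)
```

The rule needs a second disparity map and a hand-set threshold, which is the cost a learned error map would aim to avoid.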

Load-bearing premise

The multi-scale fusion, guidance mechanism, and consistency check will produce reliable gains on real-world data without extensive per-dataset hyperparameter search or post-hoc exclusions.

What would settle it

A controlled test on a new stereo dataset where the full MSFNet pipeline, trained end-to-end with the described components, fails to exceed the accuracy of prior single-scale or non-guided networks under identical evaluation protocols.

Figures

Figures reproduced from arXiv: 1906.10399 by Haihua Lu, Li Zhang, Quanhong Wang, Yong Zhao.

Figure 1: Architecture overview of proposed MSFNet. All of the four steps for stereo matching are incorporated into a single network. Given a stereo pair, the disparity map is the output. The outputs of the first module MSFM are Local Prior Feature and Local Details. Detailed structures of SGRM is displayed in …
Figure 2: Architecture overview of our submodule SGRM, which stacks three Guidance Residual Modules. It takes the Local Details and the initial disparity as inputs to produce the residual for the initial disparity. The final disparity map is represented by the summation of the initial disparity and residual. checking. Inspired by FlowNet (Dosovitskiy et al. (2015)), we use the warping operation as a transfer to obt…
Figure 3: Comparisons of different structures for stereo matching on Scene Flow dataset.
Figure 4: Qualitative results on KITTI 2015 test set. From left: left stereo input image, disparity prediction, error map. The reason why the EdgeStereo method is more accurate than our model is that the edge information can provide more details of objects. In other words, the detection task serves as an auxiliary task in the EdgeStereo method. At the same time, however, with the addition of edge detection network, the complexity and com…
Original abstract

Deep neural networks have shown excellent performance in stereo matching task. Recently CNN-based methods have shown that stereo matching can be formulated as a supervised learning task. However, less attention is paid on the fusion of contextual semantic information and details. To tackle this problem, we propose a network for disparity estimation based on abundant contextual details and semantic information, called Multi-scale Features Network (MSFNet). First, we design a new structure to encode rich semantic information and fine-grained details by fusing multi-scale features. And we combine the advantages of element-wise addition and concatenation, which is conducive to merge semantic information with details. Second, a guidance mechanism is introduced to guide the network to automatically focus more on the unreliable regions. Third, we formulate the consistency check as an error map, obtained by the low stage features with fine-grained details. Finally, we adopt the consistency checking between the left feature and the synthetic left feature to refine the initial disparity. Experiments on Scene Flow and KITTI 2015 benchmark demonstrated that the proposed method can achieve the state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MSFNet, an end-to-end CNN for stereo matching. It encodes rich semantic information and fine-grained details via multi-scale feature fusion that combines element-wise addition with concatenation, introduces a guidance mechanism to focus on unreliable regions, and formulates a consistency check as an error map derived from low-stage features to refine the initial disparity. Experiments are reported to achieve state-of-the-art performance on the Scene Flow and KITTI 2015 benchmarks.

Significance. If the quantitative results hold under fixed hyperparameters and the proposed modules are shown to contribute via controlled experiments, the work would provide a practical advance in fusing contextual and detail information for disparity estimation. The empirical design does not rely on parameter-free derivations or machine-checked proofs, so significance rests entirely on the robustness and reproducibility of the benchmark gains.

major comments (2)
  1. [§4 (Experiments)] The central SOTA claim on both Scene Flow and KITTI 2015 is load-bearing for the contribution, yet no ablation studies are described that isolate the multi-scale fusion (addition + concatenation), guidance mechanism, or consistency-check error map. Without these, gains cannot be distinguished from per-dataset hyperparameter search or post-processing choices.
  2. [§3 (Method) and §4] The manuscript does not state whether hyperparameters (learning rate, fusion weights, consistency threshold) were held fixed across the synthetic Scene Flow and real KITTI 2015 datasets or whether selective post-hoc exclusions were applied; this directly affects the generalization assumption underlying the SOTA result.
minor comments (2)
  1. The abstract would be strengthened by reporting at least the primary error metrics (e.g., >3 px or EPE) achieved on each benchmark to allow immediate assessment of the SOTA claim.
  2. [§3 (Method)] Notation for the guidance mechanism and error-map formulation should be made explicit (e.g., define the mathematical operation that produces the error map from low-stage features) to improve reproducibility.
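The two metrics named in the first minor comment are simple to state. The sketch below computes end-point error (EPE, the mean absolute disparity error) and the fraction of pixels whose error exceeds 3 px, skipping pixels without ground truth. It is a simplified illustration: the function names are hypothetical, `None` is an assumed marker for missing ground truth, and the official KITTI 2015 outlier definition additionally requires the error to exceed 5% of the true disparity, which is omitted here.

```python
def valid_errors(pred, truth):
    """Absolute disparity errors at pixels that have ground truth."""
    return [abs(p - t) for p, t in zip(pred, truth) if t is not None]


def end_point_error(pred, truth):
    """EPE: mean absolute disparity error over valid pixels."""
    errors = valid_errors(pred, truth)
    return sum(errors) / len(errors)


def bad_pixel_rate(pred, truth, threshold=3.0):
    """Fraction of valid pixels whose error exceeds the threshold in pixels."""
    errors = valid_errors(pred, truth)
    return sum(e > threshold for e in errors) / len(errors)


pred = [10.0, 12.5, 30.0, 7.0, 50.0]
truth = [10.5, 12.0, 25.0, None, 50.0]  # None marks missing ground truth

epe = end_point_error(pred, truth)   # (0.5 + 0.5 + 5.0 + 0.0) / 4
rate = bad_pixel_rate(pred, truth)   # one of four valid pixels is off by > 3 px
```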

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for clearer experimental validation. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The central SOTA claim on both Scene Flow and KITTI 2015 is load-bearing for the contribution, yet no ablation studies are described that isolate the multi-scale fusion (addition + concatenation), guidance mechanism, or consistency-check error map. Without these, gains cannot be distinguished from per-dataset hyperparameter search or post-processing choices.

    Authors: We agree that explicit ablation studies isolating each proposed component would strengthen the paper and better support the SOTA claims. The current manuscript reports overall performance but does not include controlled ablations for the multi-scale fusion strategy, guidance mechanism, or consistency-check error map. We will add these experiments in the revision, using the same training protocol to quantify the contribution of each module. revision: yes

  2. Referee: [§3 (Method) and §4] The manuscript does not state whether hyperparameters (learning rate, fusion weights, consistency threshold) were held fixed across the synthetic Scene Flow and real KITTI 2015 datasets or whether selective post-hoc exclusions were applied; this directly affects the generalization assumption underlying the SOTA result.

    Authors: We will revise §4 to explicitly state that all hyperparameters, including learning rate, fusion weights, and consistency threshold, were held fixed across both datasets with no dataset-specific tuning or selective post-hoc exclusions applied. This setup was used to support the generalization claim; the revision will make the protocol fully transparent. revision: yes

Circularity Check

0 steps flagged

Empirical CNN architecture exhibits no circular derivation

full rationale

The paper proposes MSFNet as an end-to-end learned multi-scale CNN for stereo matching, describing feature fusion via addition+concatenation, a guidance mechanism, and consistency-check error maps. No equations, parameters fitted to a target quantity, or self-citation chains are invoked to derive performance; results are reported from direct training and evaluation on Scene Flow and KITTI 2015. The central claim is therefore an empirical observation rather than a reduction of any predicted quantity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are named.

pith-pipeline@v0.9.0 · 5718 in / 970 out tokens · 21005 ms · 2026-05-25T17:00:44.368426+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. [1]

    Signature verification using a "siamese" time delay neural network

    Jane Bromley, Isabelle Guyon, Yann LeCun, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In International Conference on Neural Information Processing Systems, pages 737--744, 1993

  2. [2]

    Multiscale convolutional neural networks for vision-based classification of cells

    Pierre Buyssens, Abderrahim Elmoataz, and Olivier Lézoray. Multiscale convolutional neural networks for vision-based classification of cells. 7725:342--352, 2012

  3. [3]

    Semantic image segmentation with deep convolutional nets and fully connected crfs

    Liang Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. Computer Science, (4):357--361, 2014

  4. [4]

    Attention to scale: Scale-aware semantic image segmentation

    Liang Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L. Yuille. Attention to scale: Scale-aware semantic image segmentation. pages 3640--3649, 2015

  5. [5]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. 2018

  6. [6]

    Flownet: Learning optical flow with convolutional networks

    Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pages 2758--2766, 2015

  7. [7]

    Detect, replace, refine: Deep structured prediction for pixel wise labeling

    Spyros Gidaris and Nikos Komodakis. Detect, replace, refine: Deep structured prediction for pixel wise labeling. pages 7187--7196, 2016

  8. [8]

    Displets: Resolving stereo ambiguities using object knowledge

    Fatma Guney and Andreas Geiger. Displets: Resolving stereo ambiguities using object knowledge. In Computer Vision and Pattern Recognition, pages 4165--4175, 2015

  9. [9]

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell, 37(9):1904--1916, 2014

  10. [10]

    Accurate and efficient stereo processing by semi-global matching and mutual information

    H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, pages 807--814, 2005

  11. [11]

    End-to-end learning of geometry and context for deep stereo regression

    Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, and Peter Henry. End-to-end learning of geometry and context for deep stereo regression. pages 66--75, 2017

  12. [12]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Computer Science, 2014

  13. [13]

    Efficient deep learning for stereo matching

    Wenjie Luo, Alexander G. Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5695--5703, 2016

  14. [14]

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

    Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Computer Vision and Pattern Recognition, pages 4040--4048, 2016

  15. [15]

    Cascade residual learning: A two-stage convolutional neural network for stereo matching

    Jiahao Pang, Wenxiu Sun, Jimmy Sj Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. 2017

  16. [16]

    Look wider to match image patches with convolutional neural networks

    Haesol Park and Kyoung Mu Lee. Look wider to match image patches with convolutional neural networks. IEEE Signal Processing Letters, PP(99):1--1, 2017

  17. [17]

    Yolov3: An incremental improvement

    Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. 2018

  18. [18]

    Improved stereo matching with constant highway networks and reflective confidence learning

    Amit Shaked and Lior Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. pages 6901--6910, 2016

  19. [19]

    Edgestereo: A context integrated residual pyramid network for stereo matching

    Xiao Song, Xu Zhao, Hanwen Hu, and Liangji Fang. Edgestereo: A context integrated residual pyramid network for stereo matching. 2018

  20. [20]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1-3):3--18, 2017

  21. [21]

    Accurate optical flow via direct cost volume processing

    Jia Xu, Rene Ranftl, and Vladlen Koltun. Accurate optical flow via direct cost volume processing. pages 5807--5815, 2017

  22. [22]

    Efficient joint segmentation, occlusion labeling, stereo and flow estimation

    Koichiro Yamaguchi, David Mcallester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pages 756--771, 2014

  23. [23]

    Deep stereo matching with explicit cost aggregation sub-architecture

    Lidong Yu, Yucheng Wang, Yuwei Wu, and Yunde Jia. Deep stereo matching with explicit cost aggregation sub-architecture. 2018

  24. [24]

    Stereo matching by training a convolutional neural network to compare image patches

    Jure Žbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. 17(1):2287--2318, 2015