pith. machine review for the scientific record.

arxiv: 2604.20393 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo matching · vision transformer · multi-granularity feature network · cost volume · recurrent unit · disparity estimation · local-global enhancement

The pith

MLG-Stereo adds multi-granularity features, a local-global cost volume, and a guided recurrent unit to ViT stereo matching so it can recover fine detail at arbitrary input resolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision Transformer methods for stereo matching have shown strong robustness and zero-shot performance but still lag CNNs at recovering fine detail and at processing images whose resolution differs from the training data. The paper introduces a full pipeline that keeps global context while adding explicit local handling at three stages: feature extraction, cost volume construction, and iterative disparity refinement. If these additions work as described, ViT-based stereo can match or exceed CNN accuracy on standard benchmarks without sacrificing scale flexibility. The central claim is that this systematic local-global design closes the remaining gap between the two families of networks.

Core claim

MLG-Stereo extends global modeling past the encoder by building a Multi-Granularity Feature Network that extracts both global context and local geometry from arbitrary-resolution inputs, a Local-Global Cost Volume that encodes both correlated local matches and global-aware information, and a Local-Global Guided Recurrent Unit that refines disparity locally while guided by global cues, producing competitive results on Middlebury and KITTI-2015 and leading results on KITTI-2012.
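
To make the staging concrete, here is a structural sketch in PyTorch. It is not the authors' implementation: module internals, channel widths, the fusion rule, and the update operator are placeholder assumptions, and a strided convolution stands in for the ViT branch. Only the three-stage data flow (multi-granularity features, then a cost volume, then iterative disparity refinement) follows the paper's description.

```python
import torch
import torch.nn as nn

class ToyMLGStereo(nn.Module):
    """Structural stand-in for the three stages; all internals are guesses."""
    def __init__(self, feat_ch=32, max_disp=24, iters=4):
        super().__init__()
        self.max_disp, self.iters = max_disp, iters
        # Stage 1: Multi-Granularity Feature Network stand-in. A strided conv
        # plays the role of the ViT/global branch; a second conv plays the
        # local-geometry branch. Both land on the same 1/4-resolution grid.
        self.global_branch = nn.Conv2d(3, feat_ch, 7, stride=4, padding=3)
        self.local_branch = nn.Conv2d(3, feat_ch, 3, stride=4, padding=1)
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 1)
        # Stage 3: a single conv standing in for the guided recurrent unit.
        self.update = nn.Conv2d(max_disp + 1, 1, 3, padding=1)

    def extract(self, img):
        g = self.global_branch(img)             # coarse context
        l = self.local_branch(img)              # finer geometry
        return self.fuse(torch.cat([g, l], dim=1))

    def cost_volume(self, fl, fr):
        # Stage 2: plain left-right correlation over candidate disparities.
        # The paper's Local-Global Cost Volume adds global-aware cues on top
        # of something like this; that addition is not reproduced here.
        b, c, h, w = fl.shape
        vol = fl.new_zeros(b, self.max_disp, h, w)
        vol[:, 0] = (fl * fr).mean(1)
        for d in range(1, self.max_disp):
            vol[:, d, :, d:] = (fl[..., d:] * fr[..., :-d]).mean(1)
        return vol

    def forward(self, left, right):
        fl, fr = self.extract(left), self.extract(right)
        vol = self.cost_volume(fl, fr)
        disp = vol.argmax(1, keepdim=True).float()   # winner-take-all init
        for _ in range(self.iters):                  # iterative refinement
            disp = disp + self.update(torch.cat([vol, disp], dim=1))
        return disp

left = right = torch.rand(1, 3, 64, 128)
print(ToyMLGStereo()(left, right).shape)  # torch.Size([1, 1, 16, 32])
```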

What carries the argument

The combination of Multi-Granularity Feature Network, Local-Global Cost Volume, and Local-Global Guided Recurrent Unit that carries global context through every stage while preserving local geometric detail.

If this is right

  • ViT stereo methods can now process test images at resolutions different from training without scale-induced errors.
  • Detail recovery in textured or occluded regions improves while global consistency is retained.
  • The same three-stage pattern can be reused on other dense prediction tasks that need both wide context and fine local accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same local-global staging might reduce the resolution sensitivity seen in ViT models for optical flow or monocular depth estimation.
  • A controlled test that varies input resolution continuously would directly quantify how much the new components reduce the training-inference scale gap; a minimal harness along those lines is sketched after this list.
  • Replacing the ViT backbone with a different global feature extractor could reveal whether the gains depend on the transformer architecture itself.
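
A minimal sketch of such a harness, under stated assumptions: `model(left, right)` is any stereo network returning a disparity map, and `pairs` is a list of (left, right, gt_disparity) tensors. The one domain fact it encodes is that disparity values scale with image width, so predictions made at scale s must be divided by s before comparison against full-resolution ground truth.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def resolution_sweep(model, pairs, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Mean EPE per inference scale; a flat curve would support the
    scale-bridging claim, a sharp rise away from 1.0 would undercut it."""
    results = {}
    for s in scales:
        errs = []
        for left, right, gt in pairs:          # (3,H,W), (3,H,W), (1,H,W)
            h, w = int(left.shape[-2] * s), int(left.shape[-1] * s)
            l = F.interpolate(left[None], size=(h, w), mode="bilinear",
                              align_corners=False)
            r = F.interpolate(right[None], size=(h, w), mode="bilinear",
                              align_corners=False)
            pred = model(l, r)                 # disparity in scaled pixels
            # resize back to the GT grid; disparity values shrink/grow
            # with width, so undo the factor s before comparing
            pred = F.interpolate(pred, size=gt.shape[-2:], mode="bilinear",
                                 align_corners=False) / s
            errs.append((pred[0] - gt).abs().mean().item())
        results[s] = sum(errs) / len(errs)
    return results
```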

Load-bearing premise

The added networks and units close the gap between global modeling and local detail without introducing resolution-specific artifacts or overfitting to the chosen benchmarks.

What would settle it

Measure whether accuracy falls sharply when the same model is tested on image pairs whose resolution or scene statistics lie well outside the Middlebury and KITTI training distributions.

Figures

Figures reproduced from arXiv: 2604.20393 by Feng Xu, Haoyu Zhang, Jiakang Yuan, Jingyi Zhou, Lin Zhang, Peng Ye, Tao Chen.

Figure 1: Overview of existing typical stereo matching frameworks. (a) CNN …
Figure 2: The overall framework of our MLG-Stereo. The workflow of MLG-Stereo is to provide a pair of images, first extract the features of the images …
Figure 3: (a) The Feature Fusion Network in MGFN. We use convolution …
Figure 4: Visual comparison on the MiddEval v3 test set. Our model not only can infer ultra-high resolution images like CNN models, but also has better …
Figure 5: Visual comparison on the KITTI-2015 and KITTI-2012 test set. Compared to CNN-based models, our model has a stronger predictive ability in …
Figure 6: Visualization of image features extracted by encoder after PCA, from left to right: without patch image encoder, without full image encoder, with both …
Figure 7: The variation of model performance on the Middlebury dataset with …
Figure 8: Challenging cases analysis on the Middlebury benchmark. We sort the validation set by per-image error and visualize the worst-performing samples …
original abstract

With the development of deep learning, ViT-based stereo matching methods have made significant progress due to their remarkable robustness and zero-shot ability. However, due to the limitations of ViTs in handling resolution sensitivity and their relative neglect of local information, the ability of ViT-based methods to predict details and handle arbitrary-resolution images is still weaker than that of CNN-based methods. To address these shortcomings, we propose MLG-Stereo, a systematic pipeline-level design that extends global modeling beyond the encoder stage. First, we propose a Multi-Granularity Feature Network to effectively balance global context and local geometric information, enabling comprehensive feature extraction from images of arbitrary resolution and bridging the gap between training and inference scales. Then, a Local-Global Cost Volume is constructed to capture both locally-correlated and global-aware matching information. Finally, a Local-Global Guided Recurrent Unit is introduced to iteratively optimize the disparity locally under the guidance of global information. Extensive experiments are conducted on multiple benchmark datasets, demonstrating that our MLG-Stereo exhibits highly competitive performance on the Middlebury and KITTI-2015 benchmarks compared to contemporaneous leading methods, and achieves outstanding results in the KITTI-2012 dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MLG-Stereo, a ViT-based stereo matching architecture that introduces a Multi-Granularity Feature Network to balance global context and local geometric information for arbitrary-resolution inputs, a Local-Global Cost Volume to capture both local correlations and global-aware matching cues, and a Local-Global Guided Recurrent Unit for iterative disparity optimization under global guidance. The central claim is that this pipeline-level extension of global modeling beyond the encoder addresses ViT limitations in resolution sensitivity and local detail handling, yielding highly competitive results on Middlebury and KITTI-2015 benchmarks and outstanding results on KITTI-2012.

Significance. If the benchmark gains prove robust and generalizable, the work could meaningfully advance practical ViT deployment in stereo matching by mitigating resolution mismatches between training and inference. The multi-stage local-global design offers a coherent architectural response to known ViT-CNN trade-offs, and the emphasis on arbitrary-resolution capability is a timely contribution if supported by targeted experiments.

major comments (3)
  1. Abstract: The performance claims rest on unspecified experimental controls with no error bars, ablation tables, or quantitative deltas provided, which is load-bearing because the central assertion that the three modules close the ViT resolution and local-detail gaps cannot be evaluated without evidence that each component contributes measurably beyond baselines.
  2. Multi-Granularity Feature Network description: The scale-bridging claim (bridging training and inference scales for arbitrary-resolution inference) lacks direct verification via resolution-shift experiments or cross-benchmark generalization tests; without such tests the architecture may simply fit the Middlebury/KITTI training crops rather than decouple performance from resolution.
  3. Local-Global Cost Volume and Recurrent Unit sections: Interaction details between these modules and the feature network are insufficient to confirm they avoid introducing interpolation artifacts or forcing the cost volume to compensate for upstream resolution sensitivity, undermining the pipeline-level design claim.
minor comments (2)
  1. Abstract: Include at least one key quantitative result (e.g., EPE or D1-all on a benchmark) to ground the qualitative descriptors 'highly competitive' and 'outstanding'; reference implementations of both metrics are sketched after this list.
  2. Notation: Define all acronyms (ViT, EPE, etc.) on first use and ensure consistent terminology for 'local-global' across sections.
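
Both metrics named in the first minor comment have standard definitions; a reference sketch, assuming dense disparity tensors and a boolean mask marking pixels with valid ground truth:

```python
import torch

def epe(pred, gt, valid):
    """End-point error: mean absolute disparity error over valid pixels."""
    return (pred - gt).abs()[valid].mean()

def d1_all(pred, gt, valid):
    """KITTI D1-all: percentage of valid pixels whose disparity error
    exceeds both 3 px and 5% of the ground-truth disparity."""
    err = (pred - gt).abs()[valid]
    outliers = (err > 3.0) & (err > 0.05 * gt[valid])
    return 100.0 * outliers.float().mean()
```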

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the experimental evidence and architectural clarity while preserving the core contributions of MLG-Stereo.

point-by-point responses
  1. Referee: Abstract: The performance claims rest on unspecified experimental controls with no error bars, ablation tables, or quantitative deltas provided, which is load-bearing because the central assertion that the three modules close the ViT resolution and local-detail gaps cannot be evaluated without evidence that each component contributes measurably beyond baselines.

    Authors: We agree that the abstract would be strengthened by including key quantitative deltas and references to supporting evidence. The full manuscript already contains comparison tables on Middlebury, KITTI-2012, and KITTI-2015 along with module ablations; however, to directly address the concern we will revise the abstract to highlight specific improvements (e.g., EPE and D1 reductions versus strong ViT and CNN baselines) and ensure the experimental section reports error bars from repeated runs where feasible. We will also expand the ablation subsection to explicitly quantify each module's contribution. revision: yes

  2. Referee: Multi-Granularity Feature Network description: The scale-bridging claim (bridging training and inference scales for arbitrary-resolution inference) lacks direct verification via resolution-shift experiments or cross-benchmark generalization tests; without such tests the architecture may simply fit the Middlebury/KITTI training crops rather than decouple performance from resolution.

    Authors: The Multi-Granularity Feature Network combines ViT global context with local geometric cues at multiple scales precisely to support arbitrary-resolution inputs. While results across Middlebury (high-resolution) and KITTI (lower-resolution) datasets provide supporting evidence of generalization, we acknowledge that dedicated resolution-shift tests would offer stronger verification. We will therefore add targeted experiments: training on standard crops and evaluating on upsampled/downsampled versions of the same scenes, plus explicit cross-benchmark resolution transfer tests, to demonstrate that performance is not merely an artifact of training-crop fitting. revision: yes

  3. Referee: Local-Global Cost Volume and Recurrent Unit sections: Interaction details between these modules and the feature network are insufficient to confirm they avoid introducing interpolation artifacts or forcing the cost volume to compensate for upstream resolution sensitivity, undermining the pipeline-level design claim.

    Authors: We will augment the manuscript with a detailed data-flow diagram and additional textual description clarifying the interfaces between the Multi-Granularity Feature Network, Local-Global Cost Volume, and Local-Global Guided Recurrent Unit. We will also include a short analysis (supported by qualitative visualizations) showing that the local-global cost construction operates on the already multi-scale features without introducing interpolation artifacts, and that the recurrent refinement leverages global guidance to improve rather than merely compensate for any upstream resolution effects. This will make the pipeline-level extension of global modeling explicit. revision: yes
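
For orientation, the sketch below is a generic RAFT-style ConvGRU disparity update, not the paper's Local-Global Guided Recurrent Unit; in particular, feeding a global context map into the gates as extra conditioning channels is our guess at where "guidance" could enter, pending the promised data-flow diagram.

```python
import torch
import torch.nn as nn

class GuidedConvGRU(nn.Module):
    """Generic ConvGRU update; `global_ctx` as a gate input is hypothetical."""
    def __init__(self, hidden=64, inp=64):
        super().__init__()
        self.convz = nn.Conv2d(hidden + inp, hidden, 3, padding=1)  # update gate
        self.convr = nn.Conv2d(hidden + inp, hidden, 3, padding=1)  # reset gate
        self.convq = nn.Conv2d(hidden + inp, hidden, 3, padding=1)  # candidate state
        self.head = nn.Conv2d(hidden, 1, 3, padding=1)              # residual disparity

    def forward(self, h, local_cost, global_ctx):
        # bundle local matching evidence with the global cue
        x = torch.cat([local_cost, global_ctx], dim=1)
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))
        r = torch.sigmoid(self.convr(hx))
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))
        h = (1 - z) * h + z * q
        return h, self.head(h)   # new hidden state, disparity increment

gru = GuidedConvGRU()
h = torch.zeros(1, 64, 32, 64)
cost, ctx = torch.rand(1, 32, 32, 64), torch.rand(1, 32, 32, 64)
h, delta = gru(h, cost, ctx)    # apply as disp = disp + delta, then repeat
```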

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal validated on external benchmarks

full rationale

The paper describes an empirical neural architecture (MLG-Stereo) with three proposed modules: Multi-Granularity Feature Network, Local-Global Cost Volume, and Local-Global Guided Recurrent Unit. Performance is asserted via direct comparison on standard external datasets (Middlebury, KITTI-2012, KITTI-2015). No equations, derivations, or predictions are present that reduce to fitted parameters or self-referential definitions by construction. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the abstract or described claims. The central assertions remain falsifiable through independent benchmark evaluation and do not collapse into tautological renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the unproven premise that the three proposed modules jointly solve ViT resolution sensitivity and local neglect without side effects; no free parameters are explicitly named beyond standard network training, and no new physical entities are invented.

axioms (1)
  • domain assumption: ViTs inherently neglect local geometric information and are sensitive to resolution changes between training and inference.
    Stated in the opening of the abstract as the motivation for the work.
invented entities (3)
  • Multi-Granularity Feature Network (no independent evidence)
    purpose: Balance global context and local geometric information for arbitrary-resolution feature extraction.
    New module introduced to address ViT limitations; no independent evidence provided beyond claimed performance.
  • Local-Global Cost Volume (no independent evidence)
    purpose: Capture locally-correlated and global-aware matching information.
    New construction for the cost volume stage; no external validation cited.
  • Local-Global Guided Recurrent Unit (no independent evidence)
    purpose: Iteratively optimize disparity locally under global guidance.
    New recurrent refinement module; evidence is only the final benchmark numbers.

pith-pipeline@v0.9.0 · 5529 in / 1453 out tokens · 49265 ms · 2026-05-10T00:08:30.141666+00:00 · methodology

discussion (0)

