Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
Pith reviewed 2026-05-10 17:48 UTC · model grok-4.3
The pith
Surface normals provide domain-invariant geometric cues that improve zero-shot generalization in stereo matching from synthetic to real data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that augmenting stereo matching networks with surface normals, used as domain-invariant, object-intrinsic geometric cues, closes much of the Syn-to-Real gap. The normals are fused through a gated contextual-geometric module that filters misleading image textures and are combined with specular-transparent augmentation and sparse spatial and dual-matching attentions, so that models trained solely on synthetic data such as SceneFlow achieve lower error rates on real datasets while running faster and supporting high-resolution inference.
What carries the argument
The Gated Contextual-Geometric Fusion module that adaptively suppresses unreliable contextual cues from image features and fuses the remainder with normal-driven geometric features to build domain-invariant representations.
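The paper's GCGF internals are not reproduced in this review, but the gating idea can be sketched as a learned per-pixel convex blend of the two feature streams. Everything below (shapes, the `gated_fusion` name, the single linear gate) is illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, geo_feat, w_gate, b_gate):
    """Per-pixel gate computed from both streams; where the gate is low,
    unreliable image cues are suppressed in favor of normal-driven
    geometric features."""
    x = np.concatenate([img_feat, geo_feat], axis=-1)   # (H, W, 2C)
    gate = sigmoid(x @ w_gate + b_gate)                 # (H, W, C), in (0, 1)
    return gate * img_feat + (1.0 - gate) * geo_feat    # convex blend

H, W, C = 4, 4, 8
img = rng.standard_normal((H, W, C))   # contextual (image) features
geo = rng.standard_normal((H, W, C))   # geometric (normal-driven) features
w = rng.standard_normal((2 * C, C)) * 0.1
b = np.zeros(C)
fused = gated_fusion(img, geo, w, b)
assert fused.shape == (H, W, C)
```

Because the gate stays in (0, 1), every fused value lies between the two input features, which is what lets the module fall back on geometry without discarding context outright.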
If this is right
- Reduces disparity errors by 30% on ETH3D compared to FoundationStereo.
- Achieves 8.5% lower errors than Monster-Stereo on the non-Lambertian Booster dataset.
- Improves results by 14.1% on KITTI-2015 relative to DEFOM-Stereo.
- Runs 19.2% faster than the preceding GREAT-IGEV model.
- Supports 3K-resolution inference with disparity ranges up to 768.
Where Pith is reading between the lines
- The same normal-based reinforcement could be tested in related cross-domain tasks such as optical flow or monocular depth estimation where texture cues also shift.
- End-to-end joint training of normal estimation with the stereo network might remove the need for separate normal inputs at inference time.
- The sparse attention patterns may transfer to other dense prediction problems that require both global context and low compute.
Load-bearing premise
Surface normals can be obtained or estimated reliably enough in real scenes to serve as consistent, domain-invariant cues without introducing new errors.
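For intuition on what this premise demands, a common recipe (not necessarily the paper's pipeline) derives normals from a depth map by finite differences; any noise in the depth gradients propagates directly into the cue. The `normals_from_depth` helper below is a minimal sketch under that assumption:

```python
import numpy as np

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map via finite
    differences: n = normalize([-dz/dx, -dz/dy, 1])."""
    dzdy, dzdx = np.gradient(depth)                     # axis 0 = y, axis 1 = x
    n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    n /= np.linalg.norm(n, axis=-1, keepdims=True)      # unit-length normals
    return n

# A tilted plane: its normal should be identical at every pixel.
y, x = np.mgrid[0:16, 0:16].astype(float)
depth = 0.5 * x + 2.0
n = normals_from_depth(depth)
assert n.shape == (16, 16, 3)
assert np.allclose(n, n[8, 8])   # constant normal across the plane
```

On real sensor depth the gradients are noisy, so a smoothing step or a learned normal estimator is typically interposed; that is exactly the reliability question the premise hinges on.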
What would settle it
A controlled test on the Booster or ETH3D dataset where the full model with normal inputs produces higher disparity errors than an otherwise identical image-only baseline in non-Lambertian or occluded regions.
Original abstract
Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
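The abstract's efficiency claim rests on restricting each query to a small set of keys rather than attending densely. A generic top-k sparse attention sketch (an illustration of the mechanism, not the paper's SSA/SDMA/SVA designs) makes the cost saving concrete:

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk):
    """Each query attends only to its top-k highest-scoring keys;
    the remaining scores are masked to -inf before the softmax,
    so only k of the N key positions contribute per query."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                      # (Nq, Nk)
    idx = np.argpartition(scores, -topk, axis=-1)[:, -topk:]     # top-k per row
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))      # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
q = rng.standard_normal((6, 16))
k = rng.standard_normal((10, 16))
v = rng.standard_normal((10, 16))
out = topk_sparse_attention(q, k, v, topk=3)
assert out.shape == (6, 16)
```

In a real implementation the gather itself is what saves compute; the dense-then-mask form above only demonstrates the numerics.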
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GREATEN, a stereo matching architecture that augments image-based features with surface normals as domain-invariant geometric cues to address Syn-to-Real generalization gaps. The framework comprises a Gated Contextual-Geometric Fusion (GCGF) module for adaptive fusion, a Specular-Transparent Augmentation (STA) strategy for non-Lambertian robustness, and three sparse attention variants (SSA, SDMA, SVA) for efficiency. Trained solely on synthetic SceneFlow data, GREATEN-IGEV reports error reductions of 30% on ETH3D, 8.5% on Booster, and 14.1% on KITTI-2015 relative to recent baselines, while also claiming 19.2% faster inference and support for 3K resolution.
Significance. If the empirical gains are reproducible and attributable to the geometric cues rather than implementation details, the work would provide a practical route to stronger zero-shot transfer in stereo without real-world fine-tuning. The emphasis on efficiency via sparse attention and the explicit handling of non-Lambertian regions via STA are concrete strengths that could influence downstream applications in robotics and 3D reconstruction.
Major comments (3)
- [Experimental results] Experimental section (results tables and text): the reported percentage reductions (30% on ETH3D, 8.5% on Booster, 14.1% on KITTI) are given as single-point comparisons without error bars, standard deviations across runs, or statistical significance tests. This weakens the central claim that the normal-augmented model reliably outperforms the cited baselines.
- [GCGF module and experimental setup] Method description of GCGF and data pipeline: the framework treats surface normals as given inputs for both training and real test images, yet provides no description of how normals are computed or estimated on real benchmarks (ETH3D, Booster, KITTI). Because the weakest link in the Syn-to-Real argument is the reliability of these cues under real illumination and sensor noise, this omission is load-bearing for the generalization claim.
- [Ablation experiments] Ablation study (if present) or supplementary material: without component-wise ablations isolating the contribution of GCGF versus STA versus the sparse attentions, it is impossible to determine whether the observed gains stem from the geometric fusion or from other architectural changes relative to the GREAT-IGEV baseline.
Minor comments (2)
- [Abstract] The acronym GREATEN-IGEV is introduced without an explicit expansion or reference to the underlying IGEV backbone in the abstract; a brief parenthetical clarification would improve readability.
- [Figure 2 or equivalent] Figure captions for the architecture diagram should explicitly label the three sparse attention blocks (SSA, SDMA, SVA) and the gating thresholds to match the textual description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. These observations highlight important aspects for strengthening the presentation of our results and methods. We address each major comment point by point below and will incorporate the necessary revisions to improve clarity and rigor.
Point-by-point responses
-
Referee: Experimental section (results tables and text): the reported percentage reductions (30% on ETH3D, 8.5% on Booster, 14.1% on KITTI) are given as single-point comparisons without error bars, standard deviations across runs, or statistical significance tests. This weakens the central claim that the normal-augmented model reliably outperforms the cited baselines.
Authors: We agree that single-point comparisons limit the strength of the claims. In the revised manuscript, we will conduct additional training runs with varied random seeds to report mean performance and standard deviations for the key metrics. We will also include statistical significance testing (such as paired t-tests) against the baselines to substantiate the reliability of the reported improvements. revision: yes
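For concreteness, the seed-level comparison the authors propose reduces to a paired t statistic over per-seed errors. The `paired_t` helper and all numbers below are hypothetical placeholders, not results from the paper:

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic: mean of per-seed differences divided by the
    standard error of those differences."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Hypothetical end-point-error values across 5 training seeds.
model    = [0.42, 0.44, 0.41, 0.43, 0.42]
baseline = [0.50, 0.49, 0.52, 0.51, 0.50]
t = paired_t(model, baseline)
assert t < 0  # model's errors are consistently lower than the baseline's
```

A strongly negative t (here compared against a t distribution with 4 degrees of freedom) would support the claim that the improvement is not seed noise.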
-
Referee: Method description of GCGF and data pipeline: the framework treats surface normals as given inputs for both training and real test images, yet provides no description of how normals are computed or estimated on real benchmarks (ETH3D, Booster, KITTI). Because the weakest link in the Syn-to-Real argument is the reliability of these cues under real illumination and sensor noise, this omission is load-bearing for the generalization claim.
Authors: This is a valid observation regarding a missing detail in the experimental setup. Although the manuscript describes normals as inputs, we will expand the data pipeline section in the revision to explicitly describe the normal estimation method applied to each real-world benchmark, including the specific pre-trained estimator, any adaptation steps, and preprocessing. We will also add a brief discussion of the expected robustness of these estimates to real-world variations in illumination and noise. revision: yes
-
Referee: Ablation study (if present) or supplementary material: without component-wise ablations isolating the contribution of GCGF versus STA versus the sparse attentions, it is impossible to determine whether the observed gains stem from the geometric fusion or from other architectural changes relative to the GREAT-IGEV baseline.
Authors: We recognize that isolating the contribution of each proposed component is essential for attributing the performance gains. The current manuscript provides baseline comparisons but lacks exhaustive component ablations. In the revised version, we will include detailed ablation experiments (in the main text or supplementary material) that evaluate the model with and without GCGF, STA, and each sparse attention variant individually, thereby clarifying the source of the Syn-to-Real improvements. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical architecture (GCGF module, STA augmentation, sparse attentions SSA/SDMA/SVA) that fuses surface normals as additional geometric input with image features, trained exclusively on synthetic SceneFlow data and evaluated via direct error reductions on external real-world benchmarks (ETH3D, Booster, KITTI-2015). No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction or uniqueness result back to the same fitted quantities or self-citations by construction. References to prior GREAT-IGEV/GREAT-Stereo work are limited to runtime and capability comparisons rather than load-bearing justifications for the central Syn-to-Real gains. The performance numbers are reported as measured outcomes against independent baselines, rendering the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Gating thresholds and attention sparsity ratios
Axioms (1)
- Domain assumption: surface normals are domain-invariant, object-intrinsic, and more discriminative than image textures for cross-domain stereo matching
Reference graph
Works this paper leans on
- [1] H. Hirschmuller, "Accurate and efficient stereo processing by semi-global matching and mutual information," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2, IEEE, 2005, pp. 807–814.
- [2] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A multi-view stereo benchmark with high-resolution images and multi-camera videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3260–3269.
- [3] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling, "High-resolution stereo datasets with subpixel-accurate ground truth," in German Conference on Pattern Recognition, Springer, 2014, pp. 31–42.
- [4] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
- [5] P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, "Open challenges in deep stereo: the Booster dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21168–21178.
- [6] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, "Depth Anything V2," Advances in Neural Information Processing Systems, vol. 37, pp. 21875–21911, 2024.
- [7] X. Wang, G. Xu, H. Jia, and X. Yang, "Selective-Stereo: Adaptive frequency information selection for stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19701–19710.
- [8] G. Xu, X. Wang, X. Ding, and X. Yang, "Iterative geometry encoding volume for stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21919–21928.
- [9] M. Menze and A. Geiger, "Object scene flow for autonomous vehicles," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3061–3070.
- [10] J. Cheng, L. Liu, G. Xu, X. Wang, Z. Zhang, Y. Deng, J. Zang, Y. Chen, Z. Cai, and X. Yang, "Monster: Marry monodepth to stereo unleashes power," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6273–6282.
- [11] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, "End-to-end learning of geometry and context for deep stereo regression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75.
- [12] J.-R. Chang and Y.-S. Chen, "Pyramid stereo matching network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5410–5418.
- [13] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li, "Group-wise correlation stereo network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3273–3282.
- [14] G. Xu, J. Cheng, P. Guo, and X. Yang, "Attention concatenation volume for accurate and efficient stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12981–12990.
- [15] G. Xu, Y. Wang, J. Cheng, J. Tang, and X. Yang, "Accurate and efficient stereo matching via attention concatenation volume," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 2461–2474, 2023.
- [16] Y. Wang, L. Wang, K. Li, Y. Zhang, D. O. Wu, and Y. Guo, "Cost volume aggregation in stereo matching revisited: A disparity classification perspective," IEEE Transactions on Image Processing, vol. 33, pp. 6425–6438, 2024.
- [17] Y. Wang, K. Li, L. Wang, J. Hu, D. O. Wu, and Y. Guo, "AdStereo: Efficient stereo matching with adaptive downsampling and disparity alignment," IEEE Transactions on Image Processing, 2025.
- [18] Y. Wang, J. Zheng, C. Zhang, Z. Zhang, K. Li, Y. Zhang, and J. Hu, "DualNet: Robust self-supervised stereo matching with pseudo-label supervision," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8178–8186.
- [19] L. Lipson, Z. Teed, and J. Deng, "RAFT-Stereo: Multilevel recurrent field transforms for stereo matching," in 2021 International Conference on 3D Vision (3DV), IEEE, 2021, pp. 218–227.
- [20] J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, "Practical stereo matching via cascaded recurrent network with adaptive correlation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16263–16272.
- [21] Y. Wang, L. Wang, H. Wang, and Y. Guo, "SPNet: Learning stereo matching with slanted plane aggregation," IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 6258–6265, 2022.
- [22] J. Zeng, C. Yao, L. Yu, Y. Wu, and Y. Jia, "Parameterized cost volume for stereo matching," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18347–18357.
- [23] J. Li, X. Chen, Z. Jiang, Q. Zhou, Y.-H. Li, and J. Wang, "Global regulation and excitation via attention tuning for stereo matching," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 25539–25549.
- [25] G. Yang, J. Manela, M. Happold, and D. Ramanan, "Hierarchical deep stereo matching on high-resolution images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5515–5524.
- [26] B. Liu, H. Yu, and G. Qi, "GraftNet: Towards domain generalized stereo matching with a broad-spectrum and task-oriented feature," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13012–13021.
- [27] X. Li, Y. Fan, Z. Rao, G. Lv, and S. Liu, "Synthetic-to-real domain adaptation joint spatial feature transform for stereo matching," IEEE Signal Processing Letters, vol. 29, pp. 60–64, 2021.
- [28] T. Chang, X. Yang, T. Zhang, and M. Wang, "Domain generalized stereo matching via hierarchical visual transformation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9559–9568.
- [29] Z. Rao, B. Xiong, M. He, Y. Dai, R. He, Z. Shen, and X. Li, "Masked representation learning for domain generalized stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5435–5444.
- [30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
- [31] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- [32] F. Zhang, X. Qi, R. Yang, V. Prisacariu, B. Wah, and P. Torr, "Domain-invariant stereo matching networks," in European Conference on Computer Vision, Springer, 2020, pp. 420–439.
- [33] Y. Zhang, L. Wang, K. Li, Y. Wang, and Y. Guo, "Learning representations from foundation models for domain generalized stereo matching," in European Conference on Computer Vision, Springer, 2024, pp. 146–162.
- [34] H. Jiang, Z. Lou, L. Ding, R. Xu, M. Tan, W. Jiang, and R. Huang, "DEFOM-Stereo: Depth foundation model based stereo matching," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21857–21867.
- [35] B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield, "FoundationStereo: Zero-shot stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 5249–5260.
- [36] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., "DINOv2: Learning robust visual features without supervision," arXiv preprint arXiv:2304.07193, 2023.
- [37] L. Bartolomei, F. Tosi, M. Poggi, and S. Mattoccia, "Stereo Anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 1013–1027.
- [38] T. Guan, J. Guo, C. Wang, and Y.-H. Liu, "BridgeDepth: Bridging monocular and stereo reasoning with latent alignment," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 27681–27691.
- [39] J. Cheng, X. Yang, Y. Pu, and P. Guo, "Region separable stereo matching," IEEE Transactions on Multimedia, vol. 25, pp. 4880–4893, 2022.
- [40] J. Cheng, G. Xu, P. Guo, and X. Yang, "CoAtRSNet: Fully exploiting convolution and attention for stereo matching by region separation," International Journal of Computer Vision, vol. 132, no. 1, pp. 56–73, 2024.
- [41] Z. Shen, Y. Dai, X. Song, Z. Rao, D. Zhou, and L. Zhang, "PCW-Net: Pyramid combination and warping cost volume for stereo matching," in European Conference on Computer Vision, Springer, 2022, pp. 280–297.
- [42] Y. Wang, L. Wang, C. Zhang, Y. Zhang, Z. Zhang, A. Ma, C. Fan, T. L. Lam, and J. Hu, "Learning robust stereo matching in the wild with selective mixture-of-experts," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21276–21287.
- [43] H. Xu and J. Zhang, "AANet: Adaptive aggregation network for efficient stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1959–1968.
- [44] F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, "GA-Net: Guided aggregation net for end-to-end stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 185–194.
- [45] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048.
- [46] X. Gu, Z. Fan, S. Zhu, Z. Dai, F. Tan, and P. Tan, "Cascade cost volume for high-resolution multi-view stereo and stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2495–2504.
- [47] Z. Shen, Y. Dai, and Z. Rao, "CFNet: Cascade and fused cost volume for robust stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13906–13915.
- [48] Z. Teed and J. Deng, "RAFT: Recurrent all-pairs field transforms for optical flow," in European Conference on Computer Vision, Springer, 2020, pp. 402–419.
- [49] H. Zhao, H. Zhou, Y. Zhang, J. Chen, Y. Yang, and Y. Zhao, "High-frequency stereo matching network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1327–1336.
- [50] G. Xu, X. Wang, Z. Zhang, J. Cheng, C. Liao, and X. Yang, "IGEV++: Iterative multi-range geometry encoding volumes for stereo matching," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [51] Y. Zhong, Y. Dai, and H. Li, "Self-supervised learning for stereo matching with self-improving ability," arXiv preprint arXiv:1709.00930, 2017.
- [52] Y. Zhang, S. Khamis, C. Rhemann, J. Valentin, A. Kowdle, V. Tankovich, M. Schoenberg, S. Izadi, T. Funkhouser, and S. Fanello, "ActiveStereoNet: End-to-end self-supervised learning for active stereo systems," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–801.
- [53] H. Wang, R. Fan, P. Cai, and M. Liu, "PVStereo: Pyramid voting module for end-to-end self-supervised stereo matching," IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4353–4360, 2021.
- [54] Z. Rao, M. He, Y. Dai, and Z. Shen, "Patch attention network with generative adversarial model for semi-supervised binocular disparity prediction," The Visual Computer, vol. 38, no. 1, pp. 77–93, 2022.
- [55] J. Jing, J. Li, P. Xiong, J. Liu, S. Liu, Y. Guo, X. Deng, M. Xu, L. Jiang, and L. Sigal, "Uncertainty guided adaptive warping for robust and efficient stereo matching," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3318–3327.
- [56] C. Liu, L. Sun, X. Ning, J. Xu, L. Yu, K. Zhang, and W. Li, "Adaptively identify and refine ill-posed regions for accurate stereo matching," Neural Networks, vol. 178, p. 106394, 2024.
- [57] Z. Liu, Y. Li, and M. Okutomi, "Global occlusion-aware transformer for robust stereo matching," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3535–3544.
- [58] C. Wang, X. Wang, J. Zhang, L. Zhang, X. Bai, X. Ning, J. Zhou, and E. Hancock, "Uncertainty estimation for stereo matching based on evidential deep learning," Pattern Recognition, vol. 124, p. 108498, 2022.
- [59] S. Jiang, D. Campbell, Y. Lu, H. Li, and R. Hartley, "Learning to estimate hidden motions with global motion aggregation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9772–9781.
- [60] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
- [61] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," arXiv preprint arXiv:2010.04159, 2020.
- [62] Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, "Vision transformer with deformable attention," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4794–4803.
- [63] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, "BEVFormer: Learning bird's-eye-view representation from LiDAR-camera via spatiotemporal transformers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2020–2036, 2024.
- [64] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, and J. Lu, "GaussianFormer: Scene as Gaussians for vision-based 3D semantic occupancy prediction," in European Conference on Computer Vision, Springer, 2024, pp. 376–393.
- [65] Y. Huang, A. Thammatadatrakoon, W. Zheng, Y. Zhang, D. Du, and J. Lu, "GaussianFormer-2: Probabilistic Gaussian superposition for efficient 3D occupancy prediction," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27477–27486.
- [66] A. Vaswani et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [67] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
- [68] F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou, "Rethinking skip connection with layer normalization in transformers and resnets," arXiv preprint arXiv:2105.07205, 2021.
- [69] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, "Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10912–10922.
- [70] C. Yao, L. Yu, Z. Liu, J. Zeng, Y. Wu, and Y. Jia, "Diving into the fusion of monocular priors for generalized stereo matching," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14887–14897.
- [71] X. Wang, H. Yang, H. Wang, J. Cheng, G. Xu, M. Lin, and X. Yang, "PromptStereo: Zero-shot stereo matching via structure and motion prompts," arXiv preprint arXiv:2603.01650, 2026.
- [72] J. Cheng, W. Liao, Z. Cai, L. Liu, G. Xu, X. Wang, Y. Wang, Z. Yuan, Y. Deng, J. Zang et al., "Monster++: Unified stereo matching, multi-view stereo, and real-time stereo with monodepth priors," arXiv preprint arXiv:2501.08643, 2025.
- [73] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
- [74] L. N. Smith and N. Topin, "Super-convergence: Very fast training of neural networks using large learning rates," in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, vol. 11006, SPIE, 2019, pp. 369–386.
- [75] J. Tremblay, T. To, and S. Birchfield, "Falling Things: A synthetic dataset for 3D object detection and pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2038–2041.
- [76] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 4909–4916.
- [77] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in European Conference on Computer Vision, Springer, 2012, pp. 611–625.
- [78] Y. Cabon, N. Murray, and M. Humenberger, "Virtual KITTI 2," arXiv preprint arXiv:2001.10773, 2020.
- [79] W. Bao, W. Wang, Y. Xu, Y. Guo, S. Hong, and X. Zhang, "InStereo2K: A large real dataset for stereo matching in indoor scenes," Science China Information Sciences, vol. 63, no. 11, p. 212101, 2020.
- [80] E. Ilg, T. Saikia, M. Keuper, and T. Brox, "Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 614–630.
- [81] Z. Liang, Y. Feng, Y. Guo, H. Liu, W. Chen, L. Qiao, L. Zhou, and J. Zhang, "Learning for disparity estimation through feature constancy," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2811–2820.