LinStereo: Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching

Oliver Turner; Viorela Ila; Yiran Wang

arxiv: 2606.25437 · v1 · pith:MQBENCO7new · submitted 2026-06-24 · 💻 cs.CV

Pith reviewed 2026-06-25 21:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords stereo matching · linear attention · vision foundation models · disparity estimation · underwater vision · iterative refinement · global aggregation · cross-domain generalization

The pith

LinStereo replaces local recurrence in stereo matching with a linear-cost global attention module that spreads reliable disparities from clear to degraded regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard VFM-based iterative stereo pipelines collapse multi-scale backbone features into single-level correlations, leave geometric priors unused at initialization, and restrict context propagation to local operations. These limitations become acute under degraded photometric conditions such as underwater scenes. LinStereo introduces a Position-Aware Linear Attention module that performs global aggregation at linear cost, supported by Hierarchical Semantic Cost Volumes drawn from the VFM feature hierarchy and a Depth Prior Initialization that supplies a metrically calibrated starting point from monocular depth. The design lets reliable disparity estimates propagate from well-matched areas into problematic ones while preserving disparity structure. Reported results show state-of-the-art-level accuracy on conventional benchmarks together with large gains on two underwater datasets.

Core claim

The paper claims that the Position-Aware Linear Attention module, combined with scale-aligned Hierarchical Semantic Cost Volumes and Depth Prior Initialization inside an iterative stereo pipeline built on Depth Anything V3, achieves global context propagation at linear complexity, yielding state-of-the-art-level accuracy on standard benchmarks and AbsRel reductions of 28 percent on TartanAir-UW and 26 percent on SQUID.
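For readers unfamiliar with the headline metric, here is a minimal sketch of AbsRel (absolute relative error) and of how a "28 percent lower" figure is computed. The definition is the standard one from the depth-estimation literature; the abstract does not say whether the paper evaluates it on depth or on disparity, and the helper names and toy numbers below are invented for illustration, not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def relative_reduction(baseline_err, new_err):
    """Fractional error reduction of new_err relative to baseline_err."""
    return (baseline_err - new_err) / baseline_err


# Toy values, invented for illustration only (not results from the paper).
gt = [2.0, 4.0, 0.0, 8.0]          # 0.0 marks an invalid pixel and is skipped
baseline = [2.5, 3.0, 1.0, 8.0]
improved = [2.25, 3.5, 1.0, 8.0]

base_err = abs_rel(baseline, gt)   # (0.25 + 0.25 + 0.0) / 3
new_err = abs_rel(improved, gt)    # (0.125 + 0.125 + 0.0) / 3
reduction = relative_reduction(base_err, new_err)
```

A reported "28% lower AbsRel" corresponds to `relative_reduction(...)` returning 0.28 against whichever baseline the paper compares to.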

What carries the argument

Position-Aware Linear Attention (PALA) module, which replaces local recurrence with global aggregation of disparity information at linear computational cost while preserving structure.
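To make "global aggregation at linear cost" concrete, here is a minimal sketch of kernelized linear attention in plain Python. It is not the paper's PALA module: the `elu(x) + 1` feature map is an assumed kernel choice, and the 2D rotary positional encoding and gating branch mentioned in Figure 1 are left out. The function names are hypothetical.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed kernel choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Global attention in O(N * d * e) time via the kernel trick.

    Q, K: N vectors of length d; V: N vectors of length e.
    """
    d, e = len(K[0]), len(V[0])
    kv = [[0.0] * e for _ in range(d)]   # sum_j phi(k_j) v_j^T, shape (d, e)
    k_sum = [0.0] * d                    # sum_j phi(k_j), shape (d,)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(e):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(e)])
    return out


def quadratic_attention(Q, K, V):
    """Reference O(N^2) form with the same kernel, for checking equivalence."""
    e = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(e)])
    return out
```

The saving comes from associativity: summing `phi(k_j) v_j^T` once gives a d-by-e summary that every query reuses, so cost grows linearly in the number of pixels N, whereas the reference form computes all N² query-key weights. Both functions return the same values up to floating-point error.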

If this is right

Reliable disparity estimates can be propagated across the entire image without quadratic global attention cost.
Multi-scale VFM features can be used directly as scale-aligned correlations rather than collapsed to a single level.
Monocular depth estimates can serve as an effective metrically calibrated initialization for stereo refinement.
Performance gains hold across both standard indoor/outdoor benchmarks and severely degraded underwater domains.
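The second point above, keeping one correlation per feature scale rather than a single collapsed volume, can be sketched as follows. This is an illustrative construction for a single scanline, not the paper's HSCV: the `(stride, features)` pyramid layout, the function names, and the choice to shrink the disparity range with stride are all assumptions.

```python
def correlation_row(feat_left, feat_right, max_disp):
    """Correlation volume for one scanline.

    feat_left, feat_right: lists of W feature vectors (one per pixel).
    Returns a W x max_disp table where entry [x][d] is the dot product
    between left pixel x and right pixel x - d (0.0 when x - d < 0).
    """
    width = len(feat_left)
    volume = []
    for x in range(width):
        row = []
        for d in range(max_disp):
            xr = x - d
            if xr < 0:
                row.append(0.0)
            else:
                row.append(sum(a * b for a, b in zip(feat_left[x], feat_right[xr])))
        volume.append(row)
    return volume


def hierarchical_volumes(pyramid_left, pyramid_right, max_disp_full):
    """One correlation volume per pyramid level.

    pyramid_*: list of (stride, features) pairs, finest level first.
    The disparity range shrinks with the stride so that every level
    covers the same range measured in full-resolution pixels.
    """
    volumes = []
    for (stride, fl), (_, fr) in zip(pyramid_left, pyramid_right):
        volumes.append(correlation_row(fl, fr, max(1, max_disp_full // stride)))
    return volumes
```

With features that match only at the true offset, each row of the finest volume peaks at the true disparity; coarser levels trade resolution for a shorter search range.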

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear global aggregation pattern could be tested on related dense correspondence tasks such as optical flow where long-range propagation is also needed.
If the propagation mechanism works, future stereo pipelines may be able to reduce the number of local refinement iterations while maintaining accuracy.
The approach highlights that hierarchical cost volumes can bridge the gap between monocular priors and stereo geometry without requiring additional network branches.

Load-bearing premise

The underlying VFM features and hierarchical cost volumes supply enough reliable signal for the attention module to propagate accurate disparity estimates from well-matched regions into areas with degraded photometric cues.

What would settle it

Apply the pipeline to a controlled test set in which large contiguous regions have their initial matches deliberately corrupted or removed and measure whether final disparity accuracy still exceeds local-recurrence baselines.
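A minimal sketch of how such a controlled test could be set up, in plain Python. Everything here is hypothetical scaffolding: `refine` stands in for any stereo updater (LinStereo or a local-recurrence baseline) and is assumed to take an initial disparity map and return a refined one; the rectangular corruption pattern and the mean-absolute-error score are illustrative choices, not a protocol from the paper.

```python
import random


def corrupt_block(disp, top, left, height, width, rng, max_disp):
    """Return a copy of disp with one rectangular block replaced by random values."""
    out = [row[:] for row in disp]
    for y in range(top, top + height):
        for x in range(left, left + width):
            out[y][x] = rng.uniform(0.0, max_disp)
    return out


def block_error(pred, gt, top, left, height, width):
    """Mean absolute disparity error restricted to the corrupted block."""
    errs = [abs(pred[y][x] - gt[y][x])
            for y in range(top, top + height)
            for x in range(left, left + width)]
    return sum(errs) / len(errs)


def evaluate(refine, gt, block, seed=0, max_disp=64.0):
    """Corrupt the initialization inside `block`, refine, and score that block.

    `refine` is a placeholder for any stereo updater (hypothetical interface).
    `block` is a (top, left, height, width) tuple.
    """
    rng = random.Random(seed)
    init = corrupt_block(gt, *block, rng, max_disp)
    return block_error(refine(init), gt, *block)
```

Running `evaluate` with the same seed and block for a global-attention updater and for a local-recurrence updater, then sweeping the block size, would show whether the advantage persists as the corrupted region grows.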

Figures

Figures reproduced from arXiv: 2606.25437 by Oliver Turner, Viorela Ila, Yiran Wang.

**Figure 1.** Detailed Architecture of LinStereo: Our model replaces the ConvGRU update operator with the Position-Aware Linear Attention (PALA) updater, which enables global spatial reasoning over cost volume features at linear complexity by attention. The attention path applies kernel-activated queries and keys with 2D rotary positional encoding, while a parallel context modulation branch provides adaptive gating fo…

**Figure 2.** Qualitative comparison on standard and underwater stereo benchmarks. progressive build-up of the PALA block from a position-agnostic linear attention baseline (A). Each addition yields consistent gains on both benchmarks, with global spatial encoding (A→B) providing the largest single improvement, confirming that restoring positional structure is the most critical enhancement to vanilla linear attention. …

**Figure 3.** Synthesis of the SeaStereo-Dataset: The Blender Python API automated the rendering pipeline by iterating through each configuration of features (camera path, camera model, water type, depth and ShapeNet object subset). Each configuration generated stereo image pairs (IL, IR) and disparity maps (DL, DR). and placed within the scene. This process was repeated for three iterations per configuration, with diff…

**Figure 4.** SeaStereo-Dataset Examples: Left camera renders from SeaStereo-Dataset, illustrating variation in seafloor depth and water type. These examples demonstrate the controlled changes in underwater visibility and environmental configuration used during dataset generation. Column labels: Left, Right, Disparity (repeated three times).

**Figure 5.** Additional SeaStereo-Dataset samples. Each row shows three sets of stereo RGB images (left, right) and their corresponding disparity annotations under different Jerlov water types, seafloor depths, and cameras. The dataset captures diverse underwater conditions ranging from clear to highly turbid water. than simplifying the architecture, we reduce the number of refinement iterations required for convergenc…

**Figure 6.** Additional qualitative comparison on Booster (Q).

**Figure 7.** Additional qualitative comparison on ETH3D.

**Figure 8.** Additional qualitative comparison on KITTI 2012.

**Figure 9.** Additional qualitative comparison on KITTI 2015.

**Figure 10.** Additional qualitative comparison on TartanAir-UW (1/2).

**Figure 11.** Additional qualitative comparison on TartanAir-UW (2/2).

**Figure 12.** Additional qualitative comparison on SQUID.

**Figure 13.** Qualitative results on the laboratory tank dataset.

**Figure 14.** Qualitative comparison on SeaStereo-Dataset (1/2).

**Figure 15.** Qualitative comparison on SeaStereo-Dataset (2/2).

Original abstract

Existing Vision Foundation Model (VFM)-based iterative stereo pipelines under-exploit three information pathways: multi-scale backbone features are collapsed into single-level correlations, geometric priors remain untapped at initialization, and context propagates only locally. These gaps widen under degraded photometric cues, making underwater scenes a stringent generalization test. To address this, we propose LinStereo, built upon Depth Anything V3, whose core is a Position-Aware Linear Attention (PALA) module that replaces local recurrence with global aggregation at linear cost, propagating reliable estimates from well-matched regions into degraded areas while preserving disparity structure. PALA is made effective by two enabling components: Hierarchical Semantic Cost Volumes (HSCV), which supply scale-aligned correlations from the VFM feature hierarchy, and a Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. LinStereo achieves state-of-the-art-level accuracy on standard benchmarks and strong cross-domain generalization, particularly on underwater scene where severe photometric degradation makes stereo matching particularly challenging, attaining the best overall accuracy with consistent gains 28% lower AbsRel on TartanAir-UW, 26% on SQUID, a real-world underwater dataset).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LinStereo swaps local recurrence for linear global attention in a VFM stereo pipeline and reports solid gains on underwater data, but the experimental support is thin on baselines and ablations.

The core move is replacing the usual local recurrence in iterative stereo with a Position-Aware Linear Attention module that does global aggregation at linear cost. They pair it with hierarchical semantic cost volumes from the VFM feature pyramid and a monocular depth prior for initialization. That combination is the actual new piece.

It targets a genuine pain point: photometric degradation in underwater scenes where standard local methods lose signal. The reported drops in AbsRel on TartanAir-UW and SQUID are the strongest part of the abstract, and the architecture description is coherent.

The soft spots are the usual ones for an abstract-only view. No detail on which exact baselines were used, how large the gains are after proper ablations of each component, or whether the linear attention actually preserves fine disparity structure rather than just averaging. The claim that it propagates reliable estimates into degraded regions rests on the VFM features and HSCV supplying enough signal, but that needs the full experiments and visualizations to check.

This is for groups already running VFM-based stereo or working on cross-domain 3D vision. The efficiency claim and the domain results are worth a referee's time even if the paper needs tighter controls on the numbers.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes LinStereo, an iterative stereo matching method built on Depth Anything V3. Its core contributions are the Position-Aware Linear Attention (PALA) module for global context aggregation at linear cost, Hierarchical Semantic Cost Volumes (HSCV) to supply scale-aligned correlations from the VFM feature hierarchy, and Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. The central empirical claim is state-of-the-art-level accuracy on standard benchmarks together with strong cross-domain generalization on underwater scenes, specifically 28% lower AbsRel on TartanAir-UW and 26% on the real-world SQUID dataset.

Significance. If the reported accuracy gains and cross-domain improvements hold under rigorous evaluation, the work would show that linear-complexity global attention combined with semantic cost volumes and geometric priors can improve propagation of disparity estimates into photometrically degraded regions while remaining computationally tractable. This would be a meaningful advance for stereo matching in challenging environments such as underwater vision.

major comments (1)

Abstract: the performance numbers (28% lower AbsRel on TartanAir-UW, 26% on SQUID) and the claim of 'state-of-the-art-level accuracy' are presented without any description of experimental protocol, baselines, metrics, or error analysis. Because these numbers constitute the central empirical claim, the absence of supporting experimental details prevents evaluation of whether the reported gains are load-bearing or reproducible.

minor comments (1)

Abstract, final sentence: 'underwater scene where' should read 'underwater scenes where'; the closing parenthesis after 'dataset' is misplaced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract regarding our central empirical claims. We address this point below and will revise the manuscript accordingly.

Referee: Abstract: the performance numbers (28% lower AbsRel on TartanAir-UW, 26% on SQUID) and the claim of 'state-of-the-art-level accuracy' are presented without any description of experimental protocol, baselines, metrics, or error analysis. Because these numbers constitute the central empirical claim, the absence of supporting experimental details prevents evaluation of whether the reported gains are load-bearing or reproducible.

Authors: We agree that the abstract would benefit from additional context to support the reported performance numbers and SOTA-level claim. In the revised version, we will update the abstract to briefly specify the evaluation benchmarks (TartanAir-UW and SQUID), the primary metric (AbsRel), and a reference to the full experimental protocol, baselines, and analysis detailed in Section 4. This change will improve evaluability while preserving conciseness. The full details, including comparisons and error breakdowns, remain in the Experiments section as before.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical architecture (PALA for global linear attention, HSCV for hierarchical correlations, DPI for monocular initialization) built on an existing VFM backbone and reports benchmark accuracy numbers as experimental outcomes. No derivation chain, equations, or first-principles claims are present that reduce by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. The central claims rest on measured performance rather than identities internal to the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about the reliability of Depth Anything V3 features and the effectiveness of the newly introduced modules; no explicit free parameters are named in the abstract, and the modules themselves constitute new entities without external validation.

axioms (2)

domain assumption: Depth Anything V3 provides reliable multi-scale features suitable for building hierarchical semantic cost volumes.
The pipeline is built directly upon this VFM.

domain assumption: Monocular depth estimates can be converted into a metrically calibrated warm-start disparity map.
This underpins the DPI component.
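The second assumption rests on the standard rectified-stereo relation between depth and disparity, d = f · B / Z. The sketch below shows that conversion plus one possible way to calibrate a monocular prediction whose scale is unknown. It is not the paper's DPI: the function names are hypothetical, and the least-squares scale fit against a handful of trusted reference disparities is an assumed calibration strategy that the abstract does not describe.

```python
def depth_to_disparity(depth_m, focal_px, baseline_m, min_depth=1e-6):
    """Convert metric depth (metres) to disparity (pixels): d = f * B / Z.

    Depths below min_depth are clamped to avoid division by zero.
    """
    return [[focal_px * baseline_m / max(z, min_depth) for z in row]
            for row in depth_m]


def fit_scale(mono_disp, ref_disp):
    """Least-squares scale s minimizing sum((s * m - r)^2).

    mono_disp: disparities derived from monocular depth at some pixels.
    ref_disp:  trusted disparities at the same pixels (assumed available).
    """
    num = sum(m * r for m, r in zip(mono_disp, ref_disp))
    den = sum(m * m for m in mono_disp)
    if den == 0.0:
        raise ValueError("monocular disparities are all zero")
    return num / den


def warm_start(depth_m, focal_px, baseline_m, scale=1.0):
    """Scaled disparity map intended as the initial estimate for refinement."""
    disp = depth_to_disparity(depth_m, focal_px, baseline_m)
    return [[scale * d for d in row] for row in disp]
```

For example, with a 1000-pixel focal length and a 0.1 m baseline, a point 2 m away maps to a disparity of about 50 pixels; if the monocular model is off by a constant factor, `fit_scale` recovers that factor from the reference pixels.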

invented entities (3)

Position-Aware Linear Attention (PALA) · no independent evidence
purpose: Replace local recurrence with global aggregation at linear cost.
Core new module introduced to address context propagation.

Hierarchical Semantic Cost Volumes (HSCV) · no independent evidence
purpose: Supply scale-aligned correlations from the VFM feature hierarchy.
New cost-volume construction method.

Depth Prior Initialization (DPI) · no independent evidence
purpose: Convert monocular depth into a metrically calibrated stereo initialization.
New initialization strategy.

Reference graph

Works this paper leans on

53 extracted references · 38 canonical work pages · 4 internal anchors

[1]

In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Akkaynak, D., Treibitz, T.: A revised underwater image formation model. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6723–

2018
[2]

IEEE (2018).https://doi.org/10.1109/CVPR.2018.00703

work page doi:10.1109/cvpr.2018.00703 2018
[3]

In: IEEE/RJS International Conference on Intelligent RObots and Systems

Bangunharcana, A., Cho, J.W., Lee, S., Kweon, I.S., Kim, K.S., Kim, S.: Correlate- and-excite: Real-time stereo matching via guided cost volume excitation. In: IEEE/RJS International Conference on Intelligent RObots and Systems. pp. 3542–
[4]

IEEE, IEEE (2021).https://doi.org/10.1109/IROS51168.2021.9635909

work page doi:10.1109/iros51168.2021.9635909 2021
[6]

IEEE Transactions on Pattern Analysis and Machine Intelligence43(8), 2822–2837 (2018).https: //doi.org/10.1109/tpami.2020.2977624

Berman, D., Levy, D., Avidan, S., Treibitz, T.: Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence43(8), 2822–2837 (2018).https: //doi.org/10.1109/tpami.2020.2977624

work page doi:10.1109/tpami.2020.2977624 2018
[7]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Bochkovskiy, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. In: arXiv.org (2024).https://doi.org/10.48550/arXiv.2410.02073

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.02073 2024
[8]

Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An Information-Rich 3D Model Repository. Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)

Pith/arXiv arXiv 2015
[9]

In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5410–5418. IEEE (2018).https://doi.org/10.1109/CVPR.2018.00567

work page doi:10.1109/cvpr.2018.00567 2018
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Cheng, J., Liu, L., Xu, G., Wang, X., Zhang, Z., Deng, Y., Zang, J., Chen, Y., Cai, Z., Yang, X.: Monster: Marry monodepth to stereo unleashes power. In: Computer Vision and Pattern Recognition. pp. 6273–6282. IEEE (2025).https://doi.org/ 10.1109/CVPR52734.2025.00588

work page doi:10.1109/cvpr52734.2025.00588 2025
[12]

In: IEEE International Con- ference on Computer Vision

Duggal, S., Wang, S., Ma, W.C., Hu, R., Urtasun, R.: Deeppruner: Learning effi- cient stereo matching via differentiable patchmatch. In: IEEE International Con- ference on Computer Vision. pp. 4383–4392. IEEE (2019).https://doi.org/10. 1109/ICCV.2019.00448

arXiv 2019
[13]

In: 2012 IEEE Conference on Computer Vision and Pattern Recognition

Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361. IEEE (2012).https://doi.org/10. 1109/CVPR.2012.6248074

arXiv 2012
[14]

In: Image Processing On Line

Gómez, Á.: GA-Net: Guided aggregation net for end-to-end stereo matching. In: Image Processing On Line. pp. 185–194 (2023).https://doi.org/10.5201/ipol. 2023.441

work page doi:10.5201/ipol 2023
[15]

Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom

Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for high- resolution multi-view stereo and stereo matching. In: Computer Vision and Pattern Recognition. pp. 2495–2504 (2019).https://doi.org/10.1109/CVPR42600.2020. 00257 LinStereo: Linear-Complexity Global Attention for Stereo Matching 17

work page doi:10.1109/cvpr42600.2020 2019
[16]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Guan, T., Guo, J., Wang, C., Liu, Y.: Bridgedepth: Bridging monocular and stereo reasoning with latent alignment. In: IEEE International Conference on Computer Vision. pp. 27681–27691. IEEE (2025).https://doi.org/10.1109/ICCV51701. 2025.02570

work page doi:10.1109/iccv51701 2025
[17]

UniDepth: Universal Monocular Metric Depth Estimation

Guan, T., Wang, C., Liu, Y.H.: Neural markov random field for stereo matching. In: Computer Vision and Pattern Recognition. pp. 5459–5469. IEEE (June 2024). https://doi.org/10.1109/CVPR52733.2024.00522

work page doi:10.1109/cvpr52733.2024.00522 2024
[18]

In: IEEE International Conference on Robotics and Automation (2024).https://doi.org/10.1109/ ICRA55743.2025.11127711

Guo, X., Zhang, C., Nie, D., Zheng, W., Zhang, Y., Chen, L.: Lightstereo: Chan- nel boost is all you need for efficient 2d cost aggregation. In: IEEE International Conference on Robotics and Automation (2024).https://doi.org/10.1109/ ICRA55743.2025.11127711

arXiv 2024
[19]

In: Computer Vision and Pattern Recognition

Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo net- work. In: Computer Vision and Pattern Recognition. pp. 3268–3277. IEEE (2019). https://doi.org/10.1109/CVPR.2019.00339

work page doi:10.1109/cvpr.2019.00339 2019
[20]

IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 10579–10596 (2024).https: //doi.org/10.1109/TPAMI.2024.3444912

Hu, M., Yin, W., Zhang, C., Cai, Z., Long, X., Chen, H., Wang, K., Yu, G., Shen, C., Shen, S.: Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 10579–10596 (2024).https: //doi.org/10.1109/TPAMI.2024.3444912

work page doi:10.1109/tpami.2024.3444912 2024
[21]

In: Computer Vision and Pattern Recognition

Jiang, H., Lou, Z., Ding, L., Xu, R., Tan, M., Jiang, W., Huang, R.: DEFOM- Stereo: Depth foundation model based stereo matching. In: Computer Vision and Pattern Recognition. pp. 21857–21867. IEEE (2025).https://doi.org/10.1109/ CVPR52734.2025.02036

arXiv 2025
[22]

In: 2017 IEEE International Conference on Computer Vision (ICCV)

Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 66–75. IEEE (2017).https://doi.org/10.1109/iccv.2017.17

work page doi:10.1109/iccv.2017.17 2017
[23]

In: European Conference on Computer Vision

Khamis, S., Fanello, S., Rhemann, C., Kowdle, A., Valentin, J., Izadi, S.: Stere- onet: Guided hierarchical refinement for real-time edge-aware depth prediction. In: European Conference on Computer Vision. pp. 596–613. Springer International Publishing (2018).https://doi.org/10.1007/978-3-030-01267-0_35

work page doi:10.1007/978-3-030-01267-0_35 2018
[24]

nouns": [...],

Li, J., Wang, P., Xiong, P., Cai, T., Yan, Z., Yang, L., Liu, J., Fan, H., Liu, S.: Practicalstereomatchingviacascadedrecurrentnetworkwithadaptivecorrelation. In: Computer Vision and Pattern Recognition. pp. 16242–16251. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01578

work page doi:10.1109/cvpr52688.2022.01578 2022
[25]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv.org (2025).https: //doi.org/10.48550/arXiv.2511.10647

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.10647 2025
[26]

IEEE International Conference on 3D Vision (3DV) , volume =

Lipson, L., Teed, Z., Deng, J.: Raft-stereo: Multilevel recurrent field transforms for stereo matching. In: International Conference on 3D Vision. pp. 218–227. IEEE, IEEE (2021).https://doi.org/10.1109/3DV53792.2021.00032

work page doi:10.1109/3dv53792.2021.00032 2021
[27]

IEEE transactions on cir- cuits and systems for video technology (Print) (2024).https://doi.org/10.1109/ TCSVT.2025.3572044

Lv, Q., Dong, J., Li, Y., Chen, S., Yu, H., Zhang, S., Wang, W.: Uwstereo: A large synthetic dataset for underwater stereo matching. IEEE transactions on cir- cuits and systems for video technology (Print) (2024).https://doi.org/10.1109/ TCSVT.2025.3572044

arXiv 2024
[28]

In: Computer Vision and Pattern Recognition

Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Computer Vision and Pattern Recognition. pp. 4040– 4048 (2015).https://doi.org/10.1109/CVPR.2016.438 18 Y. Wang et al

work page doi:10.1109/cvpr.2016.438 2015
[29]

In: Computer Vision and Pattern Recognition

Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Computer Vision and Pattern Recognition. pp. 3061–3070. IEEE (2015).https://doi.org/ 10.1109/CVPR.2015.7298925

work page doi:10.1109/cvpr.2015.7298925 2015
[30]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. (2023).https:// doi.org/10.48550/arXiv.2304.07193

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.07193 2023
[31]

In: Computer Vision and Pattern Recog- nition

Ramirez,P.Z.,Tosi,F.,Poggi,M.,Salti,S.,Mattoccia,S.,Stefano,L.D.:Openchal- lenges in deep stereo: the booster dataset. In: Computer Vision and Pattern Recog- nition. pp. 21136–21146. IEEE (2022).https://doi.org/10.1109/CVPR52688. 2022.02049

work page doi:10.1109/cvpr52688 2022
[32]

In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Mon- treal, QC, Canada, October 10-17, 2021

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: IEEE International Conference on Computer Vision. pp. 12159–12168. IEEE (October 2021).https://doi.org/10.1109/ICCV48922.2021.01196

work page doi:10.1109/iccv48922.2021.01196 2021
[33]

IEEE Transactions on Pattern Analysis and Machine Intelligence44(3), 1623–1637 (2019).https://doi.org/10.1109/TPAMI.2020.3019967

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence44(3), 1623–1637 (2019).https://doi.org/10.1109/TPAMI.2020.3019967

work page doi:10.1109/tpami.2020.3019967 2019
[34]

In: German Conference on Pattern Recognition

Scharstein, D., Hirschmüller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: High-resolution stereo datasets with subpixel-accurate ground truth. In: German Conference on Pattern Recognition. pp. 31–42. Springer International Publishing (2014).https://doi.org/10.1007/978-3-319-11752-2_3

work page doi:10.1007/978-3-319-11752-2_3 2014
[35]

IEEE (2017).https://doi.org/10.1109/CVPR.2017.272

Schöps, T., Schönberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-cameravideos.In:ComputerVisionandPatternRecognition.pp.2538–2547. IEEE (2017).https://doi.org/10.1109/CVPR.2017.272

work page doi:10.1109/cvpr.2017.272 2017
[36]

In: IEEE Workshop/Winter Conference on Applications of Computer Vision

Shamsafar,F.,Woerz,S.,Rahim,R.,Zell,A.:Mobilestereonet:Towardslightweight deep networks for stereo matching. In: IEEE Workshop/Winter Conference on Applications of Computer Vision. pp. 2417–2426 (2021).https://doi.org/10. 1109/WACV51458.2022.00075

arXiv 2021
[37]

In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXX

Shen, Z., Dai, Y., Song, X., Rao, Z., Zhou, D., Zhang, L.: Pcw-net: Pyramid combi- nation and warping cost volume for stereo matching. In: European Conference on Computer Vision. pp. 280–297. Springer (2020).https://doi.org/10.1007/978- 3-031-19824-3_17

work page doi:10.1007/978- 2020
[38]

2024 , issue_date =

Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing568, 127063 (2021).https://doi. org/10.1016/j.neucom.2023.127063

work page doi:10.1016/j.neucom.2023.127063 2021
[39]

In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Tankovich, V., Häne, C., Fanello, S., Zhang, Y., Izadi, S., Bouaziz, S.: Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In: Computer Vision and Pattern Recognition. pp. 14362–14372 (2020).https://doi. org/10.1109/CVPR46437.2021.01413

work page doi:10.1109/cvpr46437.2021.01413 2020
[40]

The Newer College Dataset: Handheld LiDAR, inertial and vision with ground truth,

Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: IEEE/RJS International Conference on Intelligent RObots and Systems. pp. 4909–4916. IEEE (2020).https://doi.org/10.1109/IROS45743.2020.9341801

work page doi:10.1109/iros45743.2020.9341801 2020
[41]

Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships,

Wang, X., Xu, G., Jia, H., Yang, X.: Selective-stereo: Adaptive frequency infor- mation selection for stereo matching. In: Computer Vision and Pattern Recogni- tion. pp. 19701–19710. IEEE (June 2024).https://doi.org/10.1109/CVPR52733. 2024.01863 LinStereo: Linear-Complexity Global Attention for Stereo Matching 19

work page doi:10.1109/cvpr52733 2024
[42]

Wang, Y., Li, K., Wang, L., Hu, J., Wu, D.O., Guo, Y.: Adstereo: Efficient stereo matchingwithadaptivedownsamplinganddisparityalignment.IEEETransactions on Image Processing34, 1204–1218 (2025).https://doi.org/10.1109/TIP.2025. 3540282

work page doi:10.1109/tip.2025 2025
[43]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Wen, B., Trepte, M., Aribido, J., Kautz, J., Gallo, O., Birchfield, S.: Foundation- Stereo: Zero-shot stereo matching. In: Computer Vision and Pattern Recognition. pp. 5249–5260. IEEE (2025).https://doi.org/10.1109/CVPR52734.2025.00495

work page doi:10.1109/cvpr52734.2025.00495 2025
[44]

Cambridge University Press, 2nd edn

Wrobel, B.P.: Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edn. (2001)

2001
[45]

In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Xu, B., Xu, Y., Yang, X., Jia, W., Guo, Y.: Bilateral grid learning for stereo matching networks. In: Computer Vision and Pattern Recognition. pp. 12492– 12501. IEEE (2021).https://doi.org/10.1109/CVPR46437.2021.01231

work page doi:10.1109/cvpr46437.2021.01231 2021
[46]

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

Xu, G., Wang, X., Ding, X., Yang, X.: Iterative geometry encoding volume for stereo matching. In: Computer Vision and Pattern Recognition. pp. 21919–21928. IEEE (June 2023).https://doi.org/10.1109/CVPR52729.2023.02099

work page doi:10.1109/cvpr52729.2023.02099 2023
[47]

IEEE Transactions on Pattern Analysis and Machine Intelligence47, 7108–7122 (2024).https://doi

Xu, G., Wang, X., Zhang, Z., Cheng, J., Liao, C., Yang, X.: Igev++: Iterative multi-range geometry encoding volumes for stereo matching. IEEE Transactions on Pattern Analysis and Machine Intelligence47, 7108–7122 (2024).https://doi. org/10.1109/TPAMI.2025.3569218

[48]

Xu, G., Wang, Y., Cheng, J., Tang, J., Yang, X.: Accurate and efficient stereo matching via attention concatenation volume. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(4), 2461–2474 (2022). https://doi.org/10.1109/TPAMI.2023.3335480

[49]

Xu, G., Zhou, H., Yang, X.: Cgi-stereo: Accurate and real-time stereo matching via context and geometry interaction. arXiv.org (2023). https://doi.org/10.48550/arXiv.2301.02789

[50]

Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Yu, F., Tao, D., Geiger, A.: Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(11), 13941–13960 (2022). https://doi.org/10.1109/TPAMI.2023.3298645

[51]

Xu, H., Zhang, J.: Aanet: Adaptive aggregation network for efficient stereo matching. In: Computer Vision and Pattern Recognition. pp. 1956–1965. IEEE (2020). https://doi.org/10.1109/cvpr42600.2020.00203

[52]

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: Computer Vision and Pattern Recognition. pp. 10371–10381. IEEE (2024). https://doi.org/10.1109/CVPR52733.2024.00987

[53]

Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything V2. In: Neural Information Processing Systems. pp. 21875–21911. Neural Information Processing Systems Foundation, Inc. (NeurIPS) (2024). https://doi.org/10.48550/arXiv.2406.09414

[54]

Yao, C., Yu, L., Liu, Z., Zeng, J., Wu, Y., Jia, Y.: Diving into the fusion of monocular priors for generalized stereo matching. In: IEEE International Conference on Computer Vision. pp. 14887–14897. IEEE (2025). https://doi.org/10.1109/ICCV51701.2025.01381

[55]

Zhu, L., Gao, Y., Zhang, J., Li, Y., Li, X.: Reliable and effective stereo matching for underwater scenes. Remote Sensing 16(23), 4570 (2024). https://doi.org/10.3390/rs16234570

Appendix A Additional Ablation Studies

We supplement the ablation studies in the main text with additional hyperparameter sweeps covering HSCV structure (Tab. 8...


[1] [1]

Akkaynak, D., Treibitz, T.: A revised underwater image formation model. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6723–. IEEE (2018). https://doi.org/10.1109/CVPR.2018.00703

[3] [3]

Bangunharcana, A., Cho, J.W., Lee, S., Kweon, I.S., Kim, K.S., Kim, S.: Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 3542–. IEEE (2021). https://doi.org/10.1109/IROS51168.2021.9635909

[5] [6]

Berman, D., Levy, D., Avidan, S., Treibitz, T.: Underwater single image color restoration using haze-lines and a new quantitative dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(8), 2822–2837 (2018). https://doi.org/10.1109/tpami.2020.2977624

[6] [7]

Bochkovskiy, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. arXiv.org (2024). https://doi.org/10.48550/arXiv.2410.02073

[7] [8]

Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An Information-Rich 3D Model Repository. Tech. Rep. arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago (2015)


[8] [9]

Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5410–5418. IEEE (2018). https://doi.org/10.1109/CVPR.2018.00567

[9] [11]

Cheng, J., Liu, L., Xu, G., Wang, X., Zhang, Z., Deng, Y., Zang, J., Chen, Y., Cai, Z., Yang, X.: Monster: Marry monodepth to stereo unleashes power. In: Computer Vision and Pattern Recognition. pp. 6273–6282. IEEE (2025). https://doi.org/10.1109/CVPR52734.2025.00588

[10] [12]

Duggal, S., Wang, S., Ma, W.C., Hu, R., Urtasun, R.: Deeppruner: Learning efficient stereo matching via differentiable patchmatch. In: IEEE International Conference on Computer Vision. pp. 4383–4392. IEEE (2019). https://doi.org/10.1109/ICCV.2019.00448

[11] [13]

Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3354–3361. IEEE (2012). https://doi.org/10.1109/CVPR.2012.6248074

[12] [14]

Gómez, Á.: GA-Net: Guided aggregation net for end-to-end stereo matching. In: Image Processing On Line. pp. 185–194 (2023). https://doi.org/10.5201/ipol.2023.441

[13] [15]

Gu, X., Fan, Z., Zhu, S., Dai, Z., Tan, F., Tan, P.: Cascade cost volume for high-resolution multi-view stereo and stereo matching. In: Computer Vision and Pattern Recognition. pp. 2495–2504 (2019). https://doi.org/10.1109/CVPR42600.2020.00257

[14] [16]

Guan, T., Guo, J., Wang, C., Liu, Y.: Bridgedepth: Bridging monocular and stereo reasoning with latent alignment. In: IEEE International Conference on Computer Vision. pp. 27681–27691. IEEE (2025). https://doi.org/10.1109/ICCV51701.2025.02570

[15] [17]

Guan, T., Wang, C., Liu, Y.H.: Neural markov random field for stereo matching. In: Computer Vision and Pattern Recognition. pp. 5459–5469. IEEE (June 2024). https://doi.org/10.1109/CVPR52733.2024.00522

[16] [18]

Guo, X., Zhang, C., Nie, D., Zheng, W., Zhang, Y., Chen, L.: Lightstereo: Channel boost is all you need for efficient 2d cost aggregation. In: IEEE International Conference on Robotics and Automation (2024). https://doi.org/10.1109/ICRA55743.2025.11127711

[17] [19]

Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo network. In: Computer Vision and Pattern Recognition. pp. 3268–3277. IEEE (2019). https://doi.org/10.1109/CVPR.2019.00339

[18] [20]

Hu, M., Yin, W., Zhang, C., Cai, Z., Long, X., Chen, H., Wang, K., Yu, G., Shen, C., Shen, S.: Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), 10579–10596 (2024). https://doi.org/10.1109/TPAMI.2024.3444912

[19] [21]

Jiang, H., Lou, Z., Ding, L., Xu, R., Tan, M., Jiang, W., Huang, R.: DEFOM-Stereo: Depth foundation model based stereo matching. In: Computer Vision and Pattern Recognition. pp. 21857–21867. IEEE (2025). https://doi.org/10.1109/CVPR52734.2025.02036

[20] [22]

Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 66–75. IEEE (2017). https://doi.org/10.1109/iccv.2017.17

[21] [23]

Khamis, S., Fanello, S., Rhemann, C., Kowdle, A., Valentin, J., Izadi, S.: Stereonet: Guided hierarchical refinement for real-time edge-aware depth prediction. In: European Conference on Computer Vision. pp. 596–613. Springer International Publishing (2018). https://doi.org/10.1007/978-3-030-01267-0_35

[22] [24]

Li, J., Wang, P., Xiong, P., Cai, T., Yan, Z., Yang, L., Liu, J., Fan, H., Liu, S.: Practical stereo matching via cascaded recurrent network with adaptive correlation. In: Computer Vision and Pattern Recognition. pp. 16242–16251. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.01578

[23] [25]

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv.org (2025). https://doi.org/10.48550/arXiv.2511.10647

[24] [26]

Lipson, L., Teed, Z., Deng, J.: Raft-stereo: Multilevel recurrent field transforms for stereo matching. In: International Conference on 3D Vision. pp. 218–227. IEEE (2021). https://doi.org/10.1109/3DV53792.2021.00032

[25] [27]

Lv, Q., Dong, J., Li, Y., Chen, S., Yu, H., Zhang, S., Wang, W.: Uwstereo: A large synthetic dataset for underwater stereo matching. IEEE Transactions on Circuits and Systems for Video Technology (2024). https://doi.org/10.1109/TCSVT.2025.3572044

[26] [28]

Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Computer Vision and Pattern Recognition. pp. 4040–4048 (2015). https://doi.org/10.1109/CVPR.2016.438

[27] [29]

Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Computer Vision and Pattern Recognition. pp. 3061–3070. IEEE (2015). https://doi.org/10.1109/CVPR.2015.7298925

[28] [30]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. (2023). https://doi.org/10.48550/arXiv.2304.07193

[29] [31]

Ramirez, P.Z., Tosi, F., Poggi, M., Salti, S., Mattoccia, S., Stefano, L.D.: Open challenges in deep stereo: the booster dataset. In: Computer Vision and Pattern Recognition. pp. 21136–21146. IEEE (2022). https://doi.org/10.1109/CVPR52688.2022.02049

[30] [32]

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: IEEE International Conference on Computer Vision. pp. 12159–12168. IEEE (October 2021). https://doi.org/10.1109/ICCV48922.2021.01196

[31] [33]

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(3), 1623–1637 (2019). https://doi.org/10.1109/TPAMI.2020.3019967

[32] [34]

Scharstein, D., Hirschmüller, H., Kitajima, Y., Krathwohl, G., Nesic, N., Wang, X., Westling, P.: High-resolution stereo datasets with subpixel-accurate ground truth. In: German Conference on Pattern Recognition. pp. 31–42. Springer International Publishing (2014). https://doi.org/10.1007/978-3-319-11752-2_3

[33] [35]

Schöps, T., Schönberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: Computer Vision and Pattern Recognition. pp. 2538–2547. IEEE (2017). https://doi.org/10.1109/CVPR.2017.272

[34] [36]

Shamsafar, F., Woerz, S., Rahim, R., Zell, A.: Mobilestereonet: Towards lightweight deep networks for stereo matching. In: IEEE Workshop/Winter Conference on Applications of Computer Vision. pp. 2417–2426 (2021). https://doi.org/10.1109/WACV51458.2022.00075

[35] [37]

Shen, Z., Dai, Y., Song, X., Rao, Z., Zhou, D., Zhang, L.: Pcw-net: Pyramid combination and warping cost volume for stereo matching. In: European Conference on Computer Vision. pp. 280–297. Springer (2020). https://doi.org/10.1007/978-3-031-19824-3_17

[36] [38]

Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2021). https://doi.org/10.1016/j.neucom.2023.127063

[37] [39]

Tankovich, V., Häne, C., Fanello, S., Zhang, Y., Izadi, S., Bouaziz, S.: Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In: Computer Vision and Pattern Recognition. pp. 14362–14372 (2020). https://doi.org/10.1109/CVPR46437.2021.01413

[38] [40]

Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 4909–4916. IEEE (2020). https://doi.org/10.1109/IROS45743.2020.9341801
