
arxiv: 2606.25437 · v1 · pith:MQBENCO7new · submitted 2026-06-24 · 💻 cs.CV

LinStereo: Linear-Complexity Global Attention for Multi-Scale Iterative Stereo Matching

Pith reviewed 2026-06-25 21:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo matching · linear attention · vision foundation models · disparity estimation · underwater vision · iterative refinement · global aggregation · cross-domain generalization

The pith

LinStereo replaces local recurrence in stereo matching with a linear-cost global attention module that spreads reliable disparities from clear to degraded regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard VFM-based iterative stereo pipelines collapse multi-scale backbone features into single-level correlations, leave geometric priors unused at initialization, and restrict context propagation to local operations. These limitations become acute under degraded photometric conditions such as underwater scenes. LinStereo introduces a Position-Aware Linear Attention module that performs global aggregation at linear cost, supported by Hierarchical Semantic Cost Volumes drawn from the VFM feature hierarchy and a Depth Prior Initialization that supplies a metrically calibrated starting point from monocular depth. The design lets reliable disparity estimates propagate from well-matched areas into problematic ones while preserving disparity structure. Reported results show state-of-the-art-level accuracy on conventional benchmarks together with large gains on two underwater datasets.

Core claim

The paper claims that the Position-Aware Linear Attention module, combined with scale-aligned Hierarchical Semantic Cost Volumes and Depth Prior Initialization inside an iterative stereo pipeline built on Depth Anything V3, achieves global context propagation at linear complexity. It reports state-of-the-art-level accuracy on standard benchmarks and the best overall accuracy, with AbsRel lower by 28 percent on TartanAir-UW and by 26 percent on SQUID.
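For orientation, AbsRel is conventionally the mean absolute relative error over valid pixels, and a figure such as "28 percent lower" is most naturally read as a relative reduction against a baseline. The sketch below assumes that convention; the function names and toy arrays are illustrative only, and whether the paper evaluates AbsRel on depth or on disparity is not stated on this page.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Mean absolute relative error over valid pixels (gt > 0 by default)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = gt > 0  # assumption: non-positive ground truth marks missing pixels
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline_err, new_err):
    """Fraction by which new_err is lower than baseline_err."""
    return (baseline_err - new_err) / baseline_err

# Toy data, not taken from the paper: one invalid pixel (gt == 0).
gt = np.array([[2.0, 4.0], [0.0, 10.0]])
baseline_pred = np.array([[2.5, 5.0], [1.0, 12.5]])
improved_pred = np.array([[2.2, 4.4], [1.0, 11.0]])

e_base = abs_rel(baseline_pred, gt)
e_new = abs_rel(improved_pred, gt)
print(e_base, e_new, relative_reduction(e_base, e_new))
```

Reporting the absolute error values alongside the relative reduction makes a headline percentage easier to audit.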

What carries the argument

Position-Aware Linear Attention (PALA) module, which replaces local recurrence with global aggregation of disparity information at linear computational cost while preserving structure.
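To make "global aggregation at linear cost" concrete, here is a minimal sketch of generic kernelized linear attention. It is not the paper's PALA module: the `elu(x) + 1` feature map is an assumption borrowed from common linear-attention practice, and the positional encoding and gating branch described in the Figure 1 caption are omitted. The point is only that reordering the matrix products avoids forming the N × N attention matrix.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive kernel feature map (an assumption, not from the paper)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (N, d); v: (N, c). Cost grows linearly in N."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # (d, c): one global summary of keys and values
    z = q @ k.sum(axis=0)      # (N,): per-query normaliser
    return (q @ kv) / (z[:, None] + eps)

def quadratic_reference(q, k, v, eps=1e-6):
    """Same result via the explicit (N, N) attention matrix, for comparison."""
    q, k = feature_map(q), feature_map(k)
    a = q @ k.T
    return (a @ v) / (a.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 8))
k = rng.standard_normal((64, 8))
v = rng.standard_normal((64, 4))
print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

Both functions compute the same output; the first only ever stores a d × c summary, which is what makes every pixel able to attend to every other pixel without quadratic memory.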

If this is right

  • Reliable disparity estimates can be propagated across the entire image without quadratic global attention cost.
  • Multi-scale VFM features can be used directly as scale-aligned correlations rather than collapsed to a single level.
  • Monocular depth estimates can serve as an effective metrically calibrated initialization for stereo refinement.
  • Performance gains hold across both standard indoor/outdoor benchmarks and severely degraded underwater domains.
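The third point above rests on the standard rectified-stereo relation between depth and disparity, d = f · B / Z. A minimal sketch of turning a metric depth map into a disparity starting point follows, assuming a rectified pinhole pair with known focal length (in pixels) and baseline (in metres). The function name and the example values are hypothetical, and how the paper's DPI actually calibrates the monocular prediction is not described on this page.

```python
import numpy as np

def depth_to_disparity(depth_m, focal_px, baseline_m, min_depth=1e-3):
    """Convert metric depth (metres) to disparity (pixels): d = f * B / Z.

    Pixels with depth at or below min_depth are treated as invalid and set to 0.
    """
    depth = np.asarray(depth_m, dtype=np.float64)
    disp = np.zeros_like(depth)
    valid = depth > min_depth
    disp[valid] = focal_px * baseline_m / depth[valid]
    return disp

# Hypothetical camera: 720 px focal length, 0.5 m baseline.
depth = np.array([[2.0, 4.0], [0.0, 8.0]])
print(depth_to_disparity(depth, focal_px=720.0, baseline_m=0.5))
```

Because disparity is inversely proportional to depth, a small scale error in the monocular prediction becomes a large disparity error for near objects, which is one reason a metrically calibrated prior matters for a warm start.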

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear global aggregation pattern could be tested on related dense correspondence tasks such as optical flow where long-range propagation is also needed.
  • If the propagation mechanism works, future stereo pipelines may be able to reduce the number of local refinement iterations while maintaining accuracy.
  • The approach highlights that hierarchical cost volumes can bridge the gap between monocular priors and stereo geometry without requiring additional network branches.

Load-bearing premise

The underlying VFM features and hierarchical cost volumes supply enough reliable signal for the attention module to propagate accurate disparity estimates from well-matched regions into areas with degraded photometric cues.
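As a rough illustration of what per-scale correlation volumes look like, the sketch below builds a horizontal correlation volume at several resolutions. It is a generic construction, not the paper's HSCV: average pooling stands in for the VFM feature hierarchy, and all names are hypothetical.

```python
import numpy as np

def correlation_volume(feat_l, feat_r, max_disp):
    """feat_*: (C, H, W). Returns (max_disp, H, W); entry [d, y, x] is the
    channel-averaged dot product of left pixel x and right pixel x - d."""
    c, h, w = feat_l.shape
    vol = np.zeros((max_disp, h, w), dtype=feat_l.dtype)
    for d in range(min(max_disp, w)):
        vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, : w - d]).sum(axis=0) / c
    return vol

def avg_pool2(feat):
    """2x2 average pooling, a stand-in for a coarser backbone level."""
    c, h, w = feat.shape
    h2, w2 = h // 2 * 2, w // 2 * 2
    return feat[:, :h2, :w2].reshape(c, h2 // 2, 2, w2 // 2, 2).mean(axis=(2, 4))

def hierarchical_volumes(feat_l, feat_r, max_disp, levels=3):
    """One correlation volume per scale; the disparity range halves per level."""
    vols = []
    for _ in range(levels):
        vols.append(correlation_volume(feat_l, feat_r, max_disp))
        feat_l, feat_r = avg_pool2(feat_l), avg_pool2(feat_r)
        max_disp = max(1, max_disp // 2)
    return vols

rng = np.random.default_rng(0)
left = rng.standard_normal((16, 8, 32))
right = np.roll(left, -3, axis=2)  # synthetic pair with a true disparity of 3
print([v.shape for v in hierarchical_volumes(left, right, max_disp=8)])
```

Where the features carry little signal, every disparity slice looks alike, which is exactly the situation in which the premise above would have to hold through propagation from better-matched regions.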

What would settle it

Apply the pipeline to a controlled test set in which large contiguous regions have their initial matches deliberately corrupted or removed and measure whether final disparity accuracy still exceeds local-recurrence baselines.
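One way such a test could be set up is sketched below, purely as an assumption about the protocol: a rectangular region of an initial disparity map is overwritten with random values, and the error is then reported separately inside and outside the corrupted region. Where the corruption enters a real model (initial disparity, cost volume, or input image) is a design choice this page does not specify, and the function names are hypothetical.

```python
import numpy as np

def corrupt_region(init_disp, top, left, height, width, rng, max_disp):
    """Overwrite a contiguous rectangle with uniform random disparities."""
    out = np.array(init_disp, dtype=np.float64, copy=True)
    mask = np.zeros(out.shape, dtype=bool)
    mask[top:top + height, left:left + width] = True
    out[mask] = rng.uniform(0.0, max_disp, size=int(mask.sum()))
    return out, mask

def region_errors(pred, gt, mask):
    """Mean absolute error inside and outside the corrupted region."""
    err = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64))
    return float(err[mask].mean()), float(err[~mask].mean())

rng = np.random.default_rng(0)
gt = np.full((8, 8), 10.0)
corrupted, mask = corrupt_region(gt, top=2, left=3, height=4, width=2,
                                 rng=rng, max_disp=64.0)
inside, outside = region_errors(corrupted, gt, mask)
print(inside, outside)
```

Running each method on the corrupted input and comparing the inside-region error of its final output would show whether global propagation recovers more than a local-recurrence baseline.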

Figures

Figures reproduced from arXiv: 2606.25437 by Oliver Turner, Viorela Ila, Yiran Wang.

Figure 1: Detailed Architecture of LinStereo: Our model replaces the ConvGRU update operator with the Position-Aware Linear Attention (PALA) updater, which enables global spatial reasoning over cost volume features at linear complexity by attention. The attention path applies kernel-activated queries and keys with 2D rotary positional encoding, while a parallel context modulation branch provides adaptive gating fo…
Figure 2: Qualitative comparison on standard and underwater stereo benchmarks. progressive build-up of the PALA block from a position-agnostic linear attention baseline (A). Each addition yields consistent gains on both benchmarks, with global spatial encoding (A→B) providing the largest single improvement, confirming that restoring positional structure is the most critical enhancement to vanilla linear attention. …
Figure 3: Synthesis of the SeaStereo-Dataset: The Blender Python API automated the rendering pipeline by iterating through each configuration of features (camera path, camera model, water type, depth and ShapeNet object subset). Each configuration generated stereo image pairs (IL, IR) and disparity maps (DL, DR). and placed within the scene. This process was repeated for three iterations per configuration, with diff…
Figure 4: SeaStereo-Dataset Examples: Left camera renders from SeaStereo-Dataset, illustrating variation in seafloor depth and water type. These examples demonstrate the controlled changes in underwater visibility and environmental configuration used during dataset generation.
Figure 5: Additional SeaStereo-Dataset samples. Each row shows three sets of stereo RGB images (left, right) and their corresponding disparity annotations under different Jerlov water types, seafloor depths, and cameras. The dataset captures diverse underwater conditions ranging from clear to highly turbid water. than simplifying the architecture, we reduce the number of refinement iterations required for convergenc…
Figure 6: Additional qualitative comparison on Booster (Q).
Figure 7: Additional qualitative comparison on ETH3D.
Figure 8: Additional qualitative comparison on KITTI 2012.
Figure 9: Additional qualitative comparison on KITTI 2015.
Figure 10: Additional qualitative comparison on TartanAir-UW (1/2).
Figure 11: Additional qualitative comparison on TartanAir-UW (2/2).
Figure 12: Additional qualitative comparison on SQUID.
Figure 13: Qualitative results on the laboratory tank dataset.
Figure 14: Qualitative comparison on SeaStereo-Dataset (1/2).
Figure 15: Qualitative comparison on SeaStereo-Dataset (2/2).
read the original abstract

Existing Vision Foundation Model (VFM)-based iterative stereo pipelines under-exploit three information pathways: multi-scale backbone features are collapsed into single-level correlations, geometric priors remain untapped at initialization, and context propagates only locally. These gaps widen under degraded photometric cues, making underwater scenes a stringent generalization test. To address this, we propose LinStereo, built upon Depth Anything V3, whose core is a Position-Aware Linear Attention (PALA) module that replaces local recurrence with global aggregation at linear cost, propagating reliable estimates from well-matched regions into degraded areas while preserving disparity structure. PALA is made effective by two enabling components: Hierarchical Semantic Cost Volumes (HSCV), which supply scale-aligned correlations from the VFM feature hierarchy, and a Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. LinStereo achieves state-of-the-art-level accuracy on standard benchmarks and strong cross-domain generalization, particularly on underwater scene where severe photometric degradation makes stereo matching particularly challenging, attaining the best overall accuracy with consistent gains 28% lower AbsRel on TartanAir-UW, 26% on SQUID, a real-world underwater dataset).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes LinStereo, an iterative stereo matching method built on Depth Anything V3. Its core contributions are the Position-Aware Linear Attention (PALA) module for global context aggregation at linear cost, Hierarchical Semantic Cost Volumes (HSCV) to supply scale-aligned correlations from the VFM feature hierarchy, and Depth Prior Initialization (DPI) that converts monocular depth into a metrically calibrated warm start. The central empirical claim is state-of-the-art-level accuracy on standard benchmarks together with strong cross-domain generalization on underwater scenes, specifically 28% lower AbsRel on TartanAir-UW and 26% on the real-world SQUID dataset.

Significance. If the reported accuracy gains and cross-domain improvements hold under rigorous evaluation, the work would show that linear-complexity global attention combined with semantic cost volumes and geometric priors can improve propagation of disparity estimates into photometrically degraded regions while remaining computationally tractable. This would be a meaningful advance for stereo matching in challenging environments such as underwater vision.

major comments (1)
  1. Abstract: the performance numbers (28% lower AbsRel on TartanAir-UW, 26% on SQUID) and the claim of 'state-of-the-art-level accuracy' are presented without any description of experimental protocol, baselines, metrics, or error analysis. Because these numbers constitute the central empirical claim, the absence of supporting experimental details prevents evaluation of whether the reported gains are load-bearing or reproducible.
minor comments (1)
  1. Abstract, final sentence: 'underwater scene where' should read 'underwater scenes where'; the closing parenthesis after 'dataset' is misplaced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater clarity in the abstract regarding our central empirical claims. We address this point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: the performance numbers (28% lower AbsRel on TartanAir-UW, 26% on SQUID) and the claim of 'state-of-the-art-level accuracy' are presented without any description of experimental protocol, baselines, metrics, or error analysis. Because these numbers constitute the central empirical claim, the absence of supporting experimental details prevents evaluation of whether the reported gains are load-bearing or reproducible.

    Authors: We agree that the abstract would benefit from additional context to support the reported performance numbers and SOTA-level claim. In the revised version, we will update the abstract to briefly specify the evaluation benchmarks (TartanAir-UW and SQUID), the primary metric (AbsRel), and a reference to the full experimental protocol, baselines, and analysis detailed in Section 4. This change will improve evaluability while preserving conciseness. The full details, including comparisons and error breakdowns, remain in the Experiments section as before.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical architecture (PALA for global linear attention, HSCV for hierarchical correlations, DPI for monocular initialization) built on an existing VFM backbone and reports benchmark accuracy numbers as experimental outcomes. No derivation chain, equations, or first-principles claims are present that reduce by construction to fitted parameters, self-definitions, or self-citation load-bearing premises. The central claims rest on measured performance rather than identities internal to the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about the reliability of Depth Anything V3 features and the effectiveness of the newly introduced modules; no explicit free parameters are named in the abstract, and the modules themselves constitute new entities without external validation.

axioms (2)
  • domain assumption · Depth Anything V3 provides reliable multi-scale features suitable for building hierarchical semantic cost volumes
    The pipeline is built directly upon this VFM.
  • domain assumption · Monocular depth estimates can be converted into a metrically calibrated warm-start disparity map
    This underpins the DPI component.
invented entities (3)
  • Position-Aware Linear Attention (PALA) · no independent evidence
    purpose: Replace local recurrence with global aggregation at linear cost
    Core new module introduced to address context propagation
  • Hierarchical Semantic Cost Volumes (HSCV) · no independent evidence
    purpose: Supply scale-aligned correlations from VFM feature hierarchy
    New cost-volume construction method
  • Depth Prior Initialization (DPI) · no independent evidence
    purpose: Convert monocular depth into metrically calibrated stereo initialization
    New initialization strategy

pith-pipeline@v0.9.1-grok · 5741 in / 1341 out tokens · 36437 ms · 2026-06-25T21:03:43.752207+00:00 · methodology

