
arxiv: 2604.15312 · v1 · submitted 2026-04-16 · 💻 cs.CV

Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

Pith reviewed 2026-05-10 11:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords event-frame stereo · cross-modal prompting · asymmetric stereo matching · depth estimation · event cameras · bidirectional projection · multimodal fusion

The pith

Bidirectional prompting projects event and frame data into each other's domains to recover cues for accurate stereo matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Bi-CMPStereo, a framework that uses bidirectional cross-modal prompting to handle the differences between event cameras and conventional frame cameras in stereo setups. Event cameras provide high-speed, blur-free data but miss texture, while frames supply rich context yet fail under fast motion or poor lighting. By learning representations in a shared canonical space and projecting each input into both modalities, the method integrates semantic and structural details that the modality gap otherwise discards. If correct, this yields stereo depth maps that remain reliable in dynamic scenes where single-modality approaches degrade. The authors report stronger accuracy and generalization than prior stereo methods on relevant benchmarks.

Core claim

Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains.

What carries the argument

Bidirectional cross-modal prompting that projects each modality into both event and frame domains to recover and fuse domain-specific cues.
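
As a rough illustration of how matching cues from two domains can be combined, the sketch below builds one correlation cost volume per domain for a single scanline and stacks them, loosely following the Figure 3 caption (cost volumes from the image-domain and event-domain networks are concatenated before fusion). Everything in it is an assumption made for illustration: the function names, the one-hot toy features, and the sum-then-argmax step, which only stands in for the learned 3D hourglass fusion and is not the authors' method.

```python
from typing import List

Row = List[List[float]]      # one scanline: W pixels, each a C-dim feature vector
Volume = List[List[float]]   # cost[d][x] for d in [0, max_disp)


def correlation_cost_volume(left: Row, right: Row, max_disp: int) -> Volume:
    """cost[d][x] = <left[x], right[x - d]>, or 0.0 when x - d falls off the row."""
    width = len(left)
    volume: Volume = []
    for d in range(max_disp):
        costs = []
        for x in range(width):
            if x - d < 0:
                costs.append(0.0)
            else:
                costs.append(sum(a * b for a, b in zip(left[x], right[x - d])))
        volume.append(costs)
    return volume


def concat_volumes(vol_img: Volume, vol_ev: Volume) -> List[Volume]:
    """Stack the two per-domain volumes along a new leading 'domain' axis."""
    return [vol_img, vol_ev]


def winner_takes_all(fused: List[Volume]) -> List[int]:
    """Hypothetical stand-in for learned fusion: sum over domains, argmax over d."""
    n_disp, width = len(fused[0]), len(fused[0][0])
    disparity = []
    for x in range(width):
        scores = [sum(vol[d][x] for vol in fused) for d in range(n_disp)]
        disparity.append(max(range(n_disp), key=scores.__getitem__))
    return disparity


def one_hot(index: int, size: int = 5) -> List[float]:
    """Toy feature vector: 1.0 at `index`, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy scanline: the right view is the left view shifted by one pixel.
left_row = [one_hot(i) for i in (0, 1, 2, 3)]
right_row = [one_hot(i) for i in (1, 2, 3, 4)]

# Pretend the same features came out of both domain branches.
vol_img = correlation_cost_volume(left_row, right_row, max_disp=2)
vol_ev = correlation_cost_volume(left_row, right_row, max_disp=2)
disparity = winner_takes_all(concat_volumes(vol_img, vol_ev))
```

On this toy input, every pixel with an in-range match is assigned disparity 1, and the leftmost pixel, which has no match in range, falls back to 0. A learned fusion network would replace `winner_takes_all` and could weight the two domains differently per pixel.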

If this is right

  • Stereo matching accuracy increases because complementary texture from frames and timing from events are both retained.
  • Representations become more robust to motion blur and illumination changes.
  • The same model generalizes better across different camera setups and scene speeds.
  • No extra hardware synchronization is required beyond the two modalities already present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The prompting idea could transfer to other pairs of sensors that differ in temporal resolution, such as lidar and camera fusion.
  • In robotics, this might allow lighter rigs that still produce reliable depth during high-speed maneuvers.
  • A direct test would measure whether removing the reverse projection step drops accuracy by a measurable margin on the same data.
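
One way to frame that ablation is a paired comparison of per-sequence errors from the full model and from a variant with the reverse projection removed. The sketch below assumes such per-sequence mean absolute errors are already available; the function name and the numbers in the usage lines are hypothetical placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_ablation(mae_full: Sequence[float],
                    mae_ablated: Sequence[float]) -> Tuple[float, float]:
    """Return (mean difference, standard error) of ablated minus full error.

    A positive mean difference means the ablated model is worse on average.
    Both inputs must list the same sequences in the same order.
    """
    if len(mae_full) != len(mae_ablated):
        raise ValueError("inputs must be paired, one entry per sequence")
    if len(mae_full) < 2:
        raise ValueError("need at least two sequences to estimate spread")
    diffs = [a - f for f, a in zip(mae_full, mae_ablated)]
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    return mean_diff, std_err


# Placeholder numbers for illustration only.
full_errors = [1.0, 2.0, 3.0]
ablated_errors = [1.5, 2.5, 3.5]
mean_diff, std_err = paired_ablation(full_errors, ablated_errors)
```

Pairing by sequence removes scene difficulty from the comparison, so the standard error reflects only how consistently the ablation hurts or helps.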

Load-bearing premise

The gap between event and frame data marginalizes useful cues, and bidirectional prompting can recover those cues without creating new alignment mistakes.

What would settle it

Performance on a held-out dataset of fast-motion, low-light scenes falls to or below current state-of-the-art stereo methods, or measured alignment error rises after prompting.
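
The figure captions report mean absolute error (MAE) on disparity maps, so a check of this kind would likely compare MAE between methods on the held-out scenes. The sketch below is a minimal version of that metric, assuming ground truth is valid only at some pixels and is supplied with a boolean mask; the function name and the toy arrays are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def disparity_mae(pred: Sequence[Sequence[float]],
                  gt: Sequence[Sequence[float]],
                  valid: Sequence[Sequence[bool]]) -> float:
    """Mean absolute disparity error over pixels where `valid` is True."""
    total = 0.0
    count = 0
    for pred_row, gt_row, valid_row in zip(pred, gt, valid):
        for p, g, v in zip(pred_row, gt_row, valid_row):
            if v:
                total += abs(p - g)
                count += 1
    if count == 0:
        raise ValueError("no valid ground-truth pixels")
    return total / count


# Toy 2x2 example: one pixel has no ground truth and is skipped.
pred_map: List[List[float]] = [[1.0, 2.0], [3.0, 4.0]]
gt_map: List[List[float]] = [[1.0, 4.0], [0.0, 4.0]]
valid_mask: List[List[bool]] = [[True, True], [False, True]]
mae = disparity_mae(pred_map, gt_map, valid_mask)
```

In the toy example the three valid pixels have absolute errors 0, 2, and 0, so the masked pixel's large error does not affect the score.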

Figures

Figures reproduced from arXiv: 2604.15312 by Fabio Tosi, Jiawei Han, Lihui Wang, Luca Bartolomei, Matteo Poggi, Ninghui Xu, Stefano Mattoccia, Zhiting Yao.

Figure 1. Qualitative comparison on event–frame asymmetric stereo. Compared to the state-of-the-art ZEST [41], our method achieves higher accuracy and higher-quality structural details in both complex-texture (first row) and sparse low-light scenes (second row).
Figure 2. Overview of the proposed CMPStereo network. In our bidirectional framework, the asymmetric stereo inputs are alternately designated as the target-domain and source-domain modalities. The Cross-Domain Embedding Adapter (CDEA) operates on the source domain to achieve initial source-to-target adaptation. CMPStereo employs domain-specific encoders Ft(·) and Fs(·) combined with Stereo Canonicalization Constrain…
Figure 3. Overview of the Bi-CMPStereo framework. Bi-CMPStereo integrates complementary representations from imgCMPStereo and evCMPStereo in the image and event domains for reliable disparity estimation. The two pre-trained CMPStereo networks are frozen as stereo feature extractors to construct multi-scale cost volumes, which are concatenated and fused via a 3D hourglass network at 1/4 scale, followed by the same ca…
Figure 4. Qualitative comparison with state-of-the-art methods on a low-light scene from the DSEC dataset [20]. The mean absolute error (MAE) is shown at the bottom right of each estimated disparity map. As can be seen in the highlighted regions, our method more accurately reconstructs the fine structural details of the cars.
Figure 5. Generalization Result. Qualitative comparison of cross-dataset generalization on the MVSEC [87] dataset.
Figure 6. Visual comparison between Bi-CMPStereo with and without HVT. Generalization results on MVSEC [87].
Figure 7. Additional qualitative results in nighttime scenario from the DSEC dataset [20]. The mean absolute error (MAE) is shown at the bottom right of each estimated disparity map.
Figure 8. Additional qualitative results in daytime scenario from the DSEC dataset [20].

Abstract

Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Bi-CMPStereo, a bidirectional cross-modal prompting framework for event-frame asymmetric stereo matching. It claims to address the modality gap by learning finely aligned stereo representations within a target canonical space, integrating complementary semantic and structural features via bidirectional projection of each modality into both event and frame domains, and demonstrates significant outperformance over state-of-the-art methods in accuracy and generalization through extensive experiments.

Significance. If the empirical claims hold with proper validation, the framework could advance reliable 3D perception for dynamic scenes and challenging illumination by better preserving domain-specific cues that standard cross-modal matching tends to marginalize.

major comments (2)
  1. [Abstract] Abstract: The assertion of significant outperformance over SOTA methods in accuracy and generalization is presented without any quantitative results, baselines, ablation studies, or error analysis, preventing evaluation of the data-to-claim link.
  2. [Method] Method section: The bidirectional cross-modal prompting is described at a high level without specifying the projection mechanism (e.g., learned adapters, temporal aggregation for sparse events, or cycle-consistency losses), which is load-bearing for the claim that domain-specific cues are recovered without introducing new alignment errors or information loss.
minor comments (1)
  1. The 'canonical space' is referenced without a formal definition, diagram, or equation clarifying the projection operators and alignment process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Revisions will be made to address the concerns raised.

  1. Referee: [Abstract] Abstract: The assertion of significant outperformance over SOTA methods in accuracy and generalization is presented without any quantitative results, baselines, ablation studies, or error analysis, preventing evaluation of the data-to-claim link.

    Authors: We agree that the abstract would benefit from including quantitative highlights to strengthen the link between claims and evidence. In the revised manuscript, we will update the abstract to incorporate key performance metrics demonstrating outperformance (e.g., accuracy and generalization improvements on standard benchmarks). The full details on baselines, ablation studies, and error analysis are provided in the Experiments section, but we will ensure the abstract offers a clearer summary of these results. revision: yes

  2. Referee: [Method] Method section: The bidirectional cross-modal prompting is described at a high level without specifying the projection mechanism (e.g., learned adapters, temporal aggregation for sparse events, or cycle-consistency losses), which is load-bearing for the claim that domain-specific cues are recovered without introducing new alignment errors or information loss.

    Authors: We appreciate the referee highlighting this point and acknowledge that the current Method section presents the bidirectional cross-modal prompting at a high level. In the revised version, we will expand this section to explicitly specify the projection mechanisms, including details on learned adapters for domain mapping, temporal aggregation approaches for handling sparse event data, and the incorporation of cycle-consistency losses. These additions, along with supporting equations and illustrations, will clarify how domain-specific cues are preserved without introducing alignment errors or information loss. revision: yes

Circularity Check

0 steps flagged

No circularity: architecture proposal with no derivations or fitted predictions


The paper introduces Bi-CMPStereo as a bidirectional cross-modal prompting framework for event-frame stereo matching. The abstract and available description contain no equations, no parameter fitting presented as prediction, no uniqueness theorems, and no self-citations that bear the central claim. The approach is defined by its architectural choices (canonical space alignment and bidirectional projection), which are independent of any input data or prior results by construction. Empirical outperformance is asserted via experiments rather than derived from the inputs themselves. This is a standard non-circular method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit parameters, axioms, or invented entities; assessment is limited to surface claims.



Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 1 internal anchor

  1. [1]

    Deep event stereo leveraged by event-to-image translation

    Soikat Hasan Ahmed, Hae Woong Jang, SM Nadim Uddin, and Yong Ju Jung. Deep event stereo leveraged by event-to-image translation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 882–890, 2021

  2. [2]

    Neural disparity refinement for arbitrary resolution stereo

    Filippo Aleotti, Fabio Tosi, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Neural disparity refinement for arbitrary resolution stereo. In 2021 International Conference on 3D Vision (3DV), pages 207–217. IEEE, 2021

  3. [3]

    Lidar-event stereo fusion with hallucinations

    Luca Bartolomei, Matteo Poggi, Andrea Conti, and Stefano Mattoccia. Lidar-event stereo fusion with hallucinations. In European Conference on Computer Vision, pages 125–145. Springer, 2024

  4. [4]

    Depth anyevent: A cross-modal distillation paradigm for event-based monocular depth estimation

    Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Depth anyevent: A cross-modal distillation paradigm for event-based monocular depth estimation. arXiv preprint arXiv:2509.15224, 2025

  5. [5]

    Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail

    Luca Bartolomei, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  6. [6]

    M3ed: Multi-robot, multi-sensor, multi-environment event dataset

    Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4016–4023, 2023

  7. [7]

    Pyramid stereo matching network

    Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5410–5418, 2018

  8. [8]

    Domain generalized stereo matching via hierarchical visual transformation

    Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang. Domain generalized stereo matching via hierarchical visual transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9559–9568, 2023

  9. [9]

    Depth from asymmetric frame-event stereo: A divide-and-conquer approach

    Xihao Chen, Wenming Weng, Yueyi Zhang, and Zhiwei Xiong. Depth from asymmetric frame-event stereo: A divide-and-conquer approach. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3045–3054, 2024

  10. [10]

    Mocha-stereo: Motif channel attention network for stereo matching

    Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif channel attention network for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  11. [11]

    Monster: Marry monodepth to stereo unleashes power

    Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, and Xin Yang. Monster: Marry monodepth to stereo unleashes power. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  12. [12]

    Event-image fusion stereo using cross-modality feature propagation

    Hoonhee Cho and Kuk-Jin Yoon. Event-image fusion stereo using cross-modality feature propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 454–462, 2022

  13. [13]

    Selection and cross similarity for event-image deep stereo

    Hoonhee Cho and Kuk-Jin Yoon. Selection and cross similarity for event-image deep stereo. In European Conference on Computer Vision, pages 470–486. Springer, 2022

  14. [14]

    Learning adaptive dense event stereo from the image domain

    Hoonhee Cho, Jegyeong Cho, and Kuk-Jin Yoon. Learning adaptive dense event stereo from the image domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17797–17807, 2023

  15. [15]

    Non-coaxial event-guided motion deblurring with spatial alignment

    Hoonhee Cho, Yuhwan Jeong, Taewoo Kim, and Kuk-Jin Yoon. Non-coaxial event-guided motion deblurring with spatial alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12492–12503, 2023

  16. [16]

    Temporal event stereo via joint learning with stereoscopic flow

    Hoonhee Cho, Jae-Young Kang, and Kuk-Jin Yoon. Temporal event stereo via joint learning with stereoscopic flow. In European Conference on Computer Vision, pages 294–314. Springer, 2024

  17. [17]

    Itsa: An information-theoretic approach to automatic shortcut avoidance and domain generalization in stereo matching networks

    WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Alireza Bab-Hadiashar, and David Suter. Itsa: An information-theoretic approach to automatic shortcut avoidance and domain generalization in stereo matching networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13022–13032, 2022

  18. [18]

    Video frame interpolation with stereo event and intensity cameras

    Chao Ding, Mingyuan Lin, Haijian Zhang, Jianzhuang Liu, and Lei Yu. Video frame interpolation with stereo event and intensity cameras. IEEE Transactions on Multimedia, 26:9187–9202, 2024

  19. [19]

    Event-based vision: A survey

    Guillermo Gallego, Tobi Delbrück, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, Jörg Conradt, Kostas Daniilidis, et al. Event-based vision: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

  20. [20]

    Dsec: A stereo event camera dataset for driving scenarios

    Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021

  21. [21]

    Two-stage cross-fusion network for stereo event-based depth estimation

    Dipon Kumar Ghosh and Yong Ju Jung. Two-stage cross-fusion network for stereo event-based depth estimation. Expert Systems with Applications, 241:122743, 2024

  22. [22]

    Event-based stereo depth estimation: A survey

    Suman Ghosh and Guillermo Gallego. Event-based stereo depth estimation: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  23. [23]

    Unsupervised monocular depth estimation with left-right consistency

    Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279, 2017

  24. [24]

    Ov3r: Open-vocabulary semantic 3d reconstruction from rgb videos

    Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, and Matteo Poggi. Ov3r: Open-vocabulary semantic 3d reconstruction from rgb videos. arXiv preprint arXiv:2507.22052, 2025

  25. [25]

    Bridgedepth: Bridging monocular and stereo reasoning with latent alignment

    Tongfan Guan, Jiaxin Guo, Chen Wang, and Yun-Hui Liu. Bridgedepth: Bridging monocular and stereo reasoning with latent alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, Hawaii, USA, 2025. ICCV 2025 Highlight

  26. [26]

    Context-enhanced stereo transformer

    Weiyu Guo, Zhaoshuo Li, Yongkui Yang, Zheng Wang, Russell H Taylor, Mathias Unberath, Alan Yuille, and Yingwei Li. Context-enhanced stereo transformer. In European Conference on Computer Vision, pages 263–279. Springer, 2022

  27. [27]

    Group-wise correlation stereo network

    Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3273–3282, 2019

  28. [28]

    Defom-stereo: Depth foundation model based stereo matching

    Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. Defom-stereo: Depth foundation model based stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  29. [29]

    End-to-end learning of geometry and context for deep stereo regression

    Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE international conference on computer vision, pages 66–75, 2017

  30. [30]

    Real-time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance

    Haram Kim, Sangil Lee, Junha Kim, and H Jin Kim. Real-time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance. IEEE Robotics and Automation Letters, 8(1):416–423, 2022

  31. [31]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  32. [32]

    A survey on deep learning techniques for stereo-based depth estimation

    Hamid Laga, Laurent Valentin Jospin, Farid Boussaid, and Mohammed Bennamoun. A survey on deep learning techniques for stereo-based depth estimation. IEEE transactions on pattern analysis and machine intelligence, 44(4):1738–1764, 2020

  33. [33]

    Practical stereo matching via cascaded recurrent network with adaptive correlation

    Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. Practical stereo matching via cascaded recurrent network with adaptive correlation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16263–16272, 2022

  34. [34]

    Active event-based stereo vision

    Jianing Li, Yunjian Zhang, Haiqian Han, and Xiangyang Ji. Active event-based stereo vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 971–981, 2025

  35. [35]

    Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers

    Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, Francis X. Creighton, Russell H. Taylor, and Mathias Unberath. Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6197–6206, 2021

  36. [36]

    Learning for disparity estimation through feature constancy

    Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2811–2820, 2018

  37. [37]

    A 128×128 120 db 15µs latency asynchronous temporal contrast vision sensor

    Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 db 15µs latency asynchronous temporal contrast vision sensor. IEEE journal of solid-state circuits, 43(2):566–576, 2008

  38. [38]

    Learning parallax for stereo event-based motion deblurring

    Mingyuan Lin, Chi Zhang, Chu He, and Lei Yu. Learning parallax for stereo event-based motion deblurring. IEEE Transactions on Circuits and Systems for Video Technology, 2025

  39. [39]

    Raft-stereo: Multilevel recurrent field transforms for stereo matching

    Lahav Lipson, Zachary Teed, and Jia Deng. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In 2021 International Conference on 3D Vision (3DV), pages 218–227. IEEE, 2021

  40. [40]

    Graftnet: Towards domain generalized stereo matching with a broad-spectrum and task-oriented feature

    Biyang Liu, Huimin Yu, and Guodong Qi. Graftnet: Towards domain generalized stereo matching with a broad-spectrum and task-oriented feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13012–13021, 2022

  41. [41]

    Zero-shot event-intensity asymmetric stereo via visual prompting from image domain

    Hanyue Lou, Jinxiu Liang, Minggui Teng, Bin Fan, Yong Xu, and Boxin Shi. Zero-shot event-intensity asymmetric stereo via visual prompting from image domain. Advances in Neural Information Processing Systems, 37:13274–13301, 2024

  42. [42]

    Efficient deep learning for stereo matching

    Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5695–5703, 2016

  43. [43]

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

    Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4040–4048, 2016

  44. [44]

    Bridging the gap between events and frames through unsupervised domain adaptation

    Nico Messikommer, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. Bridging the gap between events and frames through unsupervised domain adaptation. IEEE Robotics and Automation Letters, 7(2):3515–3522, 2022

  45. [45]

    S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

    Junhong Min, Youngpil Jeon, Jimin Kim, and Minyong Choi. S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), page to appear, Honolulu, Hawai’i, 2025. IEEE. Accepted at ICCV 2025, Hawai’i Convention Center, Oct 19–23

  46. [46]

    Learning to reconstruct hdr images from events, with applications to depth and flow prediction

    Mohammad Mostafavi, Lin Wang, and Kuk-Jin Yoon. Learning to reconstruct hdr images from events, with applications to depth and flow prediction. International Journal of Computer Vision, 129(4):900–920, 2021

  47. [47]

    Event-intensity stereo: Estimating depth by the best of both worlds

    Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Event-intensity stereo: Estimating depth by the best of both worlds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4258–4267, 2021

  48. [48]

    Stereo depth from events cameras: Concentrate and focus on the future

    Yeongwoo Nam, Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Stereo depth from events cameras: Concentrate and focus on the future. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6114–6123, 2022

  49. [49]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  50. [50]

    Federated online adaptation for deep stereo

    Matteo Poggi and Fabio Tosi. Federated online adaptation for deep stereo. In CVPR, 2024

  51. [51]

    Continual adaptation for deep stereo

    Matteo Poggi, Alessio Tonioni, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Continual adaptation for deep stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4713–4729, 2021

  52. [52]

    On the synergies between machine learning and binocular stereo for depth estimation from images: a survey

    Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. On the synergies between machine learning and binocular stereo for depth estimation from images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5314–5334, 2021

  53. [53]

    High speed and high dynamic range video with an event camera

    Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

  54. [54]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015

  55. [55]

    A taxonomy and evaluation of dense two-frame stereo correspondence algorithms

    Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47:7–42, 2002

  56. [56]

    Cfnet: Cascade and fused cost volume for robust stereo matching

    Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13906–13915, 2021

  57. [57]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

  58. [58]

    Unsupervised adaptation for deep stereo

    Alessio Tonioni, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Unsupervised adaptation for deep stereo. In Proceedings of the IEEE International Conference on Computer Vision, pages 1605–1613, 2017

  59. [59]

    Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, and Matteo Poggi. Nerf-supervised deep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 855–866, 2023

  60. [60]

    Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Neural disparity refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  61. [61]

    Fabio Tosi, Luca Bartolomei, and Matteo Poggi. A survey on deep stereo matching in the twenties. arXiv preprint arXiv:2407.07816, 2024. Extended version of CVPR 2024 Tutorial "Deep Stereo Matching in the Twenties" (https://sites.google.com/view/stereo-twenties)

  62. [62]

    Stepan Tulyakov, Francois Fleuret, Martin Kiefel, Peter Gehler, and Michael Hirsch. Learning an event sequence embedding for dense event-based deep stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1527–1537, 2019

  63. [63]

    Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  64. [64]

    Yang Wang, Peng Wang, Zhenheng Yang, Chenxu Luo, Yi Yang, and Wei Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  65. [65]

    Ziwei Wang, Liyuan Pan, Yonhon Ng, Zheyu Zhuang, and Robert Mahony. Stereo hybrid event-frame (shef) cameras for 3d perception. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9758–

  66. [66]

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. CVPR, 2025

  67. [67]

    Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12981–12990, 2022

  68. [68]

    Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21919–21928, 2023

  69. [69]

    Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Chunyuan Liao, and Xin Yang. Igev++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv preprint arXiv:2409.00638, 2024

  70. [70]

    Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1959–1968, 2020

  71. [71]

    Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  72. [72]

    Ninghui Xu, Lihui Wang, Jiajia Zhao, and Zhiting Yao. Denoising for dynamic vision sensor based on augmented spatiotemporal correlation. IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4812–4824, 2023

  73. [73]

    Ninghui Xu, Lihui Wang, Zhiting Yao, and Takayuki Okatani. Mets: Motion-encoded time-surface for event-based high-speed pose tracking. International Journal of Computer Vision, 133(7):4401–4419, 2025

  74. [74]

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

  75. [75]

    Xun Yang, Tianyu Chang, Tianzhu Zhang, Shanshan Wang, Richang Hong, and Meng Wang. Learning hierarchical visual transformation for domain generalizable visual matching and recognition. International Journal of Computer Vision, 132(11):4823–4849, 2024

  76. [76]

    Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, and Yunde Jia. Diving into the fusion of monocular priors for generalized stereo matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14887–14897, 2025

  77. [77]

    Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6044–6053, 2019

  78. [78]

    Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1592–1599, 2015

  79. [79]

    Dehao Zhang, Qiankun Ding, Peiqi Duan, Chu Zhou, and Boxin Shi. Data association between event streams and intensity frames under diverse baselines. In European Conference on Computer Vision, pages 72–90. Springer, 2022

  80. [80]

    Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 185–194, 2019

Showing first 80 references.