Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

Fabio Tosi; Jiawei Han; Lihui Wang; Luca Bartolomei; Matteo Poggi; Ninghui Xu; Stefano Mattoccia; Zhiting Yao

arxiv: 2604.15312 · v1 · submitted 2026-04-16 · 💻 cs.CV

Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia

Pith reviewed 2026-05-10 11:14 UTC · model grok-4.3

keywords: event-frame stereo · cross-modal prompting · asymmetric stereo matching · depth estimation · event cameras · bidirectional projection · multimodal fusion

The pith

Bidirectional prompting projects event and frame data into each other's domains to recover cues for accurate stereo matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Bi-CMPStereo, a framework that uses bidirectional cross-modal prompting to handle the differences between event cameras and conventional frame cameras in stereo setups. Event cameras provide high-speed, blur-free data but miss texture, while frames supply rich context yet fail under fast motion or poor lighting. By learning representations in a shared canonical space and projecting each input into both modalities, the method integrates semantic and structural details that the modality gap otherwise discards. If correct, this yields stereo depth maps that remain reliable in dynamic scenes where single-modality approaches degrade. The authors report stronger accuracy and generalization than prior stereo methods on relevant benchmarks.

Core claim

Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains.

What carries the argument

Bidirectional cross-modal prompting that projects each modality into both event and frame domains to recover and fuse domain-specific cues.

If this is right

Stereo matching accuracy increases because complementary texture from frames and timing from events are both retained.
Representations become more robust to motion blur and illumination changes.
The same model generalizes better across different camera setups and scene speeds.
No extra hardware synchronization is required beyond the two modalities already present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The prompting idea could transfer to other pairs of sensors that differ in temporal resolution, such as lidar and camera fusion.
In robotics, this might allow lighter rigs that still produce reliable depth during high-speed maneuvers.
A direct test would measure whether removing the reverse projection step drops accuracy by a measurable margin on the same data.
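
One way such an ablation could be scored is a paired comparison on the same scenes. The sketch below is only an illustration under stated assumptions: the function name `paired_bootstrap` is hypothetical, the per-scene error values are made-up placeholders rather than results from the paper, and a percentile bootstrap is just one reasonable choice for judging whether the margin is measurable.

```python
import random
from statistics import mean


def paired_bootstrap(full_err, ablated_err, n_resamples=2000, seed=0):
    """Mean per-scene error increase (ablated minus full) with a 95% percentile bootstrap interval."""
    # Pair errors scene by scene so scene difficulty cancels out.
    diffs = [a - f for f, a in zip(full_err, ablated_err)]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return mean(diffs), lo, hi


# Placeholder per-scene MAE values (hypothetical, NOT taken from the paper).
full_model = [0.50, 0.62, 0.48, 0.71, 0.55]
no_reverse_projection = [0.58, 0.66, 0.55, 0.80, 0.57]

mean_diff, ci_lo, ci_hi = paired_bootstrap(full_model, no_reverse_projection)
print(f"mean error increase: {mean_diff:.3f} (95% CI {ci_lo:.3f} to {ci_hi:.3f})")
```

If the interval excludes zero, the accuracy drop from removing the reverse projection is unlikely to be resampling noise on that evaluation set; with only a handful of scenes the interval will be wide.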

Load-bearing premise

The gap between event and frame data marginalizes useful cues, and bidirectional prompting can recover those cues without creating new alignment mistakes.

What would settle it

Performance on a held-out dataset of fast-motion, low-light scenes falls to or below current state-of-the-art stereo methods, or measured alignment error rises after prompting.

Figures

Figures reproduced from arXiv: 2604.15312 by Fabio Tosi, Jiawei Han, Lihui Wang, Luca Bartolomei, Matteo Poggi, Ninghui Xu, Stefano Mattoccia, Zhiting Yao.

**Figure 1.** Qualitative comparison on event–frame asymmetric stereo. Compared to the state-of-the-art ZEST [41], our method achieves higher accuracy and higher-quality structural details in both complex-texture (first row) and sparse low-light scenes (second row).

**Figure 2.** Overview of the proposed CMPStereo network. In our bidirectional framework, the asymmetric stereo inputs are alternately designated as the target-domain and source-domain modalities. The Cross-Domain Embedding Adapter (CDEA) operates on the source domain to achieve initial source-to-target adaptation. CMPStereo employs domain-specific encoders Ft(·) and Fs(·) combined with Stereo Canonicalization Constrain…

**Figure 3.** Overview of the Bi-CMPStereo framework. Bi-CMPStereo integrates complementary representations from imgCMPStereo and evCMPStereo in the image and event domains for reliable disparity estimation. The two pre-trained CMPStereo networks are frozen as stereo feature extractors to construct multi-scale cost volumes, which are concatenated and fused via a 3D hourglass network at 1/4 scale, followed by the same ca…
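
The fusion step named in the Figure 3 caption can be pictured with a toy example. The sketch below is not the authors' implementation: it builds a plain correlation cost volume on a single scanline for two hypothetical feature branches (stand-ins for the image-domain and event-domain extractors) and stacks the two volumes along a leading axis, which plays the role of the channel concatenation that precedes the 3D hourglass. The function names, the feature values, and the normalized dot-product cost are all assumptions made for illustration.

```python
from typing import List

Scanline = List[List[float]]  # one image row: width x channels


def correlation_volume(left: Scanline, right: Scanline, max_disp: int) -> List[List[float]]:
    """Return cost[d][x] = dot(left[x], right[x - d]) / channels, or 0.0 where x - d < 0."""
    width = len(left)
    channels = len(left[0])
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d < 0:
                row.append(0.0)  # no valid match outside the right image
            else:
                dot = sum(l * r for l, r in zip(left[x], right[x - d]))
                row.append(dot / channels)
        volume.append(row)
    return volume


def concat_volumes(vol_a, vol_b):
    """Stack two [D][W] volumes along a new leading branch axis, giving [2][D][W]."""
    assert len(vol_a) == len(vol_b) and len(vol_a[0]) == len(vol_b[0])
    return [vol_a, vol_b]


# Hypothetical features; the right view is the left view shifted by one pixel.
img_left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
img_right = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
ev_left = [[0.5], [1.0], [2.0]]
ev_right = [[1.0], [2.0], [0.0]]

fused = concat_volumes(
    correlation_volume(img_left, img_right, max_disp=2),
    correlation_volume(ev_left, ev_right, max_disp=2),
)
print(len(fused), len(fused[0]), len(fused[0][0]))  # branches, disparities, width
```

In both toy branches the last pixel scores highest at disparity 1, matching the one-pixel shift. A learned aggregation network would then operate on the stacked volume instead of taking a per-branch maximum.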

**Figure 4.** Qualitative comparison with state-of-the-art methods on a low-light scene from the DSEC dataset [20]. The mean absolute error (MAE) is shown at the bottom right of each estimated disparity map. As can be seen in the highlighted regions, our method more accurately reconstructs the fine structural details of the cars.
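
For readers unfamiliar with the MAE figure quoted on each disparity map, here is a minimal sketch of how such a number is typically computed. It assumes ground truth is missing at some pixels and encodes those as `None`, which is a hypothetical convention chosen for this example; the paper's exact evaluation protocol is not described on this page.

```python
from typing import List, Optional


def disparity_mae(pred: List[List[float]], gt: List[List[Optional[float]]]) -> float:
    """Mean absolute disparity error over pixels that have ground truth."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g is None:  # skip pixels without ground truth
                continue
            total += abs(p - g)
            count += 1
    if count == 0:
        raise ValueError("no valid ground-truth pixels")
    return total / count


# Toy 2x2 disparity maps (made-up values, in pixels).
pred = [[10.0, 12.0], [8.0, 7.5]]
gt = [[11.0, None], [8.0, 6.5]]
mae = disparity_mae(pred, gt)
print(f"MAE = {mae:.4f} px")
```

Averaging only over valid pixels keeps regions without ground truth from being counted as zero error.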

**Figure 5.** Generalization result. Qualitative comparison of cross-dataset generalization on the MVSEC [87] dataset.

**Figure 6.** Visual comparison between Bi-CMPStereo with and without HVT. Generalization results on MVSEC [87].

**Figure 7.** Additional qualitative results in a nighttime scenario from the DSEC dataset [20]. The mean absolute error (MAE) is shown at the bottom right of each estimated disparity map.

**Figure 8.** Additional qualitative results in a daytime scenario from the DSEC dataset [20].

Original abstract

Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Bi-CMPStereo uses bidirectional prompting to align event and frame modalities for stereo, but the projection mechanism needs stronger evidence that it avoids diluting sparse cues.

The main thing here is Bi-CMPStereo, a framework that projects event data into frame space and frame data into event space to build aligned stereo representations in a shared canonical space. This targets the practical problem of combining dense contextual info from frames with the high-speed, high-dynamic-range signals from events, especially in motion-heavy or low-light settings where standard matching falls short. The bidirectional step is positioned as a way to recover cues that usually get lost when one modality dominates. That framing is straightforward and relevant for robotics or vehicle perception tasks.

The paper does a decent job spelling out the motivation and claiming the method outperforms priors in accuracy and generalization through experiments. If the full results include proper baselines, ablations, and tests on diverse conditions, this could be a useful incremental advance in cross-modal stereo.

The soft spot is the unelaborated projection step itself. The stress-test point holds: without details on how prompting handles sparsity, temporal structure, or cycle consistency, it's unclear whether event advantages survive the trip to dense frame space or if new alignment artifacts appear. The abstract's outperformance claim would be more credible with explicit numbers, error analysis, and checks that the bidirectional integration actually adds value rather than just averaging features.

This is for people already working on event-based or multi-modal 3D vision who need concrete fusion ideas. It shows honest engagement with the modality gap and has enough of a new mechanism to deserve peer review, though it will probably need revisions to document the prompting implementation and back the gains with reproducible evidence.

Referee Report

2 major / 1 minor

Summary. The paper introduces Bi-CMPStereo, a bidirectional cross-modal prompting framework for event-frame asymmetric stereo matching. It claims to address the modality gap by learning finely aligned stereo representations within a target canonical space, integrating complementary semantic and structural features via bidirectional projection of each modality into both event and frame domains, and demonstrates significant outperformance over state-of-the-art methods in accuracy and generalization through extensive experiments.

Significance. If the empirical claims hold with proper validation, the framework could advance reliable 3D perception for dynamic scenes and challenging illumination by better preserving domain-specific cues that standard cross-modal matching tends to marginalize.

major comments (2)

[Abstract] Abstract: The assertion of significant outperformance over SOTA methods in accuracy and generalization is presented without any quantitative results, baselines, ablation studies, or error analysis, preventing evaluation of the data-to-claim link.
[Method] Method section: The bidirectional cross-modal prompting is described at a high level without specifying the projection mechanism (e.g., learned adapters, temporal aggregation for sparse events, or cycle-consistency losses), which is load-bearing for the claim that domain-specific cues are recovered without introducing new alignment errors or information loss.

minor comments (1)

The 'canonical space' is referenced without a formal definition, diagram, or equation clarifying the projection operators and alignment process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below. Revisions will be made to address the concerns raised.

Referee: [Abstract] Abstract: The assertion of significant outperformance over SOTA methods in accuracy and generalization is presented without any quantitative results, baselines, ablation studies, or error analysis, preventing evaluation of the data-to-claim link.

Authors: We agree that the abstract would benefit from including quantitative highlights to strengthen the link between claims and evidence. In the revised manuscript, we will update the abstract to incorporate key performance metrics demonstrating outperformance (e.g., accuracy and generalization improvements on standard benchmarks). The full details on baselines, ablation studies, and error analysis are provided in the Experiments section, but we will ensure the abstract offers a clearer summary of these results.

Revision: yes

Referee: [Method] Method section: The bidirectional cross-modal prompting is described at a high level without specifying the projection mechanism (e.g., learned adapters, temporal aggregation for sparse events, or cycle-consistency losses), which is load-bearing for the claim that domain-specific cues are recovered without introducing new alignment errors or information loss.

Authors: We appreciate the referee highlighting this point and acknowledge that the current Method section presents the bidirectional cross-modal prompting at a high level. In the revised version, we will expand this section to explicitly specify the projection mechanisms, including details on learned adapters for domain mapping, temporal aggregation approaches for handling sparse event data, and the incorporation of cycle-consistency losses. These additions, along with supporting equations and illustrations, will clarify how domain-specific cues are preserved without introducing alignment errors or information loss.

Revision: yes

Circularity Check

0 steps flagged

No circularity: architecture proposal with no derivations or fitted predictions

The paper introduces Bi-CMPStereo as a bidirectional cross-modal prompting framework for event-frame stereo matching. The abstract and available description contain no equations, no parameter fitting presented as prediction, no uniqueness theorems, and no self-citations that bear the central claim. The approach is defined by its architectural choices (canonical space alignment and bidirectional projection), which are independent of any input data or prior results by construction. Empirical outperformance is asserted via experiments rather than derived from the inputs themselves. This is a standard non-circular method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit parameters, axioms, or invented entities; assessment is limited to surface claims.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 1 internal anchor

[1]

Deep event stereo leveraged by event-to- image translation

Soikat Hasan Ahmed, Hae Woong Jang, SM Nadim Uddin, and Yong Ju Jung. Deep event stereo leveraged by event-to- image translation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 882–890, 2021

work page 2021
[2]

Neural disparity refinement for arbitrary resolution stereo

Filippo Aleotti, Fabio Tosi, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Ste- 10 fano. Neural disparity refinement for arbitrary resolution stereo. In2021 International Conference on 3D Vision (3DV), pages 207–217. IEEE, 2021

work page 2021
[3]

Lidar-event stereo fusion with hallucinations

Luca Bartolomei, Matteo Poggi, Andrea Conti, and Stefano Mattoccia. Lidar-event stereo fusion with hallucinations. In European Conference on Computer Vision, pages 125–145. Springer, 2024

work page 2024
[4]

Depth anyevent: A cross- modal distillation paradigm for event-based monocular depth estimation.arXiv preprint arXiv:2509.15224, 2025

Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Depth anyevent: A cross- modal distillation paradigm for event-based monocular depth estimation.arXiv preprint arXiv:2509.15224, 2025

work page arXiv 2025
[5]

Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail

Luca Bartolomei, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Stereo anywhere: Robust zero-shot deep stereo matching even where either stereo or mono fail. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[6]

M3ed: Multi-robot, multi-sensor, multi-environment event dataset

Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4016–4023, 2023

work page 2023
[7]

Pyramid stereo matching network

Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5410–5418, 2018

work page 2018
[8]

Domain generalized stereo matching via hierarchical visual transformation

Tianyu Chang, Xun Yang, Tianzhu Zhang, and Meng Wang. Domain generalized stereo matching via hierarchical visual transformation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9559–9568, 2023

work page 2023
[9]

Depth from asymmetric frame-event stereo: A divide-and-conquer approach

Xihao Chen, Wenming Weng, Yueyi Zhang, and Zhi- wei Xiong. Depth from asymmetric frame-event stereo: A divide-and-conquer approach. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3045–3054, 2024

work page 2024
[10]

Mocha-stereo: Motif chan- nel attention network for stereo matching

Ziyang Chen, Wei Long, He Yao, Yongjun Zhang, Bingshu Wang, Yongbin Qin, and Jia Wu. Mocha-stereo: Motif chan- nel attention network for stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[11]

Monster: Marry monodepth to stereo unleashes power

Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, and Xin Yang. Monster: Marry monodepth to stereo unleashes power. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[12]

Event-image fusion stereo using cross-modality feature propagation

Hoonhee Cho and Kuk-Jin Yoon. Event-image fusion stereo using cross-modality feature propagation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 454– 462, 2022

work page 2022
[13]

Selection and cross simi- larity for event-image deep stereo

Hoonhee Cho and Kuk-Jin Yoon. Selection and cross simi- larity for event-image deep stereo. InEuropean Conference on Computer Vision, pages 470–486. Springer, 2022

work page 2022
[14]

Learning adaptive dense event stereo from the image domain

Hoonhee Cho, Jegyeong Cho, and Kuk-Jin Yoon. Learning adaptive dense event stereo from the image domain. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17797–17807, 2023

work page 2023
[15]

Non-coaxial event-guided motion deblurring with spa- tial alignment

Hoonhee Cho, Yuhwan Jeong, Taewoo Kim, and Kuk-Jin Yoon. Non-coaxial event-guided motion deblurring with spa- tial alignment. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 12492–12503, 2023

work page 2023
[16]

Tempo- ral event stereo via joint learning with stereoscopic flow

Hoonhee Cho, Jae-Young Kang, and Kuk-Jin Yoon. Tempo- ral event stereo via joint learning with stereoscopic flow. In European Conference on Computer Vision, pages 294–314. Springer, 2024

work page 2024
[17]

Itsa: An information-theoretic approach to automatic shortcut avoid- ance and domain generalization in stereo matching net- works

WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Alireza Bab-Hadiashar, and David Suter. Itsa: An information-theoretic approach to automatic shortcut avoid- ance and domain generalization in stereo matching net- works. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13022–13032, 2022

work page 2022
[18]

Video frame interpolation with stereo event and intensity cameras.IEEE Transactions on Multimedia, 26: 9187–9202, 2024

Chao Ding, Mingyuan Lin, Haijian Zhang, Jianzhuang Liu, and Lei Yu. Video frame interpolation with stereo event and intensity cameras.IEEE Transactions on Multimedia, 26: 9187–9202, 2024

work page 2024
[19]

Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

work page 2020
[20]

Dsec: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021

work page 2021
[21]

Two-stage cross- fusion network for stereo event-based depth estimation.Ex- pert Systems with Applications, 241:122743, 2024

Dipon Kumar Ghosh and Yong Ju Jung. Two-stage cross- fusion network for stereo event-based depth estimation.Ex- pert Systems with Applications, 241:122743, 2024

work page 2024
[22]

Event-based stereo depth estimation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Suman Ghosh and Guillermo Gallego. Event-based stereo depth estimation: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[23]

Unsupervised monocular depth estimation with left- right consistency

Cl ´ement Godard, Oisin Mac Aodha, and Gabriel J Bros- tow. Unsupervised monocular depth estimation with left- right consistency. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 270–279, 2017

work page 2017
[24]

arXiv preprint arXiv:2507.22052 (2025) 4, 11

Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, and Matteo Poggi. Ov3r: Open- vocabulary semantic 3d reconstruction from rgb videos. arXiv preprint arXiv:2507.22052, 2025

work page arXiv 2025
[25]

Bridgedepth: Bridging monocular and stereo reasoning with latent alignment

Tongfan Guan, Jiaxin Guo, Chen Wang, and Yun-Hui Liu. Bridgedepth: Bridging monocular and stereo reasoning with latent alignment. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV), Honolulu, Hawaii, USA, 2025. ICCV 2025 Highlight

work page 2025
[26]

Context-enhanced stereo transformer

Weiyu Guo, Zhaoshuo Li, Yongkui Yang, Zheng Wang, Rus- sell H Taylor, Mathias Unberath, Alan Yuille, and Yingwei Li. Context-enhanced stereo transformer. InEuropean Con- ference on Computer Vision, pages 263–279. Springer, 2022

work page 2022
[27]

Group-wise correlation stereo network

Xiaoyang Guo, Kai Yang, Wukui Yang, Xiaogang Wang, and Hongsheng Li. Group-wise correlation stereo network. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 3273–3282, 2019. 11

work page 2019
[28]

Defom-stereo: Depth foundation model based stereo matching

Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, and Rui Huang. Defom-stereo: Depth foundation model based stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[29]

End-to-end learning of geometry and context for deep stereo regression

Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. InProceedings of the IEEE international confer- ence on computer vision, pages 66–75, 2017

work page 2017
[30]

Real- time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance.IEEE Robotics and Automation Letters, 8(1):416–423, 2022

Haram Kim, Sangil Lee, Junha Kim, and H Jin Kim. Real- time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance.IEEE Robotics and Automation Letters, 8(1):416–423, 2022

work page 2022
[31]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[32]

A survey on deep learning tech- niques for stereo-based depth estimation.IEEE transactions on pattern analysis and machine intelligence, 44(4):1738– 1764, 2020

Hamid Laga, Laurent Valentin Jospin, Farid Boussaid, and Mohammed Bennamoun. A survey on deep learning tech- niques for stereo-based depth estimation.IEEE transactions on pattern analysis and machine intelligence, 44(4):1738– 1764, 2020

work page 2020
[33]

Practical stereo matching via cascaded re- current network with adaptive correlation

Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Zi- wei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. Practical stereo matching via cascaded re- current network with adaptive correlation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16263–16272, 2022

work page 2022
[34]

Active event-based stereo vision

Jianing Li, Yunjian Zhang, Haiqian Han, and Xiangyang Ji. Active event-based stereo vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 971–981, 2025

work page 2025
[35]

Creighton, Russell H

Zhaoshuo Li, Xingtong Liu, Nathan Drenkow, Andy Ding, Francis X. Creighton, Russell H. Taylor, and Mathias Un- berath. Revisiting stereo depth estimation from a sequence- to-sequence perspective with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 6197–6206, 2021

work page 2021
[36]

Learn- ing for disparity estimation through feature constancy

Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learn- ing for disparity estimation through feature constancy. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2811–2820, 2018

work page 2018
[37]

A 128×128 120 db 15µs latency asynchronous temporal con- trast vision sensor.IEEE journal of solid-state circuits, 43 (2):566–576, 2008

Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 db 15µs latency asynchronous temporal con- trast vision sensor.IEEE journal of solid-state circuits, 43 (2):566–576, 2008

work page 2008
[38]

Learn- ing parallax for stereo event-based motion deblurring.IEEE Transactions on Circuits and Systems for Video Technology, 2025

Mingyuan Lin, Chi Zhang, Chu He, and Lei Yu. Learn- ing parallax for stereo event-based motion deblurring.IEEE Transactions on Circuits and Systems for Video Technology, 2025

work page 2025
[39]

Raft-stereo: Multilevel recurrent field transforms for stereo matching

Lahav Lipson, Zachary Teed, and Jia Deng. Raft-stereo: Multilevel recurrent field transforms for stereo matching. In 2021 International Conference on 3D Vision (3DV), pages 218–227. IEEE, 2021

work page 2021
[40]

Graftnet: Towards domain generalized stereo matching with a broad-spectrum and task-oriented feature

Biyang Liu, Huimin Yu, and Guodong Qi. Graftnet: Towards domain generalized stereo matching with a broad-spectrum and task-oriented feature. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13012–13021, 2022

work page 2022
[41]

Zero-shot event-intensity asymmetric stereo via visual prompting from image domain.Advances in Neural Information Processing Systems, 37:13274–13301, 2024

Hanyue Lou, Jinxiu Liang, Minggui Teng, Bin Fan, Yong Xu, and Boxin Shi. Zero-shot event-intensity asymmetric stereo via visual prompting from image domain.Advances in Neural Information Processing Systems, 37:13274–13301, 2024

work page 2024
[42]

Ef- ficient deep learning for stereo matching

Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Ef- ficient deep learning for stereo matching. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 5695–5703, 2016

work page 2016
[43]

A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 4040–4048, 2016

work page 2016
[44]

Bridging the gap between events and frames through unsupervised domain adaptation.IEEE Robotics and Automation Letters, 7(2):3515–3522, 2022

Nico Messikommer, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. Bridging the gap between events and frames through unsupervised domain adaptation.IEEE Robotics and Automation Letters, 7(2):3515–3522, 2022

work page 2022
[45]

S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min, Youngpil Jeon, Jimin Kim, and Minyong Choi. S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), page to ap- pear, Honolulu, Hawai’i, 2025. IEEE. Accepted at ICCV 2025, Hawai’i Convention Center, Oct 19–23

work page 2025
[46]

Learn- ing to reconstruct hdr images from events, with applications to depth and flow prediction.International Journal of Com- puter Vision, 129(4):900–920, 2021

Mohammad Mostafavi, Lin Wang, and Kuk-Jin Yoon. Learn- ing to reconstruct hdr images from events, with applications to depth and flow prediction.International Journal of Com- puter Vision, 129(4):900–920, 2021

work page 2021
[47]

Event-intensity stereo: Estimating depth by the best of both worlds

Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Event-intensity stereo: Estimating depth by the best of both worlds. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4258–4267, 2021

work page 2021
[48]

Stereo depth from events cameras: Concentrate and focus on the future

Yeongwoo Nam, Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Stereo depth from events cameras: Concentrate and focus on the future. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6114–6123, 2022

work page 2022
[49]

Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

work page 2019
[50]

Federated online adaptation for deep stereo

Matteo Poggi and Fabio Tosi. Federated online adaptation for deep stereo. InCVPR, 2024

work page 2024
[51] Matteo Poggi, Alessio Tonioni, Fabio Tosi, Stefano Mattoccia, and Luigi Di Stefano. Continual adaptation for deep stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4713–4729, 2021.
[52] Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. On the synergies between machine learning and binocular stereo for depth estimation from images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5314–5334, 2021.
[53] Henri Rebecq, René Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera. IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019.
[54] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[55] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47:7–42, 2002.
[56] Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade and fused cost volume for robust stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13906–13915, 2021.
[57] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020.
[58] Alessio Tonioni, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Unsupervised adaptation for deep stereo. In Proceedings of the IEEE International Conference on Computer Vision, pages 1605–1613, 2017.
[59] Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, and Matteo Poggi. Nerf-supervised deep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 855–866, 2023.
[60] Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Neural disparity refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[61] Fabio Tosi, Luca Bartolomei, and Matteo Poggi. A survey on deep stereo matching in the twenties. arXiv preprint arXiv:2407.07816, 2024. Extended version of CVPR 2024 Tutorial “Deep Stereo Matching in the Twenties” (https://sites.google.com/view/stereo-twenties).
[62] Stepan Tulyakov, Francois Fleuret, Martin Kiefel, Peter Gehler, and Michael Hirsch. Learning an event sequence embedding for dense event-based deep stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1527–1537, 2019.
[63] Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[64] Yang Wang, Peng Wang, Zhenheng Yang, Chenxu Luo, Yi Yang, and Wei Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[65] Ziwei Wang, Liyuan Pan, Yonhon Ng, Zheyu Zhuang, and Robert Mahony. Stereo hybrid event-frame (shef) cameras for 3d perception. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9758–
[66] Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. CVPR, 2025.
[67] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12981–12990, 2022.
[68] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21919–21928, 2023.
[69] Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Chunyuan Liao, and Xin Yang. Igev++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv preprint arXiv:2409.00638, 2024.
[70] Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1959–1968, 2020.
[71] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[72] Ninghui Xu, Lihui Wang, Jiajia Zhao, and Zhiting Yao. Denoising for dynamic vision sensor based on augmented spatiotemporal correlation. IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4812–4824, 2023.
[73] Ninghui Xu, Lihui Wang, Zhiting Yao, and Takayuki Okatani. Mets: Motion-encoded time-surface for event-based high-speed pose tracking. International Journal of Computer Vision, 133(7):4401–4419, 2025.
[74] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
[75] Xun Yang, Tianyu Chang, Tianzhu Zhang, Shanshan Wang, Richang Hong, and Meng Wang. Learning hierarchical visual transformation for domain generalizable visual matching and recognition. International Journal of Computer Vision, 132(11):4823–4849, 2024.
[76] Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, and Yunde Jia. Diving into the fusion of monocular priors for generalized stereo matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14887–14897, 2025.
[77] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6044–6053, 2019.
[78] Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1592–1599, 2015.
[79] Dehao Zhang, Qiankun Ding, Peiqi Duan, Chu Zhou, and Boxin Shi. Data association between event streams and intensity frames under diverse baselines. In European Conference on Computer Vision, pages 72–90. Springer, 2022.
[80] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 185–194, 2019.



[45] [45]

S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min, Youngpil Jeon, Jimin Kim, and Minyong Choi. S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), page to ap- pear, Honolulu, Hawai’i, 2025. IEEE. Accepted at ICCV 2025, Hawai’i Convention Center, Oct 19–23

work page 2025

[46] [46]

Learn- ing to reconstruct hdr images from events, with applications to depth and flow prediction.International Journal of Com- puter Vision, 129(4):900–920, 2021

Mohammad Mostafavi, Lin Wang, and Kuk-Jin Yoon. Learn- ing to reconstruct hdr images from events, with applications to depth and flow prediction.International Journal of Com- puter Vision, 129(4):900–920, 2021

work page 2021

[47] [47]

Event-intensity stereo: Estimating depth by the best of both worlds

Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Event-intensity stereo: Estimating depth by the best of both worlds. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 4258–4267, 2021

work page 2021

[48] [48]

Stereo depth from events cameras: Concentrate and focus on the future

Yeongwoo Nam, Mohammad Mostafavi, Kuk-Jin Yoon, and Jonghyun Choi. Stereo depth from events cameras: Concentrate and focus on the future. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6114–6123, 2022

work page 2022

[49] [49]

Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An im- perative style, high-performance deep learning library.Ad- vances in neural information processing systems, 32, 2019

work page 2019

[50] [50]

Federated online adaptation for deep stereo

Matteo Poggi and Fabio Tosi. Federated online adaptation for deep stereo. InCVPR, 2024

work page 2024

[51] [51]

Continual adaptation for deep stereo.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4713–4729, 2021

Matteo Poggi, Alessio Tonioni, Fabio Tosi, Stefano Mattoc- cia, and Luigi Di Stefano. Continual adaptation for deep stereo.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4713–4729, 2021

work page 2021

[52] [52]

Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, and Stefano Mattoccia. On the synergies between machine learning and binocular stereo for depth estimation from images: a survey.IEEE Transactions on Pattern Anal- ysis and Machine Intelligence, 44(9):5314–5334, 2021

work page 2021

[53] [53]

High speed and high dynamic range video with 12 an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with 12 an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

work page 1964

[54] [54]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical image com- puting and computer-assisted intervention, pages 234–241. Springer, 2015

work page 2015

[55] [55]

A taxonomy and evaluation of dense two-frame stereo correspondence algo- rithms.International journal of computer vision, 47:7–42, 2002

Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algo- rithms.International journal of computer vision, 47:7–42, 2002

work page 2002

[56] [56]

Cfnet: Cascade and fused cost volume for robust stereo matching

Zhelun Shen, Yuchao Dai, and Zhibo Rao. Cfnet: Cascade and fused cost volume for robust stereo matching. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13906–13915, 2021

work page 2021

[57] [57]

Raft: Recurrent all-pairs field transforms for optical flow

Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on com- puter vision, pages 402–419. Springer, 2020

work page 2020

[58] [58]

Unsupervised adaptation for deep stereo

Alessio Tonioni, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. Unsupervised adaptation for deep stereo. InPro- ceedings of the IEEE International Conference on Computer Vision, pages 1605–1613, 2017

work page 2017

[59] [59]

Nerf-supervised deep stereo

Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, and Mat- teo Poggi. Nerf-supervised deep stereo. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 855–866, 2023

work page 2023

[60] [60]

Neural disparity refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Ste- fano. Neural disparity refinement.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024

[61] [61]

A sur- vey on deep stereo matching in the twenties.arXiv preprint arXiv:2407.07816, 2024

Fabio Tosi, Luca Bartolomei, and Matteo Poggi. A sur- vey on deep stereo matching in the twenties.arXiv preprint arXiv:2407.07816, 2024. Extended version of CVPR 2024 Tutorial ”Deep Stereo Matching in the Twen- ties” (https://sites.google.com/view/stereo-twenties)

work page arXiv 2024

[62] [62]

Learning an event sequence em- bedding for dense event-based deep stereo

Stepan Tulyakov, Francois Fleuret, Martin Kiefel, Peter Gehler, and Michael Hirsch. Learning an event sequence em- bedding for dense event-based deep stereo. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 1527–1537, 2019

work page 2019

[63] [63]

Selective-stereo: Adaptive frequency information selection for stereo matching

Xianqi Wang, Gangwei Xu, Hao Jia, and Xin Yang. Selective-stereo: Adaptive frequency information selection for stereo matching. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2024

work page 2024

[64] [64]

Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos

Yang Wang, Peng Wang, Zhenheng Yang, Chenxu Luo, Yi Yang, and Wei Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

work page 2019

[65] [65]

Stereo hybrid event-frame (shef) cameras for 3d perception

Ziwei Wang, Liyuan Pan, Yonhon Ng, Zheyu Zhuang, and Robert Mahony. Stereo hybrid event-frame (shef) cameras for 3d perception. In2021 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), pages 9758–

work page

[66] Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. CVPR, 2025

[67] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12981–12990, 2022

[68] Gangwei Xu, Xianqi Wang, Xiaohuan Ding, and Xin Yang. Iterative geometry encoding volume for stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21919–21928, 2023

[69] Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Junda Cheng, Chunyuan Liao, and Xin Yang. Igev++: Iterative multi-range geometry encoding volumes for stereo matching. arXiv preprint arXiv:2409.00638, 2024

[70] Haofei Xu and Juyong Zhang. Aanet: Adaptive aggregation network for efficient stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1959–1968, 2020

[71] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

[72] Ninghui Xu, Lihui Wang, Jiajia Zhao, and Zhiting Yao. Denoising for dynamic vision sensor based on augmented spatiotemporal correlation. IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4812–4824, 2023

[73] Ninghui Xu, Lihui Wang, Zhiting Yao, and Takayuki Okatani. Mets: Motion-encoded time-surface for event-based high-speed pose tracking. International Journal of Computer Vision, 133(7):4401–4419, 2025

[74] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024

[75] Xun Yang, Tianyu Chang, Tianzhu Zhang, Shanshan Wang, Richang Hong, and Meng Wang. Learning hierarchical visual transformation for domain generalizable visual matching and recognition. International Journal of Computer Vision, 132(11):4823–4849, 2024

[76] Chengtang Yao, Lidong Yu, Zhidan Liu, Jiaxi Zeng, Yuwei Wu, and Yunde Jia. Diving into the fusion of monocular priors for generalized stereo matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14887–14897, 2025

[77] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6044–6053, 2019

[78] Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1592–1599, 2015

[79] Dehao Zhang, Qiankun Ding, Peiqi Duan, Chu Zhou, and Boxin Shi. Data association between event streams and intensity frames under diverse baselines. In European Conference on Computer Vision, pages 72–90. Springer, 2022

[80] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip HS Torr. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 185–194, 2019