Learning Parallax for Stereo Event-based Motion Deblurring

Chi Zhang; Chu He; Lei Yu; Mingyuan Lin

arxiv: 2309.09513 · v1 · submitted 2023-09-18 · 💻 cs.CV

Mingyuan Lin, Chi Zhang, Chu He, Lei Yu

Pith reviewed 2026-05-24 07:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords event-based vision · motion deblurring · stereo matching · parallax learning · image reconstruction · multi-modal fusion

The pith

St-EDNet recovers sharp images from misaligned blurry photos and event streams by learning coarse-to-fine parallax alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents St-EDNet, a coarse-to-fine framework that recovers sequences of sharp images directly from a single blurry intensity image paired with concurrent but misaligned event streams. It begins with a cross-modal stereo matching module that performs coarse spatial alignment without any ground-truth depth data. A dual-feature embedding architecture then refines the bidirectional association between the coarsely aligned inputs and reconstructs the latent sharp frames. The authors also introduce the StEIC dataset of real stereo event and intensity captures with dense disparity maps. Experiments show the network outperforms prior methods on real-world misaligned data.

Core claim

The central claim is that high-quality sharp image sequences can be recovered from misaligned blurry images and concurrent event streams by first applying cross-modal stereo matching for coarse alignment without ground-truth depths, followed by dual-feature embedding to build fine bidirectional associations and perform reconstruction.

What carries the argument

St-EDNet framework, which uses a cross-modal stereo matching module for coarse spatial alignment of blurry images and events, plus a dual-feature embedding architecture for fine association and sharp image sequence reconstruction.
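
As a rough illustration of what coarse spatial alignment by disparity means, the sketch below resamples one view into the other along image rows. It is not the paper's DispNet: it assumes rectified views (purely horizontal parallax) and a disparity map that is already given, and the function name `warp_by_disparity` is hypothetical.

```python
import numpy as np

def warp_by_disparity(src, disparity):
    """Resample `src` so that out[y, x] = src[y, x - disparity[y, x]].

    Assumes rectified views, so parallax is purely horizontal (an assumption
    made for this sketch). Returns the warped image and a validity mask;
    samples that fall outside `src` are set to 0 and marked invalid.
    """
    h, w = src.shape
    # Sub-pixel source column for every target pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    valid = (xs >= 0) & (xs <= w - 1)
    # Left neighbour for linear interpolation, clipped to stay in bounds.
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    out = (1.0 - frac) * src[rows, x0] + frac * src[rows, x0 + 1]
    return np.where(valid, out, 0.0), valid
```

The validity mask matters in practice: pixels with no counterpart in the other view (occlusions, image borders) cannot be aligned by warping, which is one reason a later fine-association stage is plausible.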

If this is right

Deblurring becomes possible with real-world inputs that lack perfect pixel-wise alignment between intensity images and events.
A single blurry image plus concurrent events suffice as input, without additional aligned data.
The new StEIC dataset supplies real stereo events, intensity images, and dense disparity maps for training and benchmarking.
The approach produces a sequence of latent sharp images rather than a single output frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The stereo matching step could be adapted to other multi-modal pairs that suffer from spatial misalignment, such as event-LiDAR fusion.
Extending the coarse-to-fine pipeline to handle varying event densities might improve robustness in low-light or high-speed scenes.
Controlled synthetic experiments that vary the degree of initial misalignment could isolate the contribution of the stereo module.
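
The last suggestion above could start from something as simple as the following sketch, which injects a known horizontal offset into an event stream. The `(x, y, t, polarity)` column layout and the name `shift_events` are assumptions for illustration; a uniform shift only mimics parallax for a fronto-parallel scene at a single depth, so it is a lower bound on how hard real misalignment is.

```python
import numpy as np

def shift_events(events, dx, width):
    """Shift event x-coordinates by `dx` pixels and drop events leaving the sensor.

    `events` is an (N, 4) array with columns (x, y, t, polarity); this layout
    is assumed for the sketch, not taken from the paper.
    """
    shifted = events.astype(np.float64, copy=True)  # leave the input untouched
    shifted[:, 0] += dx
    keep = (shifted[:, 0] >= 0) & (shifted[:, 0] <= width - 1)
    return shifted[keep]
```

Sweeping `dx` over a range of values and feeding each shifted stream to a deblurring pipeline with and without the stereo module would give the kind of controlled comparison described above.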

Load-bearing premise

Cross-modal stereo matching can produce sufficient coarse spatial alignment between the blurry image and event streams without ground-truth depths.

What would settle it

Real-world test sequences where the initial misalignment exceeds what the stereo matching module can correct, resulting in reconstruction quality no better than methods that assume perfect alignment.
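
Comparing reconstruction quality across methods needs a concrete metric. PSNR is one common choice (the simulated rebuttal further down names PSNR/SSIM as what the paper reports); the sketch below is a generic definition, with the `peak` default assuming images scaled to [0, 1].

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    `peak` is the maximum possible pixel value (1.0 assumes [0, 1] images).
    """
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```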

Figures

Figures reproduced from arXiv: 2309.09513 by Chi Zhang, Chu He, Lei Yu, Mingyuan Lin.

**Figure 2.** The overall structure of the proposed St-EDNet, which consists of two modules: DispNet and DblrNet. DispNet estimates …

**Figure 3.** Overview of the DispNet …

**Figure 5.** Intermediate disparities predicted by (c) DispNet, and …

**Figure 6.** (a) Illustration of the stereo event and intensity camera …

**Figure 7.** Qualitative results of motion deblurring of 9 different methods on the …

**Figure 8.** Qualitative results of motion deblurring of 9 different methods on the …

**Figure 9.** Multi-frame motion deblurring results on the …

**Figure 10.** Qualitative comparisons of the multi-frame motion deblurring on the …

**Figure 11.** Qualitative comparison of coarse disparity estimation with the input blurry image and events on the …

**Figure 12.** Qualitative ablation study for DispNet on the …

**Figure 13.** Qualitative ablation study for DblrNet on the …

Original abstract

Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a stereo pipeline for parallax in event deblurring but the no-GT-depths claim is probably limited to inference time.

The main takeaway is a coarse-to-fine network called St-EDNet that first aligns a blurry intensity image with stereo event streams using cross-modal matching, then refines the association with dual-feature embedding to recover sharp frames. They also release the StEIC dataset of real stereo events, images, and disparity maps. This setup directly targets the misalignment that most prior event deblurring work assumes away, which is a practical pain point when cameras are not perfectly registered.

Referee Report

2 major / 1 minor

Summary. The paper proposes St-EDNet, a coarse-to-fine framework for recovering sharp image sequences from misaligned inputs consisting of a single blurry intensity image and concurrent event streams captured by stereo event and intensity cameras. The approach first performs coarse spatial alignment via a cross-modal stereo matching module claimed to require no ground-truth depths, then uses a dual-feature embedding architecture to build fine bidirectional associations and reconstruct latent sharp images. A new StEIC dataset is introduced containing real-world events, intensity images, and dense disparity maps, with experiments asserting superiority over state-of-the-art methods on real-world datasets.

Significance. If the claims hold, the work addresses a practical limitation in event-based deblurring by enabling operation on misaligned stereo data without perfect pixel-wise alignment, potentially broadening applicability in real-world settings; the release of the StEIC dataset with disparity maps would also provide a useful resource for cross-modal stereo and deblurring research.

major comments (2)

[Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.
[Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.

minor comments (1)

[Abstract] The abstract introduces the acronyms St-EDNet and StEIC but does not expand them on first use or clarify their relation to the full framework name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we address each major comment point by point with clarifications.

Referee: [Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.

Authors: The cross-modal stereo matching module is trained via a self-supervised photometric consistency loss combined with event-specific constraints and does not use the dense disparity maps from StEIC as supervision (no disparity regression loss is applied). The disparity maps are included in the dataset solely to enable quantitative evaluation of the alignment module and to support future cross-modal stereo research; they play no role in training the module itself. This preserves the claimed independence from ground-truth depths at both training and inference time. We will revise the manuscript to explicitly state the self-supervised training procedure for the module. revision: yes
Referee: [Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.

Authors: Abstracts are concise summaries constrained by length limits and therefore omit detailed quantitative results, error bars, ablations, and statistics; these are fully reported in the Experiments section of the manuscript (including PSNR/SSIM with standard deviations, ablation tables, and dataset details). The superiority claims are supported by those experiments on real-world data. We can add one or two key quantitative highlights to the abstract if the editor requests it. revision: partial
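
For readers unfamiliar with the term used in the first response, a photometric consistency loss in its generic stereo form can be written as below. This is a textbook-style sketch, not the paper's actual objective: the symbols $A$ and $B$ (the two views, assumed to be mapped into a comparable representation), $d$ (predicted disparity), $\Omega$ (the set of valid pixels), and the sign convention are all assumptions made here.

```latex
\mathcal{L}_{\text{photo}}
  = \frac{1}{|\Omega|} \sum_{(x,y)\in\Omega}
    \bigl|\, A(x, y) - B\bigl(x - d(x, y),\, y\bigr) \bigr|
```

The loss is small when warping $B$ by the predicted disparity reproduces $A$, so it supervises $d$ without any ground-truth depth. How events and blurry intensities are made comparable for such a term is exactly the detail the referee asks the authors to spell out.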

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained

The paper proposes a coarse-to-fine neural architecture (St-EDNet) for deblurring from misaligned stereo event/intensity inputs. The cross-modal stereo matching module is presented as operating without ground-truth depths, and the dual-feature embedding proceeds from that alignment to image reconstruction. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the training uses the provided StEIC dataset but the architectural derivation and performance claims remain independent of any tautological loop. This is the normal case of an empirical method whose validity is tested externally rather than forced internally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of a learned neural architecture trained on real event-intensity pairs; the stereo matching step assumes sufficient overlap and texture for cross-modal correspondence.

free parameters (1)

Network weights
All parameters of the stereo matching module and dual-feature embedding network are fitted during training on the StEIC and other datasets.

axioms (1)

domain assumption Concurrent event streams and a single blurry intensity image contain sufficient information to recover sharp frames even under spatial misalignment
Invoked as the motivation and operating premise of the entire framework.

invented entities (2)

St-EDNet no independent evidence
purpose: Coarse-to-fine deblurring architecture
Newly proposed network; no independent evidence outside the paper.
StEIC dataset no independent evidence
purpose: Real-world training and evaluation data with events, intensity images, and disparity maps
Newly collected dataset; no external validation mentioned.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "coarse spatial alignment ... with a cross-modal stereo matching module without the need for ground-truth depths"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "DispNet ... U-Net-based architecture ... Pyramid Attention blocks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Learning to extract a video sequence from a single motion-blurred image,

M. Jin, G. Meishvili, and P. Favaro, “Learning to extract a video sequence from a single motion-blurred image,” in CVPR, 2018, pp. 6334–6342. 1, 2, 8, 9

work page 2018

[2] [2]

Bringing alive blurred moments,

K. Purohit, A. Shah, and A. Rajagopalan, “Bringing alive blurred moments,” in CVPR, 2019, pp. 6830–6839. 1, 2

work page 2019

[3] [3]

Photosequencing of motion blur using short and long exposures,

V . Rengarajan, S. Zhao, R. Zhen, J. Glotzbach, H. Sheikh, and A. C. Sankaranarayanan, “Photosequencing of motion blur using short and long exposures,” in CVPRW, 2020, pp. 510–511. 1

work page 2020

[4] [4]

A 128 ×128 120 db 15 µs latency asynchronous temporal contrast vision sensor,

P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128 ×128 120 db 15 µs latency asynchronous temporal contrast vision sensor,” IEEE J. Solid- State Circuits, vol. 43, no. 2, pp. 566–576, 2008. 1

work page 2008

[5] [5]

A 240 × 180 130 db 3 µs latency global shutter spatiotemporal vision sensor,

C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240 × 180 130 db 3 µs latency global shutter spatiotemporal vision sensor,” IEEE J. Solid-State Circuits , vol. 49, no. 10, pp. 2333–2341, 2014. 1

work page 2014

[6] [6]

Event- based vision: A survey,

G. Gallego, T. Delbr ¨uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis et al., “Event- based vision: A survey,”IEEE TPAMI, vol. 44, no. 1, pp. 154–180, 2020. 1, 2

work page 2020

[7] [7]

Bringing a blurry frame alive at high frame-rate with an event camera,

L. Pan, C. Scheerlinck, X. Yu, R. Hartley, M. Liu, and Y . Dai, “Bringing a blurry frame alive at high frame-rate with an event camera,” in CVPR, 2019, pp. 6820–6829. 1, 8, 9

work page 2019

[8] [8]

Learning event-driven video deblurring and interpolation,

S. Lin, J. Zhang, J. Pan, Z. Jiang, D. Zou, Y . Wang, J. Chen, and J. Ren, “Learning event-driven video deblurring and interpolation,” in ECCV, 2020, pp. 695–710. 1, 2, 3, 4, 8, 9

work page 2020

[9] [9]

Reducing the sim-to-real gap for event cameras,

T. Stoffregen, C. Scheerlinck, D. Scaramuzza, T. Drummond, N. Barnes, L. Kleeman, and R. Mahony, “Reducing the sim-to-real gap for event cameras,” in ECCV, 2020, pp. 534–549. 1

work page 2020

[10] [10]

Event enhanced high- quality image recovery,

B. Wang, J. He, L. Yu, G.-S. Xia, and W. Yang, “Event enhanced high- quality image recovery,” in ECCV, 2020, pp. 155–171. 1, 2, 3, 8, 9

work page 2020

[11] [11]

Motion deblurring with real events,

F. Xu, L. Yu, B. Wang, W. Yang, G.-S. Xia, X. Jia, Z. Qiao, and J. Liu, “Motion deblurring with real events,” in ICCV, 2021, pp. 2583–2592. 1, 2, 3, 4, 8, 9

[12] C. Song, Q. Huang, and C. Bajaj, “E-cir: Event-enhanced continuous intensity recovery,” in CVPR, 2022, pp. 7803–7812.

[13] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, “Dsec: A stereo event camera dataset for driving scenarios,” IEEE RAL, vol. 6, no. 3, pp. 4947–4954, 2021.

[14] H. Cho and K.-J. Yoon, “Event-image fusion stereo using cross-modality feature propagation,” in AAAI, 2022, pp. 882–890.

[15] D. Zhang, Q. Ding, P. Duan, C. Zhou, and B. Shi, “Data association between event streams and intensity frames under diverse baselines,” in ECCV,

[16] K. Huang, Y. Wang, and L. Kneip, “Dynamic event camera calibration,” in IROS, 2021, pp. 7021–7028.

[17] M. Muglikar, M. Gehrig, D. Gehrig, and D. Scaramuzza, “How to calibrate your event camera,” in CVPR, 2021, pp. 1403–1409.

[18] A. Heyden and M. Pollefeys, “Multiple view geometry,” Emerging Topics in Computer Vision, vol. 3, pp. 45–108, 2005.

[19] S. Tulyakov, D. Gehrig, S. Georgoulis, J. Erbach, M. Gehrig, Y. Li, and D. Scaramuzza, “Time lens: Event-based video frame interpolation,” in CVPR, 2021, pp. 16155–16164.

[20] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, “The multivehicle stereo event camera dataset: An event camera dataset for 3d perception,” IEEE RAL, vol. 3, no. 3, pp. 2032–2039, 2018.

[21] M. Jin, Z. Hu, and P. Favaro, “Learning to extract flawless slow motion from blurry videos,” in CVPR, 2019, pp. 8112–8121.

[22] Y. Bai, H. Jia, M. Jiang, X. Liu, X. Xie, and W. Gao, “Single-image blind deblurring using multi-scale latent structure prior,” IEEE TCSVT, vol. 30, no. 7, pp. 2033–2045, 2019.

[23] D. Krishnan, T. Tay, and R. Fergus, “Blind deconvolution using a normalized sparsity measure,” in CVPR, 2011, pp. 233–240.

[24] L. Sun, S. Cho, J. Wang, and J. Hays, “Edge-based blur kernel estimation using patch priors,” in ICCP, 2013, pp. 1–8.

[25] Y. Mao, Z. Wan, Y. Dai, and X. Yu, “Deep idempotent network for efficient single image blind deblurring,” IEEE TCSVT, vol. 33, no. 1, pp. 172–185, 2022.

[26] D. Gong, J. Yang, L. Liu, Y. Zhang, I. Reid, C. Shen, A. Van Den Hengel, and Q. Shi, “From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur,” in CVPR, 2017, pp. 2319–2328.

[27] H. Zhang and J. Yang, “Intra-frame deblurring by leveraging inter-frame camera motion,” in CVPR, 2015, pp. 4036–4044.

[28] T. M. Nimisha, A. Kumar Singh, and A. N. Rajagopalan, “Blur-invariant deep learning for blind-deblurring,” in ICCV, 2017, pp. 4752–4760.

[29] Y. Zhang, C. Wang, S. J. Maybank, and D. Tao, “Exposure trajectory recovery from motion blur,” IEEE TPAMI, vol. 44, no. 11, pp. 7490–7504, 2021.

[30] A. Sellent, C. Rother, and S. Roth, “Stereo video deblurring,” in ECCV, 2016, pp. 558–575.

[31] L. Pan, Y. Dai, M. Liu, F. Porikli, and Q. Pan, “Joint stereo video deblurring, scene flow estimation and moving object segmentation,” IEEE TIP, vol. 29, pp. 1748–1761, 2019.

[32] S. Zhou, J. Zhang, W. Zuo, H. Xie, J. Pan, and J. S. Ren, “Davanet: Stereo deblurring with view aggregation,” in CVPR, 2019, pp. 10996–11005.

[33] Z. Shen, Y. Dai, and Z. Rao, “Cfnet: Cascade and fused cost volume for robust stereo matching,” in CVPR, 2021, pp. 13906–13915.

[34] G. Xu, J. Cheng, P. Guo, and X. Yang, “Attention concatenation volume for accurate and efficient stereo matching,” in CVPR, 2022, pp. 12981–12990.

[35] J. Li, P. Wang, P. Xiong, T. Cai, Z. Yan, L. Yang, J. Liu, H. Fan, and S. Liu, “Practical stereo matching via cascaded recurrent network with adaptive correlation,” in CVPR, 2022, pp. 16263–16272.

[36] H. Xu and J. Zhang, “Aanet: Adaptive aggregation network for efficient stereo matching,” in CVPR, 2020, pp. 1959–1968.

[37] Z. Li, X. Liu, N. Drenkow, A. Ding, F. X. Creighton, R. H. Taylor, and M. Unberath, “Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers,” in ICCV, 2021, pp. 6197–6206.

[38] H. Zhang, X. Ye, S. Chen, Z. Wang, H. Li, and W. Ouyang, “The farther the better: Balanced stereo matching via depth-based sampling and adaptive feature refinement,” IEEE TCSVT, vol. 32, no. 7, pp. 4613–4625, 2021.

[39] Z. Wang, L. Pan, Y. Ng, Z. Zhuang, and R. Mahony, “Stereo hybrid event-frame (shef) cameras for 3d perception,” in IROS, 2021, pp. 9758–

[40] H. Kim, S. Lee, J. Kim, and H. J. Kim, “Real-time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance,” IEEE RAL, vol. 8, no. 1, pp. 416–423, 2022.

[41] Y.-F. Zuo, L. Cui, X. Peng, Y. Xu, S. Gao, X. Wang, and L. Kneip, “Accurate depth estimation from a hybrid event-rgb stereo setup,” in IROS, 2021, pp. 6833–6840.

[42] J. Gu, J. Zhou, R. S. W. Chu, Y. Chen, J. Zhang, X. Cheng, S. Zhang, and J. S. Ren, “Self-supervised intensity-event stereo matching,” Journal of Imaging Science and Technology, vol. 66, pp. 1–16, 2022.

[43] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016, pp. 1874–1883.

[44] H. Gao, X. Tao, X. Shen, and J. Jia, “Dynamic scene deblurring with parameter selective sharing and nested skip connections,” in CVPR, 2019, pp. 3848–3856.

[45] S.-J. Cho, S.-W. Ji, J.-P. Hong, S.-W. Jung, and S.-J. Ko, “Rethinking coarse-to-fine approach in single image deblurring,” in ICCV, 2021, pp. 4641–4650.

[46] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.

[47] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid attention network for semantic segmentation,” arXiv preprint arXiv:1805.10180, 2018.

[48] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in NeurIPS, vol. 28, 2015.

[49] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016, pp. 694–711.

[50] H. Ye, Y. Chen, and M. Liu, “Tightly coupled 3d lidar inertial odometry and mapping,” in ICRA, 2019, pp. 3144–3150.

[51] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “Events-to-video: Bringing modern computer vision to event cameras,” in CVPR, 2019, pp. 3857–3866.

[52] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou, “Real-time intermediate flow estimation for video frame interpolation,” in ECCV, 2022, pp. 624–642.

[53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE TIP, vol. 13, no. 4, pp. 600–612, 2004.

[54] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.

[55] H. Rebecq, R. Ranftl, V. Koltun, and D. Scaramuzza, “High speed and high dynamic range video with an event camera,” IEEE TPAMI, vol. 43, no. 6, pp. 1964–1980, 2019.
