
arxiv: 2309.09513 · v1 · submitted 2023-09-18 · 💻 cs.CV

Learning Parallax for Stereo Event-based Motion Deblurring

Pith reviewed 2026-05-24 07:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords: event-based vision · motion deblurring · stereo matching · parallax learning · image reconstruction · multi-modal fusion

The pith

St-EDNet recovers sharp images from misaligned blurry photos and event streams by learning coarse-to-fine parallax alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents St-EDNet, a coarse-to-fine framework that recovers sequences of sharp images directly from a single blurry intensity image paired with concurrent but misaligned event streams. It begins with a cross-modal stereo matching module that performs coarse spatial alignment without any ground-truth depth data. A dual-feature embedding architecture then refines the bidirectional association between the coarsely aligned inputs and reconstructs the latent sharp frames. The authors also introduce the StEIC dataset of real stereo event and intensity captures with dense disparity maps. Experiments show the network outperforms prior methods on real-world misaligned data.

Core claim

The central claim is that high-quality sharp image sequences can be recovered from misaligned blurry images and concurrent event streams by first applying cross-modal stereo matching for coarse alignment without ground-truth depths, followed by dual-feature embedding to build fine bidirectional associations and perform reconstruction.

What carries the argument

The St-EDNet framework carries the argument: a cross-modal stereo matching module coarsely aligns the blurry image with the events, and a dual-feature embedding architecture then builds the fine association and reconstructs the sharp image sequence.
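
The coarse alignment step can be pictured as resampling the event data along image rows with a per-pixel disparity map. The sketch below is a minimal NumPy illustration under stated assumptions, not the paper's matching module or its warping operator: events are assumed to be binned into a hypothetical `(T, H, W)` stack, the cameras are assumed rectified so that disparity is purely horizontal, and the names `warp_by_disparity` and `align_event_stack` are invented for this example.

```python
import numpy as np


def warp_by_disparity(src: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Resample src (H, W) so that out[y, x] = src[y, x - disparity[y, x]].

    Uses linear interpolation along x. Samples that fall outside the
    image contribute zero. The sign convention is an assumption.
    """
    h, w = src.shape
    # Source x-coordinate for every output pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    x0 = np.floor(xs).astype(int)
    x1 = x0 + 1
    frac = xs - x0
    rows = np.arange(h)[:, None]

    def gather(xi: np.ndarray) -> np.ndarray:
        # Read src at integer columns xi, zeroing out-of-range samples.
        valid = (xi >= 0) & (xi < w)
        vals = src[rows, np.clip(xi, 0, w - 1)]
        return np.where(valid, vals, 0.0)

    return (1.0 - frac) * gather(x0) + frac * gather(x1)


def align_event_stack(events: np.ndarray, disparity: np.ndarray) -> np.ndarray:
    """Warp every temporal bin of a (T, H, W) event stack with one disparity map."""
    return np.stack([warp_by_disparity(e, disparity) for e in events])
```

In this picture, a learned module would supply `disparity`, and a second stage would refine whatever residual misalignment the coarse warp leaves behind.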

If this is right

  • Deblurring becomes possible with real-world inputs that lack perfect pixel-wise alignment between intensity images and events.
  • A single blurry image plus concurrent events suffice as input, without additional aligned data.
  • The new StEIC dataset supplies real stereo events, intensity images, and dense disparity maps for training and benchmarking.
  • The approach produces a sequence of latent sharp images rather than a single output frame.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The stereo matching step could be adapted to other multi-modal pairs that suffer from spatial misalignment, such as event-LiDAR fusion.
  • Extending the coarse-to-fine pipeline to handle varying event densities might improve robustness in low-light or high-speed scenes.
  • Controlled synthetic experiments that vary the degree of initial misalignment could isolate the contribution of the stereo module.

Load-bearing premise

Cross-modal stereo matching can produce sufficient coarse spatial alignment between the blurry image and event streams without ground-truth depths.
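
For context, the reason alignment reduces to disparity estimation can be stated with the standard rectified-stereo relation. This is textbook geometry rather than a formula taken from the paper; the symbols $f$ (focal length in pixels), $b$ (baseline between the two cameras), $Z$ (scene depth), and $d$ (horizontal disparity) are notation assumed here, and the sign of the shift depends on which camera sits on the left.

```latex
d(x, y) = \frac{f\, b}{Z(x, y)}, \qquad
x_{\text{event}} = x_{\text{image}} - d(x, y)
```

Because $Z$ varies across the scene, no single global shift aligns the two views, so a per-pixel disparity estimate is needed.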

What would settle it

Real-world test sequences whose initial misalignment exceeds what the stereo matching module can correct, and on which reconstruction quality is therefore no better than that of methods assuming perfect alignment.

Figures

Figures reproduced from arXiv: 2309.09513 by Chi Zhang, Chu He, Lei Yu, Mingyuan Lin.

Figure 1: Illustrative examples of the impact of the misaligned
Figure 2: The overall structure of the proposed St-EDNet, which consists of two modules: DispNet and DblrNet. DispNet estimates
Figure 3: Overview of the DispNet
Figure 5: Intermediate disparities predicted by (c) DispNet, and
Figure 6: (a) Illustration of the stereo event and intensity camera
Figure 7: Qualitative results of motion deblurring of 9 different methods on the
Figure 8: Qualitative results of motion deblurring of 9 different methods on the
Figure 9: Multi-frame motion deblurring results on the
Figure 10: Qualitative comparisons of the multi-frame motion deblurring on the
Figure 11: Qualitative comparison of coarse disparity estimation with the input blurry image and events on the
Figure 12: Qualitative ablation study for DispNet on the
Figure 13: Qualitative ablation study for DblrNet on the

Original abstract

Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.
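
The abstract does not spell out why events can "supplement lost information". A common formulation in the event-based deblurring literature, given here as background and not as the paper's own notation, models the blurry image $B$ as the temporal average of latent sharp frames $L(t)$ over the exposure $[t_0, t_0 + T]$, and models an event of polarity $p$ as a threshold crossing of log intensity. The contrast threshold $c$ and the reference time $t_{\mathrm{ref}}$ (the time of the previous event at the same pixel) are assumed symbols.

```latex
B(\mathbf{x}) = \frac{1}{T} \int_{t_0}^{t_0 + T} L(\mathbf{x}, t)\, dt,
\qquad
p \left( \log L(\mathbf{x}, t) - \log L(\mathbf{x}, t_{\mathrm{ref}}) \right) \ge c,
\quad p \in \{+1, -1\}
```

Both relations are written at the same pixel $\mathbf{x}$, so they presuppose that the image and the events share pixel coordinates. The stereo setting addressed in the paper breaks that assumption.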

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes St-EDNet, a coarse-to-fine framework for recovering sharp image sequences from misaligned inputs consisting of a single blurry intensity image and concurrent event streams captured by stereo event and intensity cameras. The approach first performs coarse spatial alignment via a cross-modal stereo matching module claimed to require no ground-truth depths, then uses a dual-feature embedding architecture to build fine bidirectional associations and reconstruct latent sharp images. A new StEIC dataset is introduced containing real-world events, intensity images, and dense disparity maps, with experiments asserting superiority over state-of-the-art methods on real-world datasets.

Significance. If the claims hold, the work addresses a practical limitation in event-based deblurring by enabling operation on misaligned stereo data without perfect pixel-wise alignment, potentially broadening applicability in real-world settings; the release of the StEIC dataset with disparity maps would also provide a useful resource for cross-modal stereo and deblurring research.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.
  2. [Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.
minor comments (1)
  1. [Abstract] The abstract introduces the acronyms St-EDNet and StEIC but does not expand them on first use or clarify their relation to the full framework name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we address each major comment point by point with clarifications.

  1. Referee: [Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.

    Authors: The cross-modal stereo matching module is trained via a self-supervised photometric consistency loss combined with event-specific constraints and does not use the dense disparity maps from StEIC as supervision (no disparity regression loss is applied). The disparity maps are included in the dataset solely to enable quantitative evaluation of the alignment module and to support future cross-modal stereo research; they play no role in training the module itself. This preserves the claimed independence from ground-truth depths at both training and inference time. We will revise the manuscript to explicitly state the self-supervised training procedure for the module. revision: yes

  2. Referee: [Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.

    Authors: Abstracts are concise summaries constrained by length limits and therefore omit detailed quantitative results, error bars, ablations, and statistics; these are fully reported in the Experiments section of the manuscript (including PSNR/SSIM with standard deviations, ablation tables, and dataset details). The superiority claims are supported by those experiments on real-world data. We can add one or two key quantitative highlights to the abstract if the editor requests it. revision: partial
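
The first response above names a self-supervised photometric consistency loss without giving its form. The sketch below shows a generic version of such a loss as used in self-supervised stereo: one view is warped into the other with the predicted disparity, and the mean absolute difference is taken over pixels whose source falls inside the image. It is not the authors' loss. The function name `photometric_l1`, the single-channel inputs, and the nearest-neighbor sampling are assumptions made for brevity; actual training would need a differentiable sampler in an autodiff framework, and the event-specific constraints mentioned in the response are not modeled.

```python
import numpy as np


def photometric_l1(left: np.ndarray, right: np.ndarray, disparity: np.ndarray) -> float:
    """Mean |left - warp(right)| over pixels with an in-bounds source.

    Assumes rectified views, so left[y, x] should match right[y, x - d].
    Disparity is rounded to the nearest integer column (illustrative only).
    """
    h, w = left.shape
    xs = np.arange(w)[None, :] - np.rint(disparity).astype(int)
    valid = (xs >= 0) & (xs < w)
    rows = np.arange(h)[:, None]
    # Reconstruct the left view by sampling the right view.
    recon = right[rows, np.clip(xs, 0, w - 1)]
    err = np.abs(left - recon)
    # Average only where the sampled column exists.
    return float(err[valid].mean()) if valid.any() else 0.0
```

A loss of this kind needs no disparity labels: the correct disparity is the one that makes the warped view agree with the other view, which is how a matching module could in principle be trained while the dataset's disparity maps are held back for evaluation.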

Circularity Check

0 steps flagged

No significant circularity; framework is self-contained

The paper proposes a coarse-to-fine neural architecture (St-EDNet) for deblurring from misaligned stereo event/intensity inputs. The cross-modal stereo matching module is presented as operating without ground-truth depths, and the dual-feature embedding proceeds from that alignment to image reconstruction. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the training uses the provided StEIC dataset but the architectural derivation and performance claims remain independent of any tautological loop. This is the normal case of an empirical method whose validity is tested externally rather than forced internally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of a learned neural architecture trained on real event-intensity pairs; the stereo matching step assumes sufficient overlap and texture for cross-modal correspondence.

free parameters (1)
  • Network weights
    All parameters of the stereo matching module and dual-feature embedding network are fitted during training on the StEIC and other datasets.
axioms (1)
  • Domain assumption: Concurrent event streams and a single blurry intensity image contain sufficient information to recover sharp frames even under spatial misalignment.
    Invoked as the motivation and operating premise of the entire framework.
invented entities (2)
  • St-EDNet (no independent evidence)
    purpose: Coarse-to-fine deblurring architecture
    Newly proposed network; no independent evidence outside the paper.
  • StEIC dataset (no independent evidence)
    purpose: Real-world training and evaluation data with events, intensity images, and disparity maps
    Newly collected dataset; no external validation mentioned.



