Learning Parallax for Stereo Event-based Motion Deblurring
Pith reviewed 2026-05-24 07:06 UTC · model grok-4.3
The pith
St-EDNet recovers sharp images from misaligned blurry photos and event streams by learning coarse-to-fine parallax alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that high-quality sharp image sequences can be recovered from misaligned blurry images and concurrent event streams by first applying cross-modal stereo matching for coarse alignment without ground-truth depths, followed by dual-feature embedding to build fine bidirectional associations and perform reconstruction.
What carries the argument
The St-EDNet framework, which combines a cross-modal stereo matching module for coarse spatial alignment of blurry images and events with a dual-feature embedding architecture for fine association and sharp image sequence reconstruction.
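To make the coarse alignment step concrete, here is a minimal sketch. It is not the paper's module. It assumes rectified views, so that misalignment is purely horizontal, and it assumes a disparity map is already available (in St-EDNet that map would come from the learned stereo matching module). The function names, the sign convention `out[x] = row[x + d]`, and the toy event-count frame are all illustrative assumptions.

```python
def warp_row_by_disparity(row, disparity):
    """Resample one row so that out[x] = row[x + disparity[x]].

    Uses linear interpolation for fractional disparities and clamps
    source positions to the row boundaries.
    """
    width = len(row)
    out = []
    for x in range(width):
        src = x + disparity[x]
        src = min(max(src, 0.0), width - 1.0)  # clamp to valid columns
        left = int(src)
        right = min(left + 1, width - 1)
        frac = src - left
        out.append((1.0 - frac) * row[left] + frac * row[right])
    return out


def warp_image_by_disparity(image, disparity_map):
    """Apply the row-wise warp to every row of a 2D frame (list of rows)."""
    return [warp_row_by_disparity(r, d) for r, d in zip(image, disparity_map)]


# Hypothetical event-count frame whose activity sits 2 px to the right
# of where the intensity camera sees the same structure.
event_frame = [[0, 0, 0, 0, 5, 0]]
disparity_map = [[2.0] * 6]

aligned = warp_image_by_disparity(event_frame, disparity_map)
print(aligned)  # the event activity moves from column 4 to column 2
```

The warp only performs the resampling. The hard part, which the paper assigns to its stereo matching module, is estimating the disparity map from two different modalities.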
If this is right
- Deblurring becomes possible with real-world inputs that lack perfect pixel-wise alignment between intensity images and events.
- A single blurry image plus concurrent events suffice as input, without additional aligned data.
- The new StEIC dataset supplies real stereo events, intensity images, and dense disparity maps for training and benchmarking.
- The approach produces a sequence of latent sharp images rather than a single output frame.
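The last point can be illustrated with the blur model commonly assumed in motion deblurring work, in which a blurry frame is the temporal average of the latent sharp frames over the exposure. The sketch below is a toy under that assumption, not the paper's formulation; `moving_dot_sequence` and `average_frames` are hypothetical helper names.

```python
def moving_dot_sequence(width, steps):
    """Latent sharp frames: one bright pixel moving one column per step."""
    return [[[1.0 if x == t else 0.0 for x in range(width)]]
            for t in range(steps)]


def average_frames(frames):
    """Pixel-wise mean of equally sized frames (each a list of rows)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]


sharp = moving_dot_sequence(width=6, steps=4)  # four latent sharp frames
blurry = average_frames(sharp)                 # one blurry observation
print(blurry)  # [[0.25, 0.25, 0.25, 0.25, 0.0, 0.0]]
```

Going from `blurry` back to the four frames in `sharp` is ill-posed from the image alone, since the streak does not reveal the direction or timing of motion. This is the gap that the concurrent event stream is meant to fill.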
Where Pith is reading between the lines
- The stereo matching step could be adapted to other multi-modal pairs that suffer from spatial misalignment, such as event-LiDAR fusion.
- Extending the coarse-to-fine pipeline to handle varying event densities might improve robustness in low-light or high-speed scenes.
- Controlled synthetic experiments that vary the degree of initial misalignment could isolate the contribution of the stereo module.
Load-bearing premise
Cross-modal stereo matching can produce sufficient coarse spatial alignment between the blurry image and event streams without ground-truth depths.
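One way such alignment can be learned without ground-truth depths is a photometric consistency objective, which the simulated rebuttal further down names as the training signal. The sketch below is a deliberately simplified toy: it treats both views as same-modality intensity rows, uses a single integer disparity for the whole frame, and replaces learning with a brute-force search. All function names are hypothetical, and the event-specific constraints mentioned in the rebuttal are not modeled.

```python
def shift_row(row, d):
    """Integer warp: out[x] = row[x + d], clamped at the row boundaries."""
    w = len(row)
    return [row[min(max(x + d, 0), w - 1)] for x in range(w)]


def photometric_loss(reference, source, d):
    """Mean absolute difference between reference and source warped by d."""
    total, count = 0.0, 0
    for ref_row, src_row in zip(reference, source):
        for a, b in zip(ref_row, shift_row(src_row, d)):
            total += abs(a - b)
            count += 1
    return total / count


def best_disparity(reference, source, candidates):
    """Pick the candidate disparity with the lowest photometric loss."""
    return min(candidates, key=lambda d: photometric_loss(reference, source, d))


reference = [[0, 1, 4, 9, 4, 1, 0, 0, 0, 0]]
source = [[0, 0, 0, 0, 1, 4, 9, 4, 1, 0]]  # same pattern, 3 px to the right

estimate = best_disparity(reference, source, range(0, 6))
print(estimate)  # 3, recovered without any disparity label
```

The loss compares only the two views, so no disparity label enters the objective. Whether this signal remains strong enough when one view is a blurry image and the other is an event stream is exactly what the premise above depends on.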
What would settle it
Real-world test sequences where the initial misalignment exceeds what the stereo matching module can correct, resulting in reconstruction quality no better than methods that assume perfect alignment.
Original abstract
Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes St-EDNet, a coarse-to-fine framework for recovering sharp image sequences from misaligned inputs consisting of a single blurry intensity image and concurrent event streams captured by stereo event and intensity cameras. The approach first performs coarse spatial alignment via a cross-modal stereo matching module claimed to require no ground-truth depths, then uses a dual-feature embedding architecture to build fine bidirectional associations and reconstruct latent sharp images. A new StEIC dataset is introduced containing real-world events, intensity images, and dense disparity maps, with experiments asserting superiority over state-of-the-art methods on real-world datasets.
Significance. If the claims hold, the work addresses a practical limitation in event-based deblurring by enabling operation on misaligned stereo data without perfect pixel-wise alignment, potentially broadening applicability in real-world settings; the release of the StEIC dataset with disparity maps would also provide a useful resource for cross-modal stereo and deblurring research.
major comments (2)
- [Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.
- [Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.
minor comments (1)
- [Abstract] The abstract introduces the acronyms St-EDNet and StEIC but does not expand them on first use or clarify their relation to the full framework name.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. Below we address each major comment point by point with clarifications.
- Referee: [Abstract] Abstract: The central claim that 'the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths' is load-bearing for the no-GT independence assertion, yet the StEIC dataset supplies dense disparity maps; if these supervise the matching module (e.g., via disparity regression loss during training), the method depends on paired depth data for learning and only avoids GT at inference, weakening the stated independence.
  Authors: The cross-modal stereo matching module is trained via a self-supervised photometric consistency loss combined with event-specific constraints and does not use the dense disparity maps from StEIC as supervision (no disparity regression loss is applied). The disparity maps are included in the dataset solely to enable quantitative evaluation of the alignment module and to support future cross-modal stereo research; they play no role in training the module itself. This preserves the claimed independence from ground-truth depths at both training and inference time. We will revise the manuscript to explicitly state the self-supervised training procedure for the module.
  Revision: yes
- Referee: [Abstract] Abstract (framework description): No quantitative results, error bars, ablation studies, or dataset statistics are provided to support the asserted superiority on real-world datasets, making it impossible to assess whether the coarse-to-fine pipeline actually delivers the claimed performance gains over baselines that assume perfect alignment.
  Authors: Abstracts are concise summaries constrained by length limits and therefore omit detailed quantitative results, error bars, ablations, and statistics; these are fully reported in the Experiments section of the manuscript (including PSNR/SSIM with standard deviations, ablation tables, and dataset details). The superiority claims are supported by those experiments on real-world data. We can add one or two key quantitative highlights to the abstract if the editor requests it.
  Revision: partial
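For readers unfamiliar with the metrics named in the rebuttal, the following sketch shows how PSNR and a mean with standard deviation across test images might be computed. It assumes images are lists of rows with values in [0, 1]; the function names and the example scores are illustrative and are not results from the paper. SSIM is omitted because it needs a windowed implementation.

```python
import math
import statistics


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error, n = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            squared_error += (r - e) ** 2
            n += 1
    mse = squared_error / n
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def summarize(scores):
    """Mean and sample standard deviation of per-image scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Toy check: a uniform error of 0.1 on a [0, 1] image gives about 20 dB.
print(psnr([[0.0, 1.0]], [[0.1, 0.9]]))

# Made-up per-image scores, used only to show the reporting format.
mean_db, std_db = summarize([20.0, 22.0, 24.0])
print(f"{mean_db:.2f} +/- {std_db:.2f} dB")
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between two methods exceeds the variation across test images.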
Circularity Check
No significant circularity; framework is self-contained
The paper proposes a coarse-to-fine neural architecture (St-EDNet) for deblurring from misaligned stereo event/intensity inputs. The cross-modal stereo matching module is presented as operating without ground-truth depths, and the dual-feature embedding proceeds from that alignment to image reconstruction. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the training uses the provided StEIC dataset but the architectural derivation and performance claims remain independent of any tautological loop. This is the normal case of an empirical method whose validity is tested externally rather than forced internally.
Axiom & Free-Parameter Ledger
free parameters (1)
- Network weights
axioms (1)
- Domain assumption: concurrent event streams and a single blurry intensity image contain sufficient information to recover sharp frames even under spatial misalignment.
invented entities (2)
- St-EDNet: no independent evidence
- StEIC dataset: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "coarse spatial alignment ... with a cross-modal stereo matching module without the need for ground-truth depths"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: "DispNet ... U-Net-based architecture ... Pyramid Attention blocks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.