pith. machine review for the scientific record.

arxiv: 2602.19202 · v2 · submitted 2026-02-22 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords: event camera · video reconstruction · diffusion model · frame interpolation · zero-shot prediction · generative prior · event-to-frame

The pith

Pre-trained video diffusion models reconstruct high-fidelity frames from sparse event camera streams when guided by inter-frame residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that event cameras lose absolute intensity and static texture because they record only changes, and that this loss can be recovered by conditioning a pre-trained video diffusion model on the event stream. It introduces an event-based inter-frame residual guidance term that exploits the physical correlation between events and frame differences to improve reconstruction accuracy. The same framework is then extended without retraining to video interpolation and prediction by modulating the diffusion sampling process. If these steps hold, event data can be turned into dense, high-quality video output that outperforms prior specialized methods on both real and synthetic benchmarks.

Core claim

A baseline that feeds event data directly as conditioning to a video diffusion model already produces usable frames; adding event-based inter-frame residual guidance, which injects the difference between consecutive reconstructed frames as an additional control signal, raises fidelity further. The same conditioned reverse process supports zero-shot frame interpolation and future-frame prediction simply by changing the number and timing of sampling steps.
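
As a concrete reading of that claim, the sketch below shows how such a sampler could be organized, in PyTorch-style Python; the function and argument names, the residual stand-in, and the linear re-noising rule are illustrative assumptions, not the paper's implementation.

import torch

def sample_frames(denoiser, event_cond, num_steps=50, guidance_scale=1.0):
    """Sketch of an event-conditioned reverse-diffusion sampler (hypothetical API).

    denoiser(x, t, cond) is assumed to return an estimate of the clean frames;
    event_cond holds encoded event voxels aligned to the requested timestamps,
    so reconstruction, interpolation, and future prediction differ only in
    which timestamps event_cond covers (the zero-shot claim above).
    """
    x = torch.randn_like(event_cond)  # every requested frame starts from noise
    for step in reversed(range(1, num_steps + 1)):
        t = torch.full((event_cond.shape[0],), step / num_steps)
        x0_hat = denoiser(x, t, event_cond)  # event-conditioned denoising estimate
        # Inter-frame residual guidance: pull the difference between consecutive
        # predicted frames toward the residual implied by the events.
        pred_residual = x0_hat[1:] - x0_hat[:-1]
        event_residual = event_cond[1:] - event_cond[:-1]  # stand-in for the event-derived residual
        corrected = x0_hat[1:] - guidance_scale * (pred_residual - event_residual)
        x0_hat = torch.cat([x0_hat[:1], corrected], dim=0)
        # Re-noise toward the next noise level; a real sampler would apply the
        # scheduler's update rule (e.g. DDIM or EDM) rather than this linear blend.
        sigma = (step - 1) / num_steps
        x = (1 - sigma) * x0_hat + sigma * torch.randn_like(x0_hat)
    return x

Under this reading, switching between reconstruction, interpolation, and prediction changes only which timestamps the conditioning tensor is built for, which is exactly what the core claim asserts.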

What carries the argument

Event-based inter-frame residual guidance: a conditioning signal derived from the physical difference between successive frames, injected into the diffusion reverse process alongside the event stream.
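
A hedged illustration of how that signal could be computed straight from the raw event stream, using the standard model in which each event marks one contrast-threshold step of log intensity; the function name, argument layout, and threshold value are illustrative, not taken from the paper.

import numpy as np

def event_interframe_residual(xs, ys, ts, ps, t0, t1, height, width, contrast=0.2):
    """Approximate log I(t1) - log I(t0) per pixel from events in (t0, t1].

    xs, ys: integer pixel coordinates; ts: timestamps; ps: polarities in {+1, -1}.
    contrast: assumed sensor contrast threshold C (illustrative value).
    """
    in_window = (ts > t0) & (ts <= t1)
    residual = np.zeros((height, width), dtype=np.float32)
    # Each event contributes one signed contrast step at its pixel.
    np.add.at(residual, (ys[in_window], xs[in_window]), contrast * ps[in_window])
    return residual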

If this is right

  • Event streams can be converted to dense video without task-specific training once a video diffusion prior is available.
  • The same model supports both reconstruction and temporal tasks (interpolation, prediction) by changing only the sampling schedule.
  • Quantitative gains appear on both synthetic and real event datasets, indicating the guidance term transfers across domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If residual guidance proves robust, the method could reduce the need for paired event-frame training data in future event-vision pipelines.
  • Extending the same conditioning to longer sequences might allow consistent video generation from very sparse event input over many seconds.
  • The framework suggests that other sparse sensors (e.g., lidar point clouds) could be paired with video diffusion priors using analogous residual signals.

Load-bearing premise

The physical correlation between recorded events and actual frame intensity differences is strong enough that residual guidance computed from reconstructed frames remains accurate.
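
In the standard event-camera model this premise invokes (also summarized in the abstract and the circularity check), a pixel emits an event of polarity p whenever its log intensity changes by the contrast threshold C, so the polarity-weighted event sum over an interval approximates the inter-frame log-intensity residual; the notation below is generic rather than the paper's:

\log I(\mathbf{x}, t) - \log I(\mathbf{x}, t_{\mathrm{ref}}) = p\,C, \quad p \in \{+1, -1\},
\qquad \sum_{e_k \in (t,\, t + \Delta t]} p_k\, C \;\approx\; \log I_{t+\Delta t}(\mathbf{x}) - \log I_t(\mathbf{x}).

The premise is that this approximation stays tight enough, despite sensor noise and threshold mismatch, that a residual checked against reconstructed rather than ground-truth frames still steers the sampler in the right direction.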

What would settle it

A controlled ablation on a held-out real-world event dataset in which removing the residual guidance term produces a statistically significant drop in PSNR or perceptual metrics compared with the full method.
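
A sketch of the kind of check this describes, assuming per-sequence PSNR scores for the full method and the no-guidance ablation have already been computed on the held-out set; the helper name is hypothetical, and a Wilcoxon signed-rank test would be a reasonable non-parametric alternative to the paired t-test used here.

import numpy as np
from scipy import stats

def guidance_ablation_test(psnr_full, psnr_no_guidance, alpha=0.05):
    """Paired test over per-sequence PSNR: does removing residual guidance hurt?"""
    full = np.asarray(psnr_full, dtype=float)
    ablated = np.asarray(psnr_no_guidance, dtype=float)
    gain = full - ablated  # positive values mean guidance helps
    t_stat, p_value = stats.ttest_rel(full, ablated)
    return {
        "mean_gain_db": float(gain.mean()),
        "std_gain_db": float(gain.std(ddof=1)),
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
        "significant_gain": bool(p_value < alpha and gain.mean() > 0),
    }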

read the original abstract

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UniE2F, a unified diffusion framework that reconstructs high-fidelity video frames from sparse event-camera data by conditioning a pre-trained video diffusion model on event streams. It first builds a baseline via direct event conditioning, then adds event-based inter-frame residual guidance derived from physical log-intensity change correlations, and extends the approach to zero-shot interpolation and prediction by modulating the reverse diffusion sampling process. Experiments on real-world and synthetic datasets are reported to show quantitative and qualitative gains over prior methods, with code promised to be released.

Significance. If the experimental claims hold under closer scrutiny, the work offers a practical way to leverage large-scale video generative priors for event-to-frame tasks, addressing the inherent loss of absolute intensity and texture in event data. The unified treatment of reconstruction, interpolation, and prediction within a single sampling framework is a clear strength, as is the explicit grounding in the standard event-camera model (events as log-intensity differences). Public code release would further support reproducibility.

major comments (2)
  1. [§4 Experiments] Quantitative results are presented without error bars, standard deviations across runs, or dataset statistics (e.g., event density distributions or frame counts). This omission makes it difficult to assess whether the reported gains over baselines are statistically reliable or sensitive to particular data splits.
  2. [§3.2 Residual Guidance and §4] The claim that the event-based inter-frame residual guidance produces accurate rather than merely plausible reconstructions rests on the physical correlation between events and intensity changes, yet no ablation isolates its contribution versus the baseline conditioning alone, nor are there direct comparisons against ground-truth intensity values beyond standard perceptual metrics.
minor comments (2)
  1. [Abstract] The statement that the method 'significantly outperforms previous approaches' would benefit from one or two concrete metric values (e.g., PSNR or LPIPS deltas) to give readers an immediate sense of scale.
  2. [Figure captions and §4] Several qualitative comparisons lack explicit indication of which rows/columns correspond to which method or dataset; adding this would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recommendation of minor revision, and the constructive comments. We address each major point below and will update the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4 Experiments] Quantitative results are presented without error bars, standard deviations across runs, or dataset statistics (e.g., event density distributions or frame counts). This omission makes it difficult to assess whether the reported gains over baselines are statistically reliable or sensitive to particular data splits.

    Authors: We agree that error bars and additional dataset statistics would improve clarity. In the revised manuscript we will report standard deviations across runs with error bars in all quantitative tables and add dataset statistics including event density distributions and frame counts per sequence. revision: yes

  2. Referee: [§3.2 Residual Guidance and §4] The claim that the event-based inter-frame residual guidance produces accurate rather than merely plausible reconstructions rests on the physical correlation between events and intensity changes, yet no ablation isolates its contribution versus the baseline conditioning alone, nor are there direct comparisons against ground-truth intensity values beyond standard perceptual metrics.

    Authors: We agree that an explicit ablation would strengthen the evidence for the residual guidance. We will add an ablation study in Section 4 comparing the baseline event conditioning against the full model with inter-frame residual guidance. Note that our reported PSNR and SSIM metrics already constitute direct pixel-level comparisons to ground-truth intensity; we will clarify this point and retain the perceptual metrics as complementary. revision: yes
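
For context on that point, PSNR and SSIM are indeed pixel-level comparisons against ground truth; a minimal computation using scikit-image is sketched below, with the [0, 1] intensity range and trailing channel axis as assumptions rather than details from the paper.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(ground_truth, reconstruction):
    """PSNR and SSIM for one reconstructed frame, assuming float images in [0, 1]."""
    gt = np.clip(ground_truth, 0.0, 1.0)
    rec = np.clip(reconstruction, 0.0, 1.0)
    psnr = peak_signal_noise_ratio(gt, rec, data_range=1.0)
    # channel_axis=-1 assumes color frames; drop it for grayscale inputs.
    ssim = structural_similarity(gt, rec, data_range=1.0, channel_axis=-1)
    return psnr, ssim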

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper conditions a pre-trained external video diffusion model on event data and augments it with inter-frame residual guidance derived from the standard event-camera physical model (events encode log-intensity differences). No equations reduce predictions to fitted parameters defined from the target data, no self-citation chains justify uniqueness or ansatzes, and no known results are merely renamed. The central claims rest on established diffusion conditioning and the well-known event-to-intensity mapping, making the derivation independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the generative capability of an external pre-trained video diffusion model and the existence of a usable physical correlation between event streams and absolute intensity frames; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: a physical correlation between the event stream and video frames exists and can be used as guidance
    Invoked to justify the inter-frame residual guidance step

pith-pipeline@v0.9.0 · 5508 in / 1246 out tokens · 33608 ms · 2026-05-15T20:33:10.646828+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 1 internal anchor
