A unified Benchmark for Multi-Frame Image Restoration under Severe Refractive Warping
Pith reviewed 2026-05-08 17:08 UTC · model grok-4.3
The pith
A unified benchmark systematically tests video restoration methods across mild to extreme refractive warping using real lab data and physics-modeled synthetics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a benchmark dataset and protocol that spans the full range of refractive warping, from turbulence-like mild distortions to strong discontinuous deformations, by pairing laboratory-captured real sequences with synthetic sequences generated via physics-based light refraction modeling, then uses it to compare restoration performance across classical and modern methods with both accuracy and perceptual metrics.
What carries the argument
The physics-based light refraction modeling that generates synthetic video sequences at controlled distortion levels, combined with real laboratory captures, to produce comparable test cases for geometric restoration algorithms.
If this is right
- Restoration methods can now be ranked consistently across a continuous gradient of distortion severity instead of only mild turbulence.
- Diffusion-based approaches such as V-cache become testable specifically on high and extreme distortion regimes where classical registration fails.
- Perceptual metrics supplement pixel metrics to expose quality differences invisible to PSNR or SSIM alone.
- The benchmark supplies a common reference for training and validating new multi-frame restoration networks aimed at unstable optical environments.
Where Pith is reading between the lines
- The same modeling approach could be adapted to generate training data for other dynamic media such as heat shimmer or particulate scattering if the refraction physics generalize.
- Integrating the benchmark sequences into end-to-end training loops might improve generalization of learning-based restorers beyond the four predefined levels.
- The performance gap between classical and diffusion methods on extreme cases suggests that temporal consistency modeling will remain a bottleneck for real-time applications.
Load-bearing premise
The physics-based synthetic sequences reproduce the statistical patterns and temporal dynamics of real severe refractive warping seen in the laboratory data.
What would settle it
A direct side-by-side statistical comparison of displacement-field histograms or power spectra between the synthetic sequences and the real lab captures: close agreement would support the premise, while large, systematic mismatches would break it.
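A minimal sketch of such a check, assuming real and synthetic sequences are available as lists of grayscale uint8 frames; the Farneback flow estimator and the 1-D Wasserstein distance are stand-ins for whichever flow method and divergence the authors would actually choose, and all file handling is omitted.

```python
import cv2
import numpy as np
from scipy.stats import wasserstein_distance

def displacement_magnitudes(frames):
    """Per-pixel flow magnitudes pooled over all consecutive frame pairs."""
    mags = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None,
            pyr_scale=0.5, levels=4, winsize=21,
            iterations=3, poly_n=7, poly_sigma=1.5, flags=0)
        mags.append(np.linalg.norm(flow, axis=2).ravel())
    return np.concatenate(mags)

def fidelity_gap(real_frames, synth_frames):
    """1-D Wasserstein distance between the two displacement-magnitude
    distributions; a large value would expose the systematic mismatch
    the premise above rules out."""
    return wasserstein_distance(
        displacement_magnitudes(real_frames),
        displacement_magnitudes(synth_frames))
```

The same pooled displacements could feed the power-spectrum variant by replacing the histogram comparison with periodograms of per-pixel displacement time series.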
Original abstract
Video sequences captured through refractive dynamic media, such as turbulent air or a water surface, often suffer from severe geometric distortions and temporal instability. While recent advances address mild atmospheric turbulence, no existing benchmark systematically evaluates restoration methods under strong and highly nonuniform refractive conditions. We present a comprehensive benchmark for geometric distortion removal in video, covering the range from turbulence-like mild warping to strong, discontinuous refractive deformations. The benchmark includes both laboratory-captured real data and synthetic sequences generated for static scenes via physics-based light refraction modeling across four distortion levels and multiple surface wave types. We evaluate a spectrum of methods, from simple baselines and classical registration algorithms to advanced learning-based approaches, including DATUM and our proposed diffusion-based V-cache for high and extreme distortion regimes. Evaluation uses both pixel-level (PSNR, SSIM) and perceptual (LPIPS, DINO, CLIP) metrics, providing the first large-scale analysis of geometric distortion removal. Our benchmark establishes a new foundation for developing and evaluating algorithms capable of reconstructing video from highly distorted optical environments. Our code and datasets are available at https://github.com/iafoss/refractive-mfir-benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a unified benchmark for multi-frame image restoration under severe refractive warping caused by dynamic media such as turbulent air or water surfaces. It combines laboratory-captured real data with synthetic sequences generated via physics-based light refraction modeling for static scenes, spanning four distortion levels and multiple surface wave types. The benchmark evaluates a spectrum of methods ranging from simple baselines and classical registration algorithms to advanced learning-based approaches, including DATUM and the authors' proposed diffusion-based V-cache method targeted at high and extreme distortion regimes. Performance is assessed using pixel-level metrics (PSNR, SSIM) and perceptual metrics (LPIPS, DINO, CLIP), with the goal of providing the first large-scale analysis and establishing a foundation for algorithms reconstructing video from highly distorted optical environments. Code and datasets are released publicly.
Significance. If the synthetic data faithfully reproduces the statistics and temporal dynamics of real severe refractive warping, the benchmark would fill a notable gap in the field by enabling systematic evaluation of restoration methods beyond mild turbulence regimes. The inclusion of both real and synthetic data, multi-level distortions, and diverse metrics could support reproducible progress in applications such as underwater vision or atmospheric imaging. However, the significance is currently limited by the absence of explicit validation that the physics-based synthetics match real lab statistics, which is central to the benchmark's claimed coverage of 'highly distorted optical environments.'
Major comments (2)
- [Abstract / Data Generation] The central claim that the benchmark covers 'strong discontinuous refractive deformations' and 'highly distorted optical environments' rests on the assumption that physics-based synthetic sequences across four distortion levels and multiple wave types reproduce the statistics and temporal dynamics of the laboratory real data. No quantitative comparisons (e.g., optical-flow histograms, discontinuity frequency, or temporal correlation spectra) between real and synthetic extreme cases are reported, leaving the fidelity unverified and the benchmark's utility in the claimed regime at risk.
- [Evaluation] The paper reports method rankings and V-cache effectiveness on both real and synthetic data, yet the abstract provides no quantitative results, ablation details, or verification that the synthetic data matches real statistics. Without these, the cross-method comparisons and the claims about performance in severe regimes cannot be fully assessed for robustness.
Minor comments (2)
- [Abstract] The abstract mentions 'our proposed diffusion-based V-cache' but does not define its architecture or training details at a level that allows immediate reproduction; a dedicated methods subsection would improve clarity.
- [Evaluation] Metric choices (PSNR, SSIM, LPIPS, DINO, CLIP) are listed without justification for their suitability to geometric distortion removal; a brief rationale, or a reference to prior use in similar tasks, would strengthen the evaluation design.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comments below and will incorporate revisions to strengthen the manuscript's claims regarding data fidelity and abstract clarity.
Point-by-point responses
Referee: [Abstract / Data Generation] The central claim that the benchmark covers 'strong discontinuous refractive deformations' rests on the assumption that physics-based synthetic sequences reproduce the statistics and temporal dynamics of the laboratory real data. No quantitative comparisons (e.g., optical-flow histograms, discontinuity frequency, or temporal correlation spectra) between real and synthetic extreme cases are reported, leaving the fidelity unverified.
Authors: We agree that explicit quantitative validation of synthetic fidelity to real data statistics is essential to support claims about coverage of severe regimes. The synthetic data is generated via physics-based refraction modeling for static scenes, but we did not report direct statistical matches (such as optical flow histograms or temporal spectra) in the original submission. In revision, we will add these comparisons for extreme distortion levels to verify alignment with lab-captured real data and bolster the benchmark's utility. revision: yes
Referee: [Evaluation] The paper reports method rankings and V-cache effectiveness on both real and synthetic data, yet the abstract provides no quantitative results, ablation details, or verification that synthetic data matches real statistics. Without these, the cross-method comparisons and claims about performance in severe regimes cannot be fully assessed for robustness.
Authors: We note that abstracts are inherently concise and typically omit detailed ablations or full quantitative tables, which appear in the evaluation section. However, to address the concern, we will revise the abstract to include key quantitative highlights (e.g., representative PSNR/SSIM gains for V-cache in high-distortion cases) and reference the added fidelity validations. This maintains standard abstract length while improving transparency on robustness. revision: partial
Circularity Check
No circularity: empirical benchmark with no derivations or self-referential predictions
Full rationale
The paper introduces a benchmark dataset and evaluates restoration methods on real lab-captured and physics-simulated refractive warping sequences using standard pixel and perceptual metrics. No equations, fitted parameters, or predictions are claimed; the central contribution is the release of data and code at the cited GitHub repository. No self-citations are invoked to justify uniqueness theorems or ansatzes, and the work contains no derivation chain that reduces to its own inputs by construction. The skeptic concern about synthetic-to-real distribution match is a validity question, not a circularity issue.
Reference graph
Works this paper leans on
- [1] Nantheera Anantrasirichai, Alin Achim, Nick G Kingsbury, and David R Bull. Atmospheric turbulence mitigation using complex wavelet-based fusion. IEEE Transactions on Image Processing, 22(6):2398–2408, 2013.
- [2] Nantheera Anantrasirichai, Alin Achim, and David Bull. Atmospheric turbulence mitigation for sequences with moving objects using recursive image fusion. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2895–2899. IEEE, 2018.
- [3] Dehao Qin, Ripon Kumar Saha, Woojeh Chung, Suren Jayasuriya, Jinwei Ye, and Nianyi Li. Unsupervised moving object segmentation with atmospheric turbulence. In European Conference on Computer Vision, pages 18–37. Springer, 2024.
- [4] Ye Yuan, Wenhan Yang, Wenqi Ren, Jiaying Liu, Walter J Scheirer, and Zhangyang Wang. UG2+ Track 2: A collective benchmark ... advancing image understanding in poor visibility environments. arXiv preprint arXiv:1904.04474, 2019.
- [5] Nicholas M Law, Craig D Mackay, and John E Baldwin. Lucky imaging: high angular resolution imaging in the visible from the ground. Astronomy & Astrophysics, 446(2):739–745, 2006.
- [6] Xingguang Zhang, Nicholas Chimitt, Yiheng Chi, Zhiyuan Mao, and Stanley H Chan. Spatio-temporal turbulence mitigation: A translational perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2889–2899, 2024.
- [7] Ajay Jaiswal, Xingguang Zhang, Stanley H Chan, and Zhangyang Wang. Physics-driven turbulence image restoration with stochastic refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12170–12181, 2023.
- [8] Simron Thapa, Nianyi Li, and Jinwei Ye. Dynamic fluid surface reconstruction using deep neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21–30, 2020.
- [9] Jerin Geo James, Pranay Agrawal, and Ajit Rajwade. Restoration of non-rigidly distorted underwater images using a combination of compressive sensing and local polynomial image representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7839–7848, 2019.
- [10] Yuandong Tian and Srinivasa G Narasimhan. Seeing through water: Image restoration using model-based tracking. In 2009 IEEE 12th International Conference on Computer Vision, pages 2303–2310. IEEE, 2009.
- [11] Maximilian Kromer, Panagiotis Agrafiotis, and Begüm Demir. Sea-undistort: A dataset for through-water image restoration in high resolution airborne bathymetric mapping. IEEE Geoscience and Remote Sensing Letters, 2025.
- [12] Bijian Jian, Chunbo Ma, Dejian Zhu, Yixiao Sun, and Jun Ao. Seeing through wavy water–air interface: A restoration model for instantaneous images distorted by surface waves. Future Internet, 14(8):236, 2022.
- [13] Bijian Jian, Chunbo Ma, Yixiao Sun, Dejian Zhu, Xu Tian, and Jun Ao. Reconstruction of the instantaneous images distorted by surface waves via Helmholtz–Hodge decomposition. Journal of Marine Science and Engineering, 11(1):164, 2023.
- [14] Yiming Qian, Yinqiang Zheng, Minglun Gong, and Yee-Hong Yang. Simultaneous 3D reconstruction for water surface and underwater scene. In Proceedings of the European Conference on Computer Vision (ECCV), pages 754–770, 2018.
- [15] Zhen Zhang and Xu Yang. Reconstruction of distorted underwater images using robust registration. Optics Express, 27(7):9996–10008, 2019.
- [16] Bijian Jian, Chunbo Ma, Dejian Zhu, Qihong Huang, and Jun Ao. Water-air interface imaging: recovering the images distorted by surface waves via an efficient registration algorithm. Entropy, 24(12):1765, 2022.
- [17] Nianyi Li, Simron Thapa, Cameron Whyte, Albert W Reed, Suren Jayasuriya, and Jinwei Ye. Unsupervised non-rigid image distortion removal via grid deformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2522–2532, 2021.
- [18] Xingguang Zhang, Nicholas Chimitt, Xijun Wang, Yu Yuan, and Stanley H. Chan. Learning phase distortion with selective state space models for video turbulence mitigation.
- [19] Simron Thapa, Nianyi Li, and Jinwei Ye. Learning to remove refractive distortions from underwater images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5007–5016, 2021.
- [20] Tengyue Li, Jiayi Song, Zhiyu Song, Arapat Ablimit, and Long Chen. Removing nonrigid refractive distortions for underwater images using an attention-based deep neural network. Intelligent Marine Technology and Systems, 2(1):25.
- [21] A unified benchmark for multi-frame image restoration under severe refractive warping: code and evaluation framework. https://github.com/iafoss/refractive-mfir-benchmark, 2026. GitHub repository.
- [22] Refractive MFIR benchmark dataset. https://zenodo.org/records/19390086, 2026.
- [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [24] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [26] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [27] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
- [28] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
- [29] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [30] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [31] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [32] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
- [33] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- [34] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [35] Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, and Enrico Magli. DreamCache: Finetuning-free lightweight personalized image generation via feature caching. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12480–12489, 2025.
- [36] Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate Anyone: Consistent and controllable image-to-video synthesis for character animation, 2024.
- [37] Z Wang. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [38] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
- [39] Abhijay Ghildyal, Nabajeet Barman, and Saman Zadtootaghaj. Foundation models boost low-level perceptual similarity metrics. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
- [40] Jerry Tessendorf et al. Simulating ocean water. Simulating Nature: Realistic and Interactive Techniques, SIGGRAPH, 1(2):5, 2001.
- [41] Adrian Constantin and Joachim Escher. Wave breaking for nonlinear nonlocal shallow water equations. Acta Mathematica, 181:229–243, 1998.
Supplementary material (excerpts)
Figure 6 (caption): Data collection setup in the lab with water tank and wave generators.
LAB setup. Figure 6 shows the laboratory data collection setup: a large water tank (20 × 7 × 3 feet) filled with approximately 19 inches of water. A TV monitor was placed above the water to display a set of background images. The camera was set up below the water tank pointing towards the TV. During video recording, a wave generator...
Wave generation. Table 5 provides the parameters used to generate wave profiles. Details of specific wave types are provided below. Ocean wave: for our simulation, we compute the Fast Fourier Transform (FFT) of Gerstner's equations to represent the wave height as a random field over horizontal position and time. The height h(x, t) at the horizontal position x...
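As a rough illustration of this FFT-based synthesis, the sketch below generates one frame of a wave heightfield from a Phillips-like spectrum with the deep-water dispersion relation ω = √(g|k|); the spectrum shape, wind speed, and grid size are illustrative assumptions, not the parameters of the paper's Table 5.

```python
import numpy as np

def ocean_heightfield(n=256, length=10.0, wind_speed=5.0, t=0.0, g=9.81, seed=0):
    """One frame h(x, t) of an FFT-synthesized ocean heightfield."""
    rng = np.random.default_rng(seed)
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)   # angular wavenumbers
    kx, ky = np.meshgrid(k, k)
    k_mag = np.hypot(kx, ky)
    k_mag[0, 0] = 1e-8                                # avoid division by zero
    L = wind_speed ** 2 / g                           # largest wave scale
    phillips = np.exp(-1.0 / (k_mag * L) ** 2) / k_mag ** 4
    phillips[0, 0] = 0.0                              # no DC component
    # Random complex amplitudes, phases advanced by deep-water dispersion.
    h0 = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) \
        * np.sqrt(phillips / 2)
    omega = np.sqrt(g * k_mag)
    return np.real(np.fft.ifft2(h0 * np.exp(1j * omega * t)))
```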
Video generation. The distorted videos are generated by applying a 200-frame-long series of precomputed wave normals to a selected background resized to 512×512. We mimic the lab setup with the camera located underwater and assume a low field of view (parallel rays coming from the camera). The vector form of Snell's law (Eq. 9) is applied to these rays, v1, at t...
Evaluation on the synthetic data. Tables 5–8 provide a full summary of the evaluation on ocean, shallow-water, sine, and ripple waves at low, mid, high, and extreme levels of distortion. Pixel (PSNR, SSIM) and perceptual (LPIPS, DINO, CLIP) metrics are used. The entire-video setup refers to evaluating the metric for each frame in the video and then ...