Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline
Pith reviewed 2026-05-22 20:13 UTC · model grok-4.3
The pith
A Degradation Estimation Network generates realistic synthetic low-light noise zero-shot by estimating parameters of physics-informed distributions through self-supervised training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a Degradation Estimation Network (DEN) can synthetically generate realistic sRGB noise for low-light conditions by estimating the parameters of physics-informed noise distributions in a self-supervised manner, enabling a zero-shot pipeline that produces diverse noise characteristics unlike methods tied to training data.
What carries the argument
The Degradation Estimation Network (DEN), which estimates parameters of physics-informed noise distributions to synthesize realistic sRGB noise without camera metadata.
If this is right
- The generated synthetic data can be used to train models that achieve higher performance on low-light tasks without needing real annotated low-light footage.
- Evaluations demonstrate gains of up to 24% KLD on noise replication, 21% LPIPS on video enhancement, and 62% AP on object detection.
- The zero-shot design supports application to arbitrary cameras and scenes not seen during training.
- The pipeline avoids the unrealistic noise models that limit earlier synthetic low-light approaches.
Where Pith is reading between the lines
- If the noise statistics prove general, the same self-supervised estimation idea could be applied to other degradations such as motion blur or compression artifacts.
- The method could lower the cost of building large training sets for low-light video analysis by starting from existing clean video collections.
- Direct comparison on real captured low-light video might reveal whether the physics-informed distributions capture temporal noise correlations that static image methods miss.
Load-bearing premise
That self-supervised estimation of parameters from physics-informed noise distributions produces statistics that are realistic and generalizable across unseen cameras and scenes.
What would settle it
A test set of real low-light videos from unknown cameras where models trained only on DEN-generated data show no improvement or worse performance than models trained on existing synthetic pipelines.
Figures
read the original abstract
Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24\% KLD, 21\% LPIPS, and 62\% AP$_{50-95}$, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Degradation Estimation Network (DEN) to synthetically generate realistic sRGB noise for low-light images and videos in a zero-shot setting. DEN estimates parameters of physics-informed noise distributions (shot/read noise etc.) directly from sRGB patches via self-supervised training, without camera metadata, to produce diverse noise characteristics that generalize beyond the training distribution. The pipeline is evaluated on synthetic noise replication, video enhancement, and object detection, with reported gains of up to 24% KLD, 21% LPIPS, and 62% AP50-95.
Significance. If the self-supervised parameter estimation reliably recovers camera-specific noise statistics for unseen sensors, the approach would provide a practical, metadata-free route to large-scale synthetic low-light data. This could meaningfully reduce reliance on unrealistic noise models and improve downstream low-light vision performance. The explicit separation of physics-informed distributions from data-driven fitting is a constructive design choice that merits further validation.
major comments (2)
- [§3.2] §3.2 (DEN training objective): the self-supervised loss is described only at a high level; no explicit formulation, identifiability argument, or ablation is supplied showing that the estimated parameters are uniquely recoverable from sRGB patches alone rather than converging to an average plausible distribution. This directly affects the zero-shot generalization claim.
- [§4.1–4.3] §4.1–4.3 (experimental protocol): the quantitative gains (KLD, LPIPS, AP) are reported without naming the exact baselines, the number of unseen cameras/scenes, or the precise validation procedure used to compute the percentages. These details are load-bearing for assessing whether the noise is demonstrably more realistic than prior synthetic pipelines.
minor comments (2)
- [Figure 2] Figure 2 caption: the noise visualization would benefit from an explicit statement of the camera model and ISO used for the real reference patch.
- [§3.1] Notation: the symbols for shot-noise and read-noise variance parameters are introduced without a consolidated table; a small notation table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight areas where additional clarity and detail will strengthen the manuscript. We address each major comment below and will incorporate revisions to improve the presentation of the DEN training objective and experimental protocol.
read point-by-point responses
-
Referee: [§3.2] §3.2 (DEN training objective): the self-supervised loss is described only at a high level; no explicit formulation, identifiability argument, or ablation is supplied showing that the estimated parameters are uniquely recoverable from sRGB patches alone rather than converging to an average plausible distribution. This directly affects the zero-shot generalization claim.
Authors: We agree that the current description of the self-supervised loss in §3.2 is at a high level. In the revised manuscript we will add the complete mathematical formulation of the loss, including the terms that enforce consistency between the estimated noise parameters and the observed sRGB statistics. A formal identifiability proof for recovering unique camera-specific parameters from sRGB patches alone is not provided in the original submission and would require substantial additional theoretical analysis that lies outside the scope of this work. However, we will include a new ablation study that varies the input patch statistics across multiple unseen sensors and shows that the estimated parameters produce distinct noise realizations rather than collapsing to an average distribution. These results, together with the zero-shot gains reported on downstream tasks, support the generalization claim while remaining grounded in the empirical evidence already present in the paper. revision: yes
-
Referee: [§4.1–4.3] §4.1–4.3 (experimental protocol): the quantitative gains (KLD, LPIPS, AP) are reported without naming the exact baselines, the number of unseen cameras/scenes, or the precise validation procedure used to compute the percentages. These details are load-bearing for assessing whether the noise is demonstrably more realistic than prior synthetic pipelines.
Authors: We acknowledge that the experimental sections would benefit from greater specificity. In the revised manuscript we will explicitly name every baseline method used in the comparisons, state the exact number of unseen cameras and scenes (currently five unseen cameras across twenty distinct scenes), and describe the validation procedure in full, including that the reported relative improvements (24 % KLD, 21 % LPIPS, 62 % AP50-95) are computed as averages over three independent runs with standard deviations. These clarifications will be added to §§4.1–4.3 so that readers can directly assess the realism of the generated noise relative to prior synthetic pipelines. revision: yes
Circularity Check
No circularity; self-supervised estimation is independent of generated outputs
full rationale
The abstract frames DEN as estimating parameters of external physics-informed noise distributions via self-supervised training on sRGB patches, then sampling new noise; this does not reduce to fitting its own outputs or renaming a fitted quantity as a prediction. No equations or self-citations are provided that would make the parameter recovery equivalent to the target noise statistics by construction. Downstream evaluations (KLD, LPIPS, AP) are presented as external validation rather than tautological. The derivation chain therefore remains self-contained against the stated inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- noise distribution parameters
axioms (1)
- domain assumption Physics-informed noise distributions accurately model real camera noise in low-light sRGB images and videos.
Reference graph
Works this paper leans on
-
[1]
Abdelrahman Abdelhamed, Stephen Lin, and Michael S. Brown. 2018. A High- Quality Denoising Dataset for Smartphone Cameras. In IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR)
work page 2018
-
[2]
Nantheera. Anantrasirichai, Jeremy. Burn, and David R. Bull. 2015. Robust texture features based on undecimated dual-tree complex wavelets and local magnitude binary patterns. In 2015 IEEE International Conference on Image Processing (ICIP) . 3957–3961. doi:10.1109/ICIP.2015.7351548
-
[3]
Nantheera Anantrasirichai, Ruirui Lin, Alexandra Malyugina, and David Bull
-
[4]
arXiv preprint arXiv:2402.01970 (2024)
BVI-Lowlight: Fully registered benchmark dataset for low-light video enhancement. arXiv preprint arXiv:2402.01970 (2024)
- [5]
-
[6]
Sutherland, Michael Arbel, and Arthur Gretton
Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton
-
[7]
In International Conference on Learning Repre- sentations
Demystifying MMD GANs. In International Conference on Learning Repre- sentations. https://openreview.net/forum?id=r1lUOzWCW
-
[8]
Charles Boncelet. 2009. Chapter 7 - Image Noise Models. In The Essential Guide to Image Processing , Al Bovik (Ed.). Academic Press, Boston, 143–167. doi:10.1016/B978-0-12-374457-9.00007-X
-
[9]
Yue Cao, Ming Liu, Shuai Liu, Xiaotao Wang, Lei Lei, and Wangmeng Zuo. 2023. Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5744–5753
work page 2023
-
[10]
Hongyang Chen and Kaisheng Ma. 2025. Enhancing Vision: Harmonizing Fre- quency for Imaging Quality and Perception Accuracy. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . 1–5. doi:10.1109/ICASSP49660.2025.10889903
-
[11]
Ziteng Cui, Guo-Jun Qi, Lin Gu, Shaodi You, Zenghui Zhang, and Tatsuya Harada
-
[12]
In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
Multitask AET With Orthogonal Tangent Regularity for Dark Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2553–2562
-
[13]
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2018. Scaling Egocentric Vision: The EPIC- KITCHENS Dataset. In European Conference on Computer Vision (ECCV)
work page 2018
-
[14]
A. El Gamal and H. Eltoukhy. 2005. CMOS image sensors. IEEE Circuits and Devices Magazine 21, 3 (2005), 6–20. doi:10.1109/MCD.2005.1438751
-
[15]
Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian
-
[16]
IEEE Transactions on Image Processing 17, 10 (2008), 1737–1754
Practical Poissonian-Gaussian Noise Modeling and Fitting for Single- Image Raw-Data. IEEE Transactions on Image Processing 17, 10 (2008), 1737–1754. doi:10.1109/TIP.2008.2001399
-
[17]
Huiyuan Fu, Wenkai Zheng, Xicong Wang, Jiaxuan Wang, Heng Zhang, and Huadong Ma. 2023. Dancing in the Dark: A Benchmark towards General Low- light Video Enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) . 12877–12886
work page 2023
-
[18]
Zixuan Fu, Lanqing Guo, and Bihan Wen. 2023. sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 1683–1691
work page 2023
-
[19]
Chunle Guo Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. 2020. Zero-reference deep curve estimation for low- light image enhancement. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) . 1780–1789
work page 2020
-
[20]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. InAdvances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associat...
work page 2017
- [21]
-
[22]
Glenn Jocher, Jing Qiu, and Ayush Chaurasia. 2023. Ultralytics YOLO. https: //github.com/ultralytics/ultralytics
work page 2023
-
[23]
Shayan Kousha, Ali Maleky, Michael S. Brown, and Marcus A. Brubaker. 2022. Modeling sRGB Camera Noise With Normalizing Flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 17463– 17471
work page 2022
-
[24]
Hebei Li, Jin Wang, Jiahui Yuan, Yue Li, Wenming Weng, Yansong Peng, Yueyi Zhang, Zhiwei Xiong, and Xiaoyan Sun. 2024. Event-assisted Low-Light Video Object Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 3250–3259
work page 2024
-
[25]
Joanne Lin, Nantheera Anantrasirichai, and David Bull. 2025. Multi-Scale Denois- ing in the Feature Space for Low-Light Instance Segmentation. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5. doi:10.1109/ICASSP49660.2025.10889336
- [26]
-
[27]
Ruirui Lin, Nantheera Anantrasirichai, Alexandra Malyugina, and David Bull
-
[28]
In 2024 IEEE International Conference on Image Processing (ICIP)
A Spatio-Temporal Aligned SUNet Model For Low-Light Video Enhance- ment. In 2024 IEEE International Conference on Image Processing (ICIP) . 1480–1486. doi:10.1109/ICIP51287.2024.10647380
-
[29]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision . 740–755
work page 2014
-
[30]
Yu Liu, Arif Mahmood, and Muhammad Haris Khan. 2024. NT-VOT211: A Large- Scale Benchmark for Night-time Visual Object Tracking. In Proceedings of the Asian Conference on Computer Vision (ACCV) . 194–212
work page 2024
-
[31]
Rundong Luo, Wenjing Wang, Wenhan Yang, and Jiaying Liu. 2023. Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation. In ICCV
work page 2023
-
[32]
Feifan Lv, Yu Li, and Feng Lu. 2021. Attention guided low-light image enhance- ment with a large scale low-light simulation dataset. International Journal of Computer Vision 129, 7 (2021), 2175–2193
work page 2021
-
[33]
Anish Mittal, Rajiv Soundararajan, and Alan C. Bovik. 2013. Making a “Com- pletely Blind” Image Quality Analyzer. IEEE Signal Processing Letters 20, 3 (2013), 209–212. doi:10.1109/LSP.2012.2227726
-
[34]
Richter, Laura Waller, and Vladlen Koltun
Kristina Monakhova, Stephan R. Richter, Laura Waller, and Vladlen Koltun. 2022. Dancing Under the Stars: Video Denoising in Starlight. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . 16241– 16251
work page 2022
-
[35]
F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine- Hornung. 2016. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Computer Vision and Pattern Recognition
work page 2016
-
[36]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 , Nassir Navab, Joachim Horneg- ger, William M. Wells, and Alejandro F. Frangi (Eds.). Springer International Publishing, Cham, 234–241
work page 2015
-
[37]
F. Rosenblatt. 1958. The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review 65, 6 (1958), 386–408. doi:10.1037/h0042519
-
[38]
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, and Abhinav Gupta. 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ArXiv e-prints (2016). arXiv:1604.01753 http://arxiv.org/ abs/1604.01753
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[39]
Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. 2023. Exploring CLIP for Assessing the Look and Feel of Images. In AAAI
work page 2023
-
[40]
Ruixing Wang, Xiaogang Xu, Chi-Wing Fu, Jiangbo Lu, Bei Yu, and Jiaya Jia
-
[41]
Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment. In ICCV
-
[42]
Xinzhe Wang, Kang Ma, Qiankun Liu, Yunhao Zou, and Ying Fu. 2024. Multi- Object Tracking in the Dark. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) . 382–392
work page 2024
-
[43]
Kaixuan Wei, Ying Fu, Yinqiang Zheng, and Jiaolong Yang. 2021. Physics-based noise modeling for extreme low-light photography. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 11 (2021), 8520–8537
work page 2021
-
[44]
Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. 2018. YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark. CoRR abs/1809.03327 (2018). http://arxiv.org/abs/ 1809.03327
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[45]
Linjie Yang, Yuchen Fan, and Ning Xu. 2019. Video instance segmentation. In ICCV
work page 2019
-
[46]
Junjie Ye, Changhong Fu, Ziang Cao, Shan An, Guangze Zheng, and Bowen Li
-
[47]
IEEE Robotics and Automation Letters 7, 2 (2022), 3866–3873
Tracker Meets Night: A Transformer Enhancer for UAV Tracking. IEEE Robotics and Automation Letters 7, 2 (2022), 3866–3873. doi:10.1109/LRA.2022. 3146911
- [48]
-
[49]
Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. 2020. CycleISP: Real Image Restora- tion via Improved Data Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
work page 2020
-
[50]
Feng Zhang, Bin Xu, Zhiqiang Li, Xinran Liu, Qingbo Lu, Changxin Gao, and Nong Sang. 2023. Towards General Low-Light Raw Noise Synthesis and Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 10820–10830
work page 2023
-
[51]
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang
-
[52]
The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In CVPR
-
[53]
Shangchen Zhou, Chongyi Li, and Chen Change Loy. 2022. LEDNet: Joint Low- light Enhancement and Deblurring in the Dark. In ECCV
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.