
arxiv: 2407.03535 · v3 · pith:YYMGLY5Vnew · submitted 2024-07-03 · 💻 cs.CV

BVI-RLV: A Fully Registered Dataset for Low-Light Video Enhancement

Pith reviewed 2026-05-25 08:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords low-light video enhancement · registered dataset · paired frames · sub-pixel registration · supervised learning · video denoising · motion capture · deep learning datasets

The pith

BVI-RLV supplies over 30k sub-pixel registered low-light to normal-light video frame pairs from 40 scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BVI-RLV as a dataset of paired low-light and normal-light videos that avoids the misalignment common in prior collections. It uses a motorized dolly plus image refinement to align frames at sub-pixel accuracy across dynamic motions and full HD resolution. Experiments establish that training enhancement models on these registered pairs produces higher quality outputs than training on misaligned data from the same scenes. The dataset also yields models that generalize better than those from existing collections when tested across datasets and in outdoor scenes. Baselines are supplied for CNN, Transformer, Mamba, and diffusion architectures.

Core claim

BVI-RLV comprises over 30k paired frames from 40 diverse scenes captured under two low-light conditions and aligned to normal-light ground truth. Sub-pixel registration holds for 99.24 percent of the full-HD data through motorized dolly motion combined with image-based refinement, while covering varied motion types and realistic temporal noise. Registration proves essential for supervised learning, delivering up to 5.85 dB PSNR gains over unregistered training, and models trained on the dataset outperform those from prior collections in cross-dataset tests, including real-world outdoor scenes.

What carries the argument

Motorized dolly movement combined with image-based refinement to produce sub-pixel accurate alignment between low-light and normal-light video frames.
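As a rough illustration of what image-based refinement can look like, the sketch below estimates a residual translation between two frames with phase correlation and a parabolic peak fit. This is a swapped-in, generic technique: the page does not say which refinement algorithm the authors use, and the function name `estimate_shift` is hypothetical. Real low-light and normal-light pairs would also need brightness normalization first, which is omitted here.

```python
import numpy as np

def estimate_shift(reference, moved, eps=1e-12):
    """Estimate the (dy, dx) translation of `moved` relative to `reference`.

    Phase correlation locates the integer peak; a parabola through the peak
    and its two neighbours along each axis adds a fractional correction.
    Illustrative only, not the paper's refinement step.
    """
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moved)
    cross = np.conj(f_ref) * f_mov
    # Keep only the phase so the inverse transform peaks at the shift.
    surface = np.fft.ifft2(cross / (np.abs(cross) + eps)).real

    peak = np.unravel_index(np.argmax(surface), surface.shape)
    centre = surface[peak]
    shift = []
    for axis, size in enumerate(surface.shape):
        idx = list(peak)
        idx[axis] = (peak[axis] - 1) % size
        before = surface[tuple(idx)]
        idx[axis] = (peak[axis] + 1) % size
        after = surface[tuple(idx)]
        denom = before - 2.0 * centre + after
        frac = 0.0 if denom == 0 else 0.5 * (before - after) / denom
        pos = peak[axis] + frac
        # Map wrapped positions to signed shifts.
        if pos > size / 2:
            pos -= size
        shift.append(float(pos))
    return tuple(shift)

# Synthetic check: circularly shift a random frame by a known amount.
rng = np.random.default_rng(0)
frame = rng.standard_normal((64, 64))
shifted = np.roll(frame, shift=(3, -5), axis=(0, 1))
dy, dx = estimate_shift(frame, shifted)
```

The parabolic fit is only an approximation for genuinely fractional shifts; a production pipeline would more likely use upsampled cross-correlation or feature matching.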

If this is right

  • Training enhancement networks on the registered pairs raises PSNR by as much as 5.85 dB relative to unregistered versions of the same data.
  • Models trained on BVI-RLV exceed the cross-dataset performance of models trained on existing low-light collections.
  • The dataset supports training that generalizes to real-world outdoor low-light video.
  • Baseline results for CNN, Transformer, Mamba, and diffusion models become available for direct comparison.
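To put the 5.85 dB figure in context, the sketch below computes PSNR from its standard definition and converts a dB gain into the equivalent reduction in mean squared error. The function names are hypothetical and the toy arrays are not data from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_from_db_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

# Toy example: a constant error of 0.1 on a [0, 1] image.
toy_psnr = psnr(np.zeros((4, 4)), np.full((4, 4), 0.1))

# MSE reduction implied by the reported best-case gain.
ratio = mse_ratio_from_db_gain(5.85)
```

Under this definition, a 5.85 dB gain corresponds to roughly a 3.8-fold lower mean squared error at the same peak value.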

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Precise frame alignment may enable models to exploit temporal correlations more effectively than misalignment permits.
  • The capture method could be adapted to other video tasks that require exact low-light to reference pairing.
  • Public release of the paired sequences may allow researchers to test whether registration quality correlates with gains in temporal consistency metrics.
  • Superior outdoor performance hints that the dataset captures noise statistics closer to uncontrolled environments than ND-filter approaches.

Load-bearing premise

The dolly motion and refinement process yields alignments that stay sub-pixel accurate and artifact-free across all scenes without introducing systematic biases into downstream model training.
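The distinction between sub-pixel accuracy and systematic bias can be made concrete with a small sketch. It assumes per-frame residual displacements are available as (dy, dx) pairs in pixels and takes sub-pixel to mean a Euclidean residual below one pixel, following the threshold quoted in the simulated rebuttal further down this page. The Euclidean norm, the function name, and the sample residuals are all assumptions for illustration.

```python
import numpy as np

def registration_summary(residuals_px, threshold_px=1.0):
    """Summarise per-frame residual displacements given as (dy, dx) in pixels.

    Returns the share of frames under the threshold and the mean residual
    vector; a mean far from zero suggests a consistent directional bias.
    """
    residuals = np.asarray(residuals_px, dtype=np.float64)
    magnitudes = np.linalg.norm(residuals, axis=1)
    return {
        "subpixel_fraction": float(np.mean(magnitudes < threshold_px)),
        "mean_residual": tuple(float(v) for v in residuals.mean(axis=0)),
    }

# Made-up residuals: every frame is sub-pixel, yet all err the same way.
biased = registration_summary([[0.6, 0.6]] * 4)

# Made-up residuals: one frame out of four exceeds the threshold.
mixed = registration_summary([[0.3, 0.4], [0.3, 0.4], [0.3, 0.4], [1.2, 0.0]])
```

In the first case the sub-pixel fraction is 1.0 even though every frame is offset by the same (0.6, 0.6) vector, which is the kind of bias a pass-rate percentage alone cannot reveal.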

What would settle it

A direct comparison of model performance when trained on BVI-RLV pairs versus the same scenes captured with handheld or static-camera methods that lack the dolly alignment step.

Figures

Figures reproduced from arXiv: 2407.03535 by Alexandra Malyugina, David R Bull, Guoxi Huang, Joanne Lin, Nantheera Anantrasirichai, Qi Sun, Ruirui Lin.

Figure 1: Scene examples with varying light levels and different motion profiles as shown in x-t plane.
Figure 2: Models trained on BVI-RLV generate higher results compared to the other three LLVE
Figure 3: (Left) Scene setting showing the camera in ‘angle’ position, mounted on CineDrive system.
Figure 4: Cropped images (Lego and Kitchen scenes at 350
Figure 5: Main architectural components of the four different benchmarking methods, i.e., PCDUNet,
Figure 6: Subjective results of the BVI-CDM model trained on different datasets, and tested on
Figure 7: Subjective results of the STA-SUNet model trained on different datasets, and tested on the

Original abstract

Low-light videos often exhibit spatiotemporally incoherent noise, compromising visibility and degrading performance in computer vision applications. A major challenge for enhancing such content using deep learning lies in the scarcity of pixel-aligned, high-quality training data. We introduce BVI-RLV, a fully registered low-light video dataset comprising over 30k paired frames from 40 diverse scenes under two low-light conditions, each aligned with normal-light ground truth. Unlike existing datasets that rely on neutral density (ND) filters or suffer from misalignment issues, BVI-RLV achieves sub-pixel registration for 99.24% of data at full HD resolution across dynamic motion scenarios using a motorized dolly and image-based refinement. The dataset covers a wide range of motion types and realistic temporal noise. We also provide baseline implementations using four representative architectures: Convolutional Neural Network (CNN), Transformer, State Space Model (Mamba), and Diffusion Model (DM). Experiments demonstrate that registration is crucial for supervised learning, yielding up to 5.85 dB PSNR improvement compared to unregistered training. Models trained on BVI-RLV outperform those trained on existing datasets in cross-dataset evaluations, achieving superior performance even in real-world outdoor scenes. Our dataset is publicly available at https://doi.org/10.21227/mzny-8c77.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents BVI-RLV, a new dataset of >30k paired low-light/normal-light video frames across 40 scenes captured with a motorized dolly plus image-based refinement, claiming 99.24% sub-pixel registration at full HD. It supplies baseline results for CNN, Transformer, Mamba, and diffusion models, reports up to 5.85 dB PSNR gain from using registered versus unregistered pairs, and shows superior cross-dataset generalization including on real outdoor scenes.

Significance. If the registration accuracy and lack of systematic bias hold, the dataset would fill a documented gap in aligned low-light video data and enable more reliable supervised training; the public release and multi-architecture baselines are concrete strengths that would support reproducibility and further work in the area.

major comments (2)
  1. [Abstract / registration section] Abstract and registration-method description: the headline 99.24% sub-pixel registration figure is presented without an independent validation metric (e.g., residual error against fiducial markers, multi-view consistency, or external reference alignment); if the percentage is derived solely from internal convergence of the image-based refinement step, it cannot rule out consistent sub-pixel biases that would affect downstream supervised training and the reported 5.85 dB gain.
  2. [Experiments / ablation studies] Experiments section (cross-dataset and registration-ablation results): the claim that registration is “crucial” and yields up to 5.85 dB improvement requires explicit confirmation that the unregistered training baseline used identical data volume, augmentation, optimizer schedule, and convergence criteria; without those controls the PSNR delta cannot be attributed solely to alignment quality.
minor comments (1)
  1. [Dataset description] The motion-type taxonomy and noise-characterization details would benefit from an explicit table or figure summarizing the 40 scenes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

  1. Referee: [Abstract / registration section] Abstract and registration-method description: the headline 99.24% sub-pixel registration figure is presented without an independent validation metric (e.g., residual error against fiducial markers, multi-view consistency, or external reference alignment); if the percentage is derived solely from internal convergence of the image-based refinement step, it cannot rule out consistent sub-pixel biases that would affect downstream supervised training and the reported 5.85 dB gain.

    Authors: The 99.24% figure is computed from the residual displacement after the motorized dolly plus image-based refinement, with sub-pixel defined as <1 pixel error via feature matching. We agree this is an internal metric and does not provide fully independent validation (e.g., fiducial markers). In revision we will explicitly detail the computation, add discussion of possible systematic biases, and include multi-frame consistency checks from static scenes as supporting evidence. revision: yes

  2. Referee: [Experiments / ablation studies] Experiments section (cross-dataset and registration-ablation results): the claim that registration is “crucial” and yields up to 5.85 dB improvement requires explicit confirmation that the unregistered training baseline used identical data volume, augmentation, optimizer schedule, and convergence criteria; without those controls the PSNR delta cannot be attributed solely to alignment quality.

    Authors: The unregistered baseline used identical data volume, augmentations, optimizer, schedule, and convergence criteria; the sole difference was pair alignment. We will revise the experiments section to state these controls explicitly. revision: yes
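One way to make such controls checkable is to express both training runs as configurations and assert that they differ in a single field. The sketch below is hypothetical throughout: the `TrainConfig` fields and every value in it are placeholders, not settings reported by the paper.

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class TrainConfig:
    # All values passed below are illustrative placeholders.
    num_pairs: int
    augmentation: str
    optimizer: str
    learning_rate: float
    max_iterations: int
    registered: bool

def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])

registered_run = TrainConfig(
    num_pairs=1000,
    augmentation="flip+crop",
    optimizer="adam",
    learning_rate=1e-4,
    max_iterations=10000,
    registered=True,
)
# The ablation baseline copies everything except the alignment flag.
unregistered_run = replace(registered_run, registered=False)

changed = differing_fields(registered_run, unregistered_run)
```

Publishing a check of this kind alongside the ablation would let readers confirm that alignment is the only variable that changed.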

Circularity Check

0 steps flagged

No circularity: empirical dataset contribution with no derivation chain


The paper presents an empirical dataset and baseline experiments rather than any mathematical derivation or fitted-parameter prediction. The sub-pixel registration claim is a reported measurement from the data collection process (motorized dolly + refinement), not a quantity derived from or fitted to the downstream PSNR results. Cross-dataset evaluations are standard empirical comparisons with no self-referential reduction. No equations, ansatzes, or uniqueness theorems are invoked that collapse to the paper's own inputs. This is a self-contained dataset paper; the reader's circularity score of 1.0 is consistent with the absence of load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset creation and benchmarking paper with no mathematical derivations, free parameters, or new postulated entities; it relies on standard computer vision practices for registration and supervised training.

pith-pipeline@v0.9.0 · 5791 in / 1302 out tokens · 37690 ms · 2026-05-25T08:48:19.556838+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline

    cs.CV 2025-04 unverdicted novelty 6.0

    A self-supervised Degradation Estimation Network estimates parameters for physics-informed noise distributions to generate realistic synthetic low-light data, showing gains on noise replication, enhancement, and detec...
