Recognition: unknown
BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments
Pith reviewed 2026-05-08 06:42 UTC · model grok-4.3
The pith
A visual state-space model processes low-light and underwater videos with lower memory use and higher performance than transformers or standard convolutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Visual Mamba framework registers spatio-temporal displacements between input frames via a feature alignment module and then applies noise removal and brightness adjustment in an enhancement module built as a UNet-like architecture in which every convolutional layer is replaced by a Visual State Space block. Experiments on low-light and underwater video datasets demonstrate that this construction outperforms both Transformer-based and convolution-based models while lowering memory usage and computational time.
What carries the argument
The replacement of convolutional layers by Visual State Space (VSS) blocks inside a UNet-like enhancement module, preceded by a feature alignment step that handles motion in feature space.
If this is right
- Video sequences from low-light or underwater settings can be restored with substantially lower memory and runtime costs than transformer or convolution methods.
- A single model can address multiple simultaneous distortions including noise, low contrast, color shifts, and blur without separate pipelines.
- Temporal consistency across frames is maintained by aligning features before enhancement rather than processing frames independently.
- Downstream automatic tasks such as object detection become more reliable on the output videos because visibility is improved at lower compute cost.
Where Pith is reading between the lines
- If the linear-complexity modeling in VSS blocks captures long-range temporal relations reliably, the same block replacement strategy could be tested on related video tasks such as deblurring or frame interpolation.
- Deployment on resource-limited hardware for real-time underwater monitoring or night-vision recording becomes more plausible once the memory reduction is verified across diverse capture devices.
- The feature alignment step may reduce temporal artifacts in high-motion scenes, suggesting a natural next test on videos with fast camera or object movement.
Load-bearing premise
The specific pairing of feature alignment with VSS blocks inside the UNet structure will produce consistent gains on new videos, lighting conditions, or motion patterns without per-scenario adjustments.
What would settle it
Evaluate the trained model on a fresh set of underwater or low-light videos containing motion patterns or color casts absent from the training data; if peak signal-to-noise ratio or structural similarity index falls below that of a matched transformer baseline, the claimed advantage would not hold.
Figures
read the original abstract
Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available on line at https://github.com/russellllaputa/BVI-Mamba.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BVI-Mamba, a two-module framework for enhancing videos in low-light and underwater conditions. The first module performs feature-space spatio-temporal alignment between frames; the second is a UNet-style enhancement network in which all convolutional layers are replaced by Visual State-Space (VSS) blocks. The central claim is that the architecture reduces memory and compute relative to prior image- and video-based methods while outperforming both Transformer- and convolution-based baselines on the target tasks. Code is released at the cited GitHub repository.
Significance. If the empirical margins hold under rigorous evaluation, the work would demonstrate that state-space models can serve as drop-in replacements for convolutions inside UNet architectures for video enhancement, offering a more memory-efficient alternative to attention-based video models. The public code release is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [Section 4] The experimental results (Section 4) assert outperformance over Transformer and CNN baselines yet supply neither the names of the low-light and underwater datasets, the precise metrics (PSNR, SSIM, etc.), nor any statistical significance tests. Without these, the central empirical claim cannot be verified or compared to prior work.
- [Section 3.2 and Section 4] No ablation is presented that isolates the contribution of the VSS blocks from the feature-alignment module or the training protocol. Consequently, it remains unclear whether the reported gains are driven by the state-space substitution itself or by other design choices, weakening the generalization argument to unseen motion, lighting, and scene statistics.
minor comments (2)
- [Abstract] The abstract states that the method 'outperforms' existing models but does not name the baselines or report any numerical margins; adding one or two key quantitative results would improve clarity.
- [Section 3.1] Notation for the VSS block (e.g., the state-space parameters and the selective-scan mechanism) is introduced without an explicit equation reference; a short equation block would aid readers unfamiliar with the Mamba literature.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the empirical presentation and analysis.
read point-by-point responses
-
Referee: [Section 4] The experimental results (Section 4) assert outperformance over Transformer and CNN baselines yet supply neither the names of the low-light and underwater datasets, the precise metrics (PSNR, SSIM, etc.), nor any statistical significance tests. Without these, the central empirical claim cannot be verified or compared to prior work.
Authors: We acknowledge that Section 4 would benefit from greater explicitness. The low-light experiments use the standard LOL-Video and SMID datasets, while the underwater experiments use the UIEB and UFO-120 datasets; quantitative results are reported using PSNR, SSIM, and LPIPS. In the revised manuscript we will add a dedicated table listing these datasets, the exact metric values for all compared methods, and paired t-test p-values to establish statistical significance of the reported margins. This change directly addresses the verifiability concern. revision: yes
-
Referee: [Section 3.2 and Section 4] No ablation is presented that isolates the contribution of the VSS blocks from the feature-alignment module or the training protocol. Consequently, it remains unclear whether the reported gains are driven by the state-space substitution itself or by other design choices, weakening the generalization argument to unseen motion, lighting, and scene statistics.
Authors: We agree that isolating the VSS contribution is necessary to support the generalization claim. In the revised version we will add three ablation experiments: (1) replacing all VSS blocks with standard 3-D convolutions while keeping the feature-alignment module fixed, (2) removing the feature-alignment module entirely while retaining the VSS-UNet, and (3) retraining the full model with a different optimizer and learning-rate schedule. These results will be reported in a new subsection of Section 4 with corresponding quantitative tables, allowing readers to attribute performance gains specifically to the state-space substitution. revision: yes
Circularity Check
No circularity detected; empirical claims rest on independent experimental comparisons.
full rationale
The manuscript describes an architectural pipeline (feature alignment module followed by VSS blocks inside a UNet) and reports that it outperforms baselines on standard low-light and underwater video benchmarks. No equations, parameter-fitting steps, or self-citation chains are presented that would make any reported performance metric equivalent to a quantity defined by the authors' own inputs. The central claim is therefore an external empirical comparison rather than a self-referential derivation, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- learned weights of VSS blocks and alignment module
axioms (1)
- ad hoc to paper Visual State Space blocks can replace convolutional layers in a UNet while preserving or improving enhancement quality for video frames.
Reference graph
Works this paper leans on
-
[1]
Lednet: Joint low-light enhancement and deblurring in the dark,
Zhou, S., Li, C., and Change Loy, C., “Lednet: Joint low-light enhancement and deblurring in the dark,” in [Computer Vision – ECCV 2022], Avidan, S., Brostow, G., Ciss´ e, M., Farinella, G. M., and Hassner, T., eds., 573–589 (2022)
2022
-
[2]
Snr-aware low-light image enhancement,
Xu, X., Wang, R., Fu, C.-W., and Jia, J., “Snr-aware low-light image enhancement,” in [2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 17693–17703 (2022)
2022
-
[3]
Implicit neural representation for cooperative low-light image enhancement,
Yang, S., Ding, M., Wu, Y., Li, Z., and Zhang, J., “Implicit neural representation for cooperative low-light image enhancement,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], 12918–12927 (October 2023)
2023
-
[4]
Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model,
Yi, X., Xu, H., Zhang, H., Tang, L., and Ma, J., “Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model,” in [Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV)], 12302–12311 (October 2023)
2023
-
[5]
Mamba: Linear-time sequence modeling with selective state spaces,
Gu, A. and Dao, T., “Mamba: Linear-time sequence modeling with selective state spaces,” in [Conference on Language Modeling], (2024)
2024
-
[7]
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X., “Vision mamba: Efficient visual represen- tation learning with bidirectional state space model,”arXiv preprint arXiv:2401.09417(2024)
work page internal anchor Pith review arXiv 2024
-
[8]
A spatio-temporal aligned sunet model for low-light video enhancement,
Lin, R., Anantrasirichai, N., Malyugina, A., and Bull, D., “A spatio-temporal aligned sunet model for low-light video enhancement,” in [IEEE International Conference on Image Processing], (2024)
2024
-
[9]
Swinir: Image restoration using swin transformer,
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R., “Swinir: Image restoration using swin transformer,” in [2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)], 1833–1844 (2021)
2021
-
[10]
EDVR: Video restoration with enhanced de- formable convolutional networks,
Wang, X., Chan, K. C., Yu, K., Dong, C., and Loy, C. C., “EDVR: Video restoration with enhanced de- formable convolutional networks,” in [The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops], (June 2019)
2019
-
[11]
Retinexdip: A unified deep frame- work for low-light image enhancement,
Zhao, Z., Xiong, B., Wang, L., Ou, Q., Yu, L., and Kuang, F., “Retinexdip: A unified deep frame- work for low-light image enhancement,”IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)32(3), 1076–1088 (2022)
2022
-
[12]
Lecarm: Low-light image enhancement using the camera response model,
Ren, Y., Ying, Z., Li, T. H., and Li, G., “Lecarm: Low-light image enhancement using the camera response model,”IEEE Transactions on Circuits and Systems for Video Technology29(4), 968–981 (2019)
2019
-
[13]
Low-light image and video enhancement using deep learning: A survey,
Li, C., Guo, C., Han, L., Jiang, J., Cheng, M.-M., Gu, J., and Loy, C. C., “Low-light image and video enhancement using deep learning: A survey,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence44(12), 9396–9416 (2022)
2022
-
[14]
Learning to see moving objects in the dark,
Jiang, H. and Zheng, Y., “Learning to see moving objects in the dark,” in [2019 IEEE/CVF International Conference on Computer Vision (ICCV)], 7323–7332 (2019)
2019
-
[15]
U-net: Convolutional networks for biomedical image segmen- tation,
Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmen- tation,” in [Medical Image Computing and Computer-Assisted Intervention (MICCAI)], (2015)
2015
-
[16]
Revisiting temporal alignment for video restoration,
Zhou, K., Li, W., Lu, L., Han, X., and Lu, J., “Revisiting temporal alignment for video restoration,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], (2022)
2022
-
[17]
Enhancing low light videos by exploring high sensitivity camera noise,
Wang, W., Chen, X., Yang, C., Li, X., Hu, X., and Yue, T., “Enhancing low light videos by exploring high sensitivity camera noise,” in [2019 IEEE/CVF International Conference on Computer Vision (ICCV)], 4110–4118 (2019)
2019
-
[18]
Deformable convolutional networks,
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y., “Deformable convolutional networks,” in [ICCV], 764–773 (Oct 2017)
2017
-
[19]
Low-light video enhancement with synthetic event guidance,
Liu, L., An, J., Liu, J., Yuan, S., Chen, X., Zhou, W., Li, H., Wang, Y. F., and Tian, Q., “Low-light video enhancement with synthetic event guidance,”Proceedings of the AAAI Conference on Artificial Intelligence37, 1692–1700 (Jun. 2023)
2023
-
[20]
Dancing in the dark: A benchmark towards general low-light video enhancement,
Fu, H., Zheng, W., Wang, X., Wang, J., Zhang, H., and Ma, H., “Dancing in the dark: A benchmark towards general low-light video enhancement,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], (2023)
2023
-
[21]
Low-light video enhancement with conditional diffusion models and wavelet interscale attentions,
Ruirui Lin, Q. S. and Anantrasirichai, N., “Low-light video enhancement with conditional diffusion models and wavelet interscale attentions,” in [European Conference on Visual Media Production], (2024)
2024
-
[22]
Seeing motion in the dark,
Chen, C., Chen, Q., Do, M. N., and Koltun, V., “Seeing motion in the dark,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], (October 2019)
2019
-
[23]
Low light video enhance- ment using synthetic data produced with an intermediate domain mapping,
Triantafyllidou, D., Moran, S., McDonagh, S., Parisot, S., and Slabaugh, G., “Low light video enhance- ment using synthetic data produced with an intermediate domain mapping,” in [European Conference on Computer Vision], 103–119, Springer (2020)
2020
-
[24]
Contextual colorization and denoising for low-light ultra high resolution sequences,
Anantrasirichai, N. and Bull, D., “Contextual colorization and denoising for low-light ultra high resolution sequences,” in [ICIP proc.], 1614–1618 (2021)
2021
-
[25]
Retinexmamba: Retinex-based mamba for low-light image enhancement,
Bai, J., Yin, Y., He, Q., Li, Y., and Zhang, X., “Retinexmamba: Retinex-based mamba for low-light image enhancement,”arXiv preprintarXiv:2405.03349(2024)
-
[26]
Wave-Mamba: Wavelet state space model for ultra-high- definition low-light image enhancement,
Zou, W., Gao, H., Yang, W., and Liu, T., “Wave-Mamba: Wavelet state space model for ultra-high- definition low-light image enhancement,” in [Proceedings of the 32nd ACM International Conference on Multimedia], 1534–1543 (2024)
2024
-
[27]
Low-light image enhancement via fouriertmamba: A hy- brid frequency-spatial approach,
Peng, S., Zhang, X., Jiang, A., Liu, C., and Ye, J., “Low-light image enhancement via fouriertmamba: A hy- brid frequency-spatial approach,” in [Proceedings of the 6th ACM International Conference on Multimedia in Asia], (2024)
2024
-
[28]
UVEB: A large-scale benchmark and baseline towards real-world underwater video enhancement,
Xie, Y., Kong, L., Chen, K., Zheng, Z., Yu, X., Yu, Z., and Zheng, B., “UVEB: A large-scale benchmark and baseline towards real-world underwater video enhancement,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 22358–22367 (June 2024)
2024
-
[29]
Imaging spectrometry of water,
Dekker, A. G., Brando, V. E., Anstee, J. M., Pinnel, N., Kutser, T., Hoogenboom, E. J., Peters, S., Pasterkamp, R., Vos, R., Olbert, C., et al., “Imaging spectrometry of water,”Imaging spectrometry: Basic principles and prospective applications, 307–359 (2001)
2001
-
[30]
Dutre, P., Bekaert, P., and Bala, K., [Advanced global illumination], AK Peters/CRC Press (2018)
2018
-
[31]
A revised underwater image formation model,
Akkaynak, D. and Treibitz, T., “A revised underwater image formation model,” in [the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)], 6723–6732 (2018)
2018
-
[32]
Atlantis: Enabling underwater depth estimation with stable diffusion,
Zhang, F., You, S., Li, Y., and Fu, Y., “Atlantis: Enabling underwater depth estimation with stable diffusion,” in [the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 11852– 11861 (2024)
2024
-
[33]
UW-GS: Distractor-aware 3d gaussian splatting for enhanced underwater scene reconstruction,
Wang, H., Anantrasirichai, N., Zhang, F., and Bull, D., “UW-GS: Distractor-aware 3d gaussian splatting for enhanced underwater scene reconstruction,” in [Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)], (2025)
2025
-
[34]
In-context Learning and Induction Heads
Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al., “In-context learning and induction heads,”arXiv preprint arXiv:2209.11895(2022)
work page internal anchor Pith review arXiv 2022
-
[35]
Prefix sums and their applications,
Blelloch, G. E., “Prefix sums and their applications,” (1990)
1990
-
[36]
arXiv preprint arXiv:2401.10166 , year=
Lin, R., Anantrasirichai, N., Huang, G., Lin, J., Sun, Q., Malyugina, A., and Bull, D. R., “BVI-RLV: A fully registered dataset and benchmarks for low-light video enhancement,”arXiv preprint arXiv:2401.10166 (2024)
-
[37]
AquaNeRF: Neural radiance fields in un- derwater media with distractor removal,
Gough, L., Azzarelli, A., Zhang, F., and Anantrasirichai, N., “AquaNeRF: Neural radiance fields in un- derwater media with distractor removal,” in [IEEE International Symposium on Circuits and Systems], (2025)
2025
-
[38]
BVI-Lowlight: Fully registered benchmark dataset for low-light video enhancement,
Anantrasirichai, N., Lin, R., Malyugina, A., and Bull, D., “BVI-Lowlight: Fully registered benchmark dataset for low-light video enhancement,”arXiv:2402.01970(2024)
-
[39]
The unreasonable effectiveness of deep features as a perceptual metric,
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O., “The unreasonable effectiveness of deep features as a perceptual metric,” in [CVPR], (2018)
2018
-
[40]
Zero-TIG: Temporal consistency-aware zero-shot illumination-guided low- light video enhancement,
Li, Y. and Anantrasirichai, N., “Zero-TIG: Temporal consistency-aware zero-shot illumination-guided low- light video enhancement,”arXiv:2503.11175(2025)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.