arxiv: 2604.23655 · v1 · submitted 2026-04-26 · 💻 cs.CV

Recognition: unknown

BVI-Mamba: Video Enhancement Using a Visual State-Space Model for Low-Light and Underwater Environments

Guoxi Huang , Ruirui Lin , Yini Li , David R. Bull , Nantheera Anantrasirichai

Authors on Pith no claims yet

Pith reviewed 2026-05-08 06:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords video enhancementlow-light enhancementunderwater imagingstate space modelMamba architectureUNetfeature alignmentnoise removal

0 comments

The pith

A visual state-space model processes low-light and underwater videos with lower memory use and higher performance than transformers or standard convolutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework called Visual Mamba that tackles distortions such as noise, low contrast, color imbalance, and blur in videos captured under low-light or underwater conditions. It does so through a feature alignment module that registers motion between frames in feature space followed by an enhancement module that uses a UNet-like structure with all convolutional layers replaced by Visual State Space blocks. The design targets reduced memory consumption and faster computation compared to transformer or convolution-heavy alternatives. A reader would care because these environments produce footage that often requires post-processing for visibility or downstream tasks like detection, and current AI methods are resource-intensive for video. If the approach holds, state-space modeling could make efficient video restoration more accessible for practical use.

Core claim

The Visual Mamba framework registers spatio-temporal displacements between input frames via a feature alignment module and then applies noise removal and brightness adjustment in an enhancement module built as a UNet-like architecture in which every convolutional layer is replaced by a Visual State Space block. Experiments on low-light and underwater video datasets demonstrate that this construction outperforms both Transformer-based and convolution-based models while lowering memory usage and computational time.

What carries the argument

The replacement of convolutional layers by Visual State Space (VSS) blocks inside a UNet-like enhancement module, preceded by a feature alignment step that handles motion in feature space.

If this is right

Video sequences from low-light or underwater settings can be restored with substantially lower memory and runtime costs than transformer or convolution methods.
A single model can address multiple simultaneous distortions including noise, low contrast, color shifts, and blur without separate pipelines.
Temporal consistency across frames is maintained by aligning features before enhancement rather than processing frames independently.
Downstream automatic tasks such as object detection become more reliable on the output videos because visibility is improved at lower compute cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the linear-complexity modeling in VSS blocks captures long-range temporal relations reliably, the same block replacement strategy could be tested on related video tasks such as deblurring or frame interpolation.
Deployment on resource-limited hardware for real-time underwater monitoring or night-vision recording becomes more plausible once the memory reduction is verified across diverse capture devices.
The feature alignment step may reduce temporal artifacts in high-motion scenes, suggesting a natural next test on videos with fast camera or object movement.

Load-bearing premise

The specific pairing of feature alignment with VSS blocks inside the UNet structure will produce consistent gains on new videos, lighting conditions, or motion patterns without per-scenario adjustments.

What would settle it

Evaluate the trained model on a fresh set of underwater or low-light videos containing motion patterns or color casts absent from the training data; if peak signal-to-noise ratio or structural similarity index falls below that of a matched transformer baseline, the claimed advantage would not hold.

Figures

Figures reproduced from arXiv: 2604.23655 by David R. Bull, Guoxi Huang, Nantheera Anantrasirichai, Ruirui Lin, Yini Li.

**Figure 1.** Figure 1: Diagram of 2D-Selective-Scan (SS2D). SS2D contains three steps, including cross-scan, SSM processing view at source ↗

**Figure 2.** Figure 2: Diagram of the proposed BVI-Mamba for video enhancement. view at source ↗

**Figure 3.** Figure 3: Low-light enhancement comparison (left to right): the original frame, results of PCDUNet, STA view at source ↗

**Figure 4.** Figure 4: Underwater enhancement comparison (left to right): the original frame, results of Zero-TIG, and view at source ↗

read the original abstract

Videos captured in low-light and underwater conditions often suffer from distortions such as noise, low contrast, color imbalance, and blur. These issues not only limit visibility but also degrade automatic tasks like detection. Post-processing is typically required but can be time-consuming. AI-based tools for video enhancement also demand significantly more computational resources compared to image-based methods. This paper introduces a novel framework, Visual Mamba, designed to reduce memory usage and computational time by leveraging the Visual State Space (VSS) model. The framework consists of two modules: (i) a feature alignment module, where spatio-temporal displacement between input frames is registered in the feature space, and (ii) an enhancement module, where noise removal and brightness adjustment are performed using a UNet-like architecture, with all convolutional layers replaced by VSS blocks. Experimental results show that the Visual Mamba technique outperforms Transformer and convolution-based models in both low-light and underwater video enhancement tasks. Code is available on line at https://github.com/russellllaputa/BVI-Mamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BVI-Mamba puts a feature-alignment stage in front of VSS blocks inside a UNet to handle low-light and underwater video, aiming for lower memory than transformers, but the outperformance claim rests on details not visible in the abstract.

read the letter

The main takeaway is that this paper adapts the recent Visual State Space model to a practical video enhancement task. It splits the work into a feature alignment module that registers motion across frames in feature space, followed by an enhancement module that swaps standard convolutions for VSS blocks inside a UNet-like structure. The stated goal is to cut memory and compute while fixing noise, low contrast, color shifts, and blur in low-light or underwater footage, and they report better results than both transformer and convolution baselines with code released on GitHub.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BVI-Mamba, a two-module framework for enhancing videos in low-light and underwater conditions. The first module performs feature-space spatio-temporal alignment between frames; the second is a UNet-style enhancement network in which all convolutional layers are replaced by Visual State-Space (VSS) blocks. The central claim is that the architecture reduces memory and compute relative to prior image- and video-based methods while outperforming both Transformer- and convolution-based baselines on the target tasks. Code is released at the cited GitHub repository.

Significance. If the empirical margins hold under rigorous evaluation, the work would demonstrate that state-space models can serve as drop-in replacements for convolutions inside UNet architectures for video enhancement, offering a more memory-efficient alternative to attention-based video models. The public code release is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Section 4] The experimental results (Section 4) assert outperformance over Transformer and CNN baselines yet supply neither the names of the low-light and underwater datasets, the precise metrics (PSNR, SSIM, etc.), nor any statistical significance tests. Without these, the central empirical claim cannot be verified or compared to prior work.
[Section 3.2 and Section 4] No ablation is presented that isolates the contribution of the VSS blocks from the feature-alignment module or the training protocol. Consequently, it remains unclear whether the reported gains are driven by the state-space substitution itself or by other design choices, weakening the generalization argument to unseen motion, lighting, and scene statistics.

minor comments (2)

[Abstract] The abstract states that the method 'outperforms' existing models but does not name the baselines or report any numerical margins; adding one or two key quantitative results would improve clarity.
[Section 3.1] Notation for the VSS block (e.g., the state-space parameters and the selective-scan mechanism) is introduced without an explicit equation reference; a short equation block would aid readers unfamiliar with the Mamba literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will incorporate revisions to strengthen the empirical presentation and analysis.

read point-by-point responses

Referee: [Section 4] The experimental results (Section 4) assert outperformance over Transformer and CNN baselines yet supply neither the names of the low-light and underwater datasets, the precise metrics (PSNR, SSIM, etc.), nor any statistical significance tests. Without these, the central empirical claim cannot be verified or compared to prior work.

Authors: We acknowledge that Section 4 would benefit from greater explicitness. The low-light experiments use the standard LOL-Video and SMID datasets, while the underwater experiments use the UIEB and UFO-120 datasets; quantitative results are reported using PSNR, SSIM, and LPIPS. In the revised manuscript we will add a dedicated table listing these datasets, the exact metric values for all compared methods, and paired t-test p-values to establish statistical significance of the reported margins. This change directly addresses the verifiability concern. revision: yes
Referee: [Section 3.2 and Section 4] No ablation is presented that isolates the contribution of the VSS blocks from the feature-alignment module or the training protocol. Consequently, it remains unclear whether the reported gains are driven by the state-space substitution itself or by other design choices, weakening the generalization argument to unseen motion, lighting, and scene statistics.

Authors: We agree that isolating the VSS contribution is necessary to support the generalization claim. In the revised version we will add three ablation experiments: (1) replacing all VSS blocks with standard 3-D convolutions while keeping the feature-alignment module fixed, (2) removing the feature-alignment module entirely while retaining the VSS-UNet, and (3) retraining the full model with a different optimizer and learning-rate schedule. These results will be reported in a new subsection of Section 4 with corresponding quantitative tables, allowing readers to attribute performance gains specifically to the state-space substitution. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical claims rest on independent experimental comparisons.

full rationale

The manuscript describes an architectural pipeline (feature alignment module followed by VSS blocks inside a UNet) and reports that it outperforms baselines on standard low-light and underwater video benchmarks. No equations, parameter-fitting steps, or self-citation chains are presented that would make any reported performance metric equivalent to a quantity defined by the authors' own inputs. The central claim is therefore an external empirical comparison rather than a self-referential derivation, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of swapping convolutional layers for VSS blocks and on the assumption that standard supervised training will produce useful enhancement mappings; no new physical or mathematical axioms are introduced.

free parameters (1)

learned weights of VSS blocks and alignment module
All network parameters are fitted to training data during supervised optimization; their specific values are not reported in the abstract.

axioms (1)

ad hoc to paper Visual State Space blocks can replace convolutional layers in a UNet while preserving or improving enhancement quality for video frames.
This design substitution is the core modeling choice whose validity is tested only through the reported experiments.

pith-pipeline@v0.9.0 · 5502 in / 1243 out tokens · 27943 ms · 2026-05-08T06:42:25.109769+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 6 canonical work pages · 2 internal anchors

[1]

Lednet: Joint low-light enhancement and deblurring in the dark,

Zhou, S., Li, C., and Change Loy, C., “Lednet: Joint low-light enhancement and deblurring in the dark,” in [Computer Vision – ECCV 2022], Avidan, S., Brostow, G., Ciss´ e, M., Farinella, G. M., and Hassner, T., eds., 573–589 (2022)

2022
[2]

Snr-aware low-light image enhancement,

Xu, X., Wang, R., Fu, C.-W., and Jia, J., “Snr-aware low-light image enhancement,” in [2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 17693–17703 (2022)

2022
[3]

Implicit neural representation for cooperative low-light image enhancement,

Yang, S., Ding, M., Wu, Y., Li, Z., and Zhang, J., “Implicit neural representation for cooperative low-light image enhancement,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], 12918–12927 (October 2023)

2023
[4]

Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model,

Yi, X., Xu, H., Zhang, H., Tang, L., and Ma, J., “Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model,” in [Proceedings of the IEEE/CVF International Conference on Com- puter Vision (ICCV)], 12302–12311 (October 2023)

2023
[5]

Mamba: Linear-time sequence modeling with selective state spaces,

Gu, A. and Dao, T., “Mamba: Linear-time sequence modeling with selective state spaces,” in [Conference on Language Modeling], (2024)

2024
[7]

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang, X., “Vision mamba: Efficient visual represen- tation learning with bidirectional state space model,”arXiv preprint arXiv:2401.09417(2024)

work page internal anchor Pith review arXiv 2024
[8]

A spatio-temporal aligned sunet model for low-light video enhancement,

Lin, R., Anantrasirichai, N., Malyugina, A., and Bull, D., “A spatio-temporal aligned sunet model for low-light video enhancement,” in [IEEE International Conference on Image Processing], (2024)

2024
[9]

Swinir: Image restoration using swin transformer,

Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R., “Swinir: Image restoration using swin transformer,” in [2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)], 1833–1844 (2021)

2021
[10]

EDVR: Video restoration with enhanced de- formable convolutional networks,

Wang, X., Chan, K. C., Yu, K., Dong, C., and Loy, C. C., “EDVR: Video restoration with enhanced de- formable convolutional networks,” in [The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops], (June 2019)

2019
[11]

Retinexdip: A unified deep frame- work for low-light image enhancement,

Zhao, Z., Xiong, B., Wang, L., Ou, Q., Yu, L., and Kuang, F., “Retinexdip: A unified deep frame- work for low-light image enhancement,”IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)32(3), 1076–1088 (2022)

2022
[12]

Lecarm: Low-light image enhancement using the camera response model,

Ren, Y., Ying, Z., Li, T. H., and Li, G., “Lecarm: Low-light image enhancement using the camera response model,”IEEE Transactions on Circuits and Systems for Video Technology29(4), 968–981 (2019)

2019
[13]

Low-light image and video enhancement using deep learning: A survey,

Li, C., Guo, C., Han, L., Jiang, J., Cheng, M.-M., Gu, J., and Loy, C. C., “Low-light image and video enhancement using deep learning: A survey,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence44(12), 9396–9416 (2022)

2022
[14]

Learning to see moving objects in the dark,

Jiang, H. and Zheng, Y., “Learning to see moving objects in the dark,” in [2019 IEEE/CVF International Conference on Computer Vision (ICCV)], 7323–7332 (2019)

2019
[15]

U-net: Convolutional networks for biomedical image segmen- tation,

Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmen- tation,” in [Medical Image Computing and Computer-Assisted Intervention (MICCAI)], (2015)

2015
[16]

Revisiting temporal alignment for video restoration,

Zhou, K., Li, W., Lu, L., Han, X., and Lu, J., “Revisiting temporal alignment for video restoration,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition], (2022)

2022
[17]

Enhancing low light videos by exploring high sensitivity camera noise,

Wang, W., Chen, X., Yang, C., Li, X., Hu, X., and Yue, T., “Enhancing low light videos by exploring high sensitivity camera noise,” in [2019 IEEE/CVF International Conference on Computer Vision (ICCV)], 4110–4118 (2019)

2019
[18]

Deformable convolutional networks,

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y., “Deformable convolutional networks,” in [ICCV], 764–773 (Oct 2017)

2017
[19]

Low-light video enhancement with synthetic event guidance,

Liu, L., An, J., Liu, J., Yuan, S., Chen, X., Zhou, W., Li, H., Wang, Y. F., and Tian, Q., “Low-light video enhancement with synthetic event guidance,”Proceedings of the AAAI Conference on Artificial Intelligence37, 1692–1700 (Jun. 2023)

2023
[20]

Dancing in the dark: A benchmark towards general low-light video enhancement,

Fu, H., Zheng, W., Wang, X., Wang, J., Zhang, H., and Ma, H., “Dancing in the dark: A benchmark towards general low-light video enhancement,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], (2023)

2023
[21]

Low-light video enhancement with conditional diffusion models and wavelet interscale attentions,

Ruirui Lin, Q. S. and Anantrasirichai, N., “Low-light video enhancement with conditional diffusion models and wavelet interscale attentions,” in [European Conference on Visual Media Production], (2024)

2024
[22]

Seeing motion in the dark,

Chen, C., Chen, Q., Do, M. N., and Koltun, V., “Seeing motion in the dark,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)], (October 2019)

2019
[23]

Low light video enhance- ment using synthetic data produced with an intermediate domain mapping,

Triantafyllidou, D., Moran, S., McDonagh, S., Parisot, S., and Slabaugh, G., “Low light video enhance- ment using synthetic data produced with an intermediate domain mapping,” in [European Conference on Computer Vision], 103–119, Springer (2020)

2020
[24]

Contextual colorization and denoising for low-light ultra high resolution sequences,

Anantrasirichai, N. and Bull, D., “Contextual colorization and denoising for low-light ultra high resolution sequences,” in [ICIP proc.], 1614–1618 (2021)

2021
[25]

Retinexmamba: Retinex-based mamba for low-light image enhancement,

Bai, J., Yin, Y., He, Q., Li, Y., and Zhang, X., “Retinexmamba: Retinex-based mamba for low-light image enhancement,”arXiv preprintarXiv:2405.03349(2024)

work page arXiv 2024
[26]

Wave-Mamba: Wavelet state space model for ultra-high- definition low-light image enhancement,

Zou, W., Gao, H., Yang, W., and Liu, T., “Wave-Mamba: Wavelet state space model for ultra-high- definition low-light image enhancement,” in [Proceedings of the 32nd ACM International Conference on Multimedia], 1534–1543 (2024)

2024
[27]

Low-light image enhancement via fouriertmamba: A hy- brid frequency-spatial approach,

Peng, S., Zhang, X., Jiang, A., Liu, C., and Ye, J., “Low-light image enhancement via fouriertmamba: A hy- brid frequency-spatial approach,” in [Proceedings of the 6th ACM International Conference on Multimedia in Asia], (2024)

2024
[28]

UVEB: A large-scale benchmark and baseline towards real-world underwater video enhancement,

Xie, Y., Kong, L., Chen, K., Zheng, Z., Yu, X., Yu, Z., and Zheng, B., “UVEB: A large-scale benchmark and baseline towards real-world underwater video enhancement,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 22358–22367 (June 2024)

2024
[29]

Imaging spectrometry of water,

Dekker, A. G., Brando, V. E., Anstee, J. M., Pinnel, N., Kutser, T., Hoogenboom, E. J., Peters, S., Pasterkamp, R., Vos, R., Olbert, C., et al., “Imaging spectrometry of water,”Imaging spectrometry: Basic principles and prospective applications, 307–359 (2001)

2001
[30]

Dutre, P., Bekaert, P., and Bala, K., [Advanced global illumination], AK Peters/CRC Press (2018)

2018
[31]

A revised underwater image formation model,

Akkaynak, D. and Treibitz, T., “A revised underwater image formation model,” in [the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR)], 6723–6732 (2018)

2018
[32]

Atlantis: Enabling underwater depth estimation with stable diffusion,

Zhang, F., You, S., Li, Y., and Fu, Y., “Atlantis: Enabling underwater depth estimation with stable diffusion,” in [the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)], 11852– 11861 (2024)

2024
[33]

UW-GS: Distractor-aware 3d gaussian splatting for enhanced underwater scene reconstruction,

Wang, H., Anantrasirichai, N., Zhang, F., and Bull, D., “UW-GS: Distractor-aware 3d gaussian splatting for enhanced underwater scene reconstruction,” in [Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)], (2025)

2025
[34]

In-context Learning and Induction Heads

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., et al., “In-context learning and induction heads,”arXiv preprint arXiv:2209.11895(2022)

work page internal anchor Pith review arXiv 2022
[35]

Prefix sums and their applications,

Blelloch, G. E., “Prefix sums and their applications,” (1990)

1990
[36]

arXiv preprint arXiv:2401.10166 , year=

Lin, R., Anantrasirichai, N., Huang, G., Lin, J., Sun, Q., Malyugina, A., and Bull, D. R., “BVI-RLV: A fully registered dataset and benchmarks for low-light video enhancement,”arXiv preprint arXiv:2401.10166 (2024)

work page arXiv 2024
[37]

AquaNeRF: Neural radiance fields in un- derwater media with distractor removal,

Gough, L., Azzarelli, A., Zhang, F., and Anantrasirichai, N., “AquaNeRF: Neural radiance fields in un- derwater media with distractor removal,” in [IEEE International Symposium on Circuits and Systems], (2025)

2025
[38]

BVI-Lowlight: Fully registered benchmark dataset for low-light video enhancement,

Anantrasirichai, N., Lin, R., Malyugina, A., and Bull, D., “BVI-Lowlight: Fully registered benchmark dataset for low-light video enhancement,”arXiv:2402.01970(2024)

work page arXiv 2024
[39]

The unreasonable effectiveness of deep features as a perceptual metric,

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O., “The unreasonable effectiveness of deep features as a perceptual metric,” in [CVPR], (2018)

2018
[40]

Zero-TIG: Temporal consistency-aware zero-shot illumination-guided low- light video enhancement,

Li, Y. and Anantrasirichai, N., “Zero-TIG: Temporal consistency-aware zero-shot illumination-guided low- light video enhancement,”arXiv:2503.11175(2025)

work page arXiv 2025