SAMIC: A Lightweight Semantic-Aware Mamba for Efficient Perceptual Image Compression
Pith reviewed 2026-05-08 16:45 UTC · model grok-4.3
The pith
Semantic-aware Mamba scanning with SVD redundancy reduction yields efficient perceptual image compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that using semantic feature clustering to direct Mamba scanning, combined with a learnable soft-threshold low-rank approximation of the latents, produces a full encoder-decoder system that matches state-of-the-art perceptual compression results at lower model complexity.
What carries the argument
A semantic-aware Mamba block (SAMB) that clusters semantic features to guide scanning and ease causality constraints, together with an SVD-inspired redundancy reduction module (SVD-RRM) that performs low-rank approximation on encoder latents via a learnable soft threshold.
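The cluster-then-scan idea can be sketched in a few lines. The paper does not specify its clustering algorithm, so the plain k-means and the "group same-cluster tokens, then scan" ordering below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def cluster_guided_scan_order(tokens, n_clusters=4, n_iters=10, seed=0):
    """Reorder a token sequence so semantically similar tokens are
    contiguous before a causal (Mamba-style) scan, and return the
    permutation plus its inverse so spatial layout can be restored.

    tokens: (N, C) array of patch features in flattened raster order.
    """
    rng = np.random.default_rng(seed)
    # Plain k-means in feature space stands in for the paper's
    # (unspecified) dynamic semantic clustering.
    centers = tokens[rng.choice(len(tokens), n_clusters, replace=False)]
    for _ in range(n_iters):
        d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = tokens[labels == k].mean(0)
    # Stable sort by cluster label: same-cluster tokens become adjacent,
    # and raster order is preserved within each cluster.
    perm = np.argsort(labels, kind="stable")
    inv = np.empty_like(perm)
    inv[perm] = np.arange(len(perm))
    return perm, inv

# Usage: scan in semantic order, then undo the permutation.
tokens = np.random.default_rng(1).normal(size=(64, 8))
perm, inv = cluster_guided_scan_order(tokens)
scanned = tokens[perm]    # fed to the sequential scan
restored = scanned[inv]   # back to raster layout for the decoder
assert np.allclose(restored, tokens)
```

The inverse permutation is what lets the block slot into an encoder-decoder without disturbing the spatial grid that later convolutional stages expect.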
If this is right
- Perceptual compression becomes feasible with linear-complexity state space models rather than quadratic or generative overhead.
- Dynamic semantic guidance maintains spatial structure across the compression pipeline.
- Channel-wise redundancy drops through low-rank approximation without separate post-processing.
- The same SAMB design works in both encoder and decoder for end-to-end gains.
- Overall model size shrinks while visual fidelity at low rates stays competitive.
Where Pith is reading between the lines
- The same clustering idea could address scanning-order problems in Mamba-based video or 3D compression.
- Learnable low-rank modules may transfer to other latent-space tasks that need redundancy control.
- Semantic guidance might serve as a general fix for causality limits in state space vision models beyond compression.
- Edge-device deployment becomes more practical if the complexity reduction holds on mobile hardware.
Load-bearing premise
Semantic clustering in the Mamba block preserves spatial correlations without creating artifacts or training instability.
What would settle it
A side-by-side rate-distortion-perception curve on the Kodak or CLIC benchmarks in which the new method falls below existing approaches at matched bitrates, or requires equal or higher compute.
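A matched-bitrate check of this kind can be sketched directly: interpolate each method's perceptual score at the same bitrates and compare. The operating points below are hypothetical illustrative numbers, not results from the paper:

```python
import numpy as np

def quality_at_bpp(bpp_points, lpips_points, target_bpp):
    """Interpolate a method's LPIPS (lower is better) at a target
    bitrate from measured (bpp, LPIPS) operating points."""
    order = np.argsort(bpp_points)
    return np.interp(target_bpp,
                     np.asarray(bpp_points)[order],
                     np.asarray(lpips_points)[order])

# Hypothetical operating points for illustration only.
ours     = {"bpp": [0.05, 0.10, 0.20], "lpips": [0.30, 0.20, 0.12]}
baseline = {"bpp": [0.05, 0.10, 0.20], "lpips": [0.34, 0.24, 0.15]}

for bpp in (0.08, 0.15):
    q_ours = quality_at_bpp(ours["bpp"], ours["lpips"], bpp)
    q_base = quality_at_bpp(baseline["bpp"], baseline["lpips"], bpp)
    print(f"bpp={bpp}: ours LPIPS={q_ours:.3f}, baseline LPIPS={q_base:.3f}")
```

A method "settles it" in its favor only if such interpolated comparisons favor it across the whole rate range, not at a single cherry-picked point.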
Original abstract
Perceptual image compression focuses on preserving high visual quality under low-bitrate constraints. Most existing approaches to perceptual compression leverage the strong generative capabilities of generative adversarial networks or diffusion models, at the cost of substantial model complexity. To this end, we present an efficient perceptual image compression method that exploits the long-range modeling capability and linear computational complexity of state space models, with a particular focus on Mamba. Unlike existing methods that rely on an inherently fixed scanning order and consequently impair semantic continuity and spatial correlation, we develop a semantic-aware Mamba block (SAMB) to enable scanning guided by dynamically clustered semantic features, thereby alleviating the strict causality constraints and long-range information decay inherent to Mamba. Inspired by singular value decomposition, we design an SVD-inspired redundancy reduction module (SVD-RRM) that performs a low-rank approximation on the latent features by introducing a learnable soft threshold, leading to channel-wise redundancy information reduction. The proposed SAMB is integrated into both the encoder and decoder of the compression framework, whereas the SVD-RRM is incorporated only in the encoder. Extensive experiments demonstrate that our method performs favorably against state-of-the-art approaches in terms of rate-distortion-perception tradeoff and model complexity. The source code and pretrained models will be available at https://github.com/Jasmine-aiq/SAMIC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce SAMIC, a lightweight perceptual image compression framework that integrates a semantic-aware Mamba block (SAMB) using dynamic semantic feature clustering to guide scanning and mitigate Mamba's inherent causality constraints and long-range decay, plus an SVD-inspired redundancy reduction module (SVD-RRM) performing learnable low-rank approximation on encoder latents. It asserts superior rate-distortion-perception tradeoffs and reduced model complexity versus state-of-the-art methods, backed by extensive experiments, with code and models to be released.
Significance. If substantiated, the work would be significant as a demonstration of adapting state-space models like Mamba for perceptual compression tasks, offering a more efficient alternative to complex GAN- or diffusion-based methods while addressing Mamba's vision-specific limitations through semantic guidance. The explicit commitment to release source code and pretrained models supports reproducibility and broader adoption.
major comments (2)
- [SAMB description] SAMB description: The central assumption that dynamic semantic clustering preserves spatial correlations and successfully alleviates Mamba causality/long-range decay without introducing artifacts or instability is load-bearing for the RD-perception claims, yet lacks supporting mechanism details, clustering algorithm specification, or ablation studies comparing clustered vs. fixed/random scanning on perceptual metrics and training stability.
- [Results section] Results section: The claim of favorable performance against SOTA in rate-distortion-perception tradeoff and model complexity rests on unspecified experiments; without visible quantitative tables, baseline details (e.g., specific BD-rate, LPIPS, or FLOPs comparisons), or statistical controls, the magnitude and reliability of gains cannot be assessed.
minor comments (1)
- The paper would benefit from explicit first-use definitions for all acronyms (SAMB, SVD-RRM) and a clearer notation section for the learnable soft threshold in SVD-RRM.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript. We have carefully addressed each major comment below and will incorporate revisions to enhance the clarity and completeness of the paper.
Point-by-point responses
-
Referee: The central assumption that dynamic semantic clustering preserves spatial correlations and successfully alleviates Mamba causality/long-range decay without introducing artifacts or instability is load-bearing for the RD-perception claims, yet lacks supporting mechanism details, clustering algorithm specification, or ablation studies comparing clustered vs. fixed/random scanning on perceptual metrics and training stability.
Authors: We agree that the SAMB description requires additional detail to fully support the claims. In the revised manuscript, we will expand Section 3.2 with a step-by-step explanation of how the dynamic semantic feature clustering guides the scanning order to preserve spatial correlations and mitigate Mamba's causality and decay issues. We will also specify the clustering algorithm and its hyperparameters as implemented. Furthermore, we will add ablation experiments in Section 4.3 that directly compare the proposed semantic-guided scanning against fixed and random alternatives, reporting effects on perceptual metrics (LPIPS, DISTS) and training stability indicators. These additions will substantiate the design choices without altering the core method. revision: yes
-
Referee: The claim of favorable performance against SOTA in rate-distortion-perception tradeoff and model complexity rests on unspecified experiments; without visible quantitative tables, baseline details (e.g., specific BD-rate, LPIPS, or FLOPs comparisons), or statistical controls, the magnitude and reliability of gains cannot be assessed.
Authors: We apologize if the quantitative results were not presented with sufficient visibility or detail in the reviewed version. The manuscript already includes Tables 1–3 and Figures 3–5 in Section 4, which report BD-rate, LPIPS, FID, PSNR, and complexity metrics (FLOPs, parameters) against multiple SOTA baselines. In the revision, we will add explicit descriptions of all baselines (including exact versions and training settings), provide the full set of numerical comparisons, and include statistical controls such as standard deviations across multiple runs. This will allow readers to better evaluate the reliability and magnitude of the reported improvements. revision: yes
Circularity Check
No circularity in derivation chain; new modules and experimental validation are independent
Full rationale
The paper proposes two novel components (SAMB for semantic-guided scanning in Mamba and SVD-RRM for low-rank latent approximation) and integrates them into an encoder-decoder framework. Claims of favorable RD-perception tradeoff rest on experimental comparisons to SOTA methods rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No uniqueness theorems, ansatzes smuggled via prior self-work, or renamings of known results appear in the abstract or method description. The derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Mamba's linear complexity and long-range modeling hold when the scanning order is dynamically altered by semantic clusters.
- domain assumption: Low-rank approximation via a learnable soft threshold removes redundancy without harming perceptual quality.
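The second assumption can be made concrete with a minimal numerical sketch of soft-thresholded SVD. Here `tau` is a fixed constant and a channels-by-positions matrix layout is assumed, whereas SVD-RRM learns its threshold end to end:

```python
import numpy as np

def svd_soft_threshold(latent, tau):
    """Low-rank approximation of a latent matrix by soft-thresholding
    its singular values: s -> max(s - tau, 0). In SVD-RRM the threshold
    is learnable; here it is fixed for illustration.

    latent: (C, HW) matrix, e.g. channels x flattened spatial positions.
    """
    u, s, vt = np.linalg.svd(latent, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    rank = int((s_shrunk > 0).sum())  # effective rank after shrinkage
    return (u * s_shrunk) @ vt, rank

rng = np.random.default_rng(0)
# A latent with strong channel-wise redundancy: rank-2 signal plus noise.
signal = rng.normal(size=(16, 2)) @ rng.normal(size=(2, 64))
latent = signal + 0.01 * rng.normal(size=(16, 64))
approx, rank = svd_soft_threshold(latent, tau=0.5)
```

On this toy input the shrinkage suppresses the small noise-dominated singular values while barely perturbing the dominant components, which is the sense in which redundancy is removed "without harming" the signal.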
Reference graph
Works this paper leans on
- [1] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. 2019. Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 221–231.
- [2] Mohammad Akbari, Jie Liang, and Jingning Han. 2019. DSSLIC: Deep semantic segmentation-based layered image compression. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2042–2046.
- [3]
- [4] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. 2018. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436 (2018).
- [5] Fabrice Bellard. 2015. BPG Image Format. https://bellard.org/bpg. Accessed: 2024.
- [6] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. 2018. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401 (2018).
- [7]
- [8] Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, and Guo Lu. 2025. CMIC: Content-Adaptive Mamba for Learned Image Compression. arXiv e-prints (2025), arXiv–2508.
- [9] Keyan Ding, Kede Ma, Shiqi Wang, and Eero P Simoncelli. 2020. Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 5 (2020), 2567–2581.
- [10] Rich Franzen. 1999. Kodak PhotoCD Dataset. http://r0k.us/graphics/kodak/. Accessed: 2024.
- [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
- [12] Albert Gu and Tri Dao. 2024. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling.
- [13] Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021).
- [14] Dailan He, Ziming Yang, Hongjiu Yu, Tongda Xu, Jixiang Luo, Yuan Chen, Chenjian Gao, Xinjie Shi, Hongwei Qin, and Yan Wang. 2022. PO-ELIC: Perception-oriented efficient learned image coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1764–1769.
- [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017).
- [16] Yiwen Jia, Hao Wei, Yanhui Zhou, and Chenyang Ge. 2025. One-Step Diffusion for Perceptual Image Compression. In 2025 International Conference on Visual Communications and Image Processing (VCIP). IEEE, 1–5.
- [17] Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, and Liquan Shen. 2023. Multi-modality deep network for extreme learned image compression. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 1033–1041.
- [18] JVET Team. 2021. VVC Official Test Model VTM. Technical Report. Joint Video Experts Team.
- [19] Haowei Kuang, Yiyang Ma, Wenhan Yang, Zongming Guo, and Jiaying Liu. Consistency guided diffusion model with neural syntax for perceptual image compression. In Proceedings of the 32nd ACM International Conference on Multimedia. 1622–1631.
- [20]
- [21]
- [22]
- [23] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. 2023. LSDIR: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1775–1787.
- [24] Zhuoyuan Li, Junqi Liao, Chuanbo Tang, Haotian Zhang, Yuqi Li, Yifan Bian, Xihua Sheng, Xinmin Feng, Yao Li, Changsheng Gao, et al. 2025. USTC-TD: A test dataset and benchmark for image and video coding in 2020s. IEEE Transactions on Multimedia (2025).
- [25] Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, and Jingwen Jiang. 2024. Toward extreme image compression with latent feature guidance and diffusion prior. IEEE Transactions on Circuits and Systems for Video Technology 35, 1 (2024), 888–899.
- [26] Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, and Ajmal Mian. 2025. RDEIC: Accelerating diffusion-based extreme image compression with relay residual diffusion. IEEE Transactions on Circuits and Systems for Video Technology (2025).
- [27] Jinming Liu, Heming Sun, and Jiro Katto. 2023. Learned image compression with mixed transformer-CNN architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14388–14397.
- [28] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Jianbin Jiao, and Yunfan Liu. 2024. VMamba: Visual state space model. Advances in Neural Information Processing Systems 37 (2024), 103031–103063.
- [29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012–10022.
- [30]
- [31] Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. 2020. High-fidelity generative image compression. Advances in Neural Information Processing Systems 33 (2020), 11913–11924.
- [32]
- [33] David Minnen, Johannes Ballé, and George D Toderici. 2018. Joint autoregressive and hierarchical priors for learned image compression. Advances in Neural Information Processing Systems 31 (2018).
- [34] Matthew J Muckley, Alaaeldin El-Nouby, Karen Ullrich, Herve Jegou, and Jakob Verbeek. 2023. Improving statistical fidelity for neural image compression with implicit local likelihood models. In International Conference on Machine Learning. PMLR, 25426–25443.
- [35]
- [36]
- [37] Shiyu Qin, Jinpeng Wang, Yimin Zhou, Bin Chen, Tianci Luo, Baoyi An, Tao Dai, Shu-Tao Xia, and Yaowei Wang. 2025. Cassic: Towards Content-Adaptive State-Space Models for Learned Image Compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15727–15736.
- [38] George Toderici, Lucas Theis, Nick Johnston, Eirikur Agustsson, Fabian Mentzer, Johannes Ballé, Wenzhe Shi, and Radu Timofte. 2020. CLIC 2020: Challenge on learned image compression. Retrieved March 29, 2021.
- [39] Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017).
- [40] Gregory K. Wallace. 1991. The JPEG still picture compression standard. Commun. ACM 34, 4 (April 1991), 30–44. doi:10.1145/103085.103089.
- [41] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. 2003. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2. IEEE, 1398–1402.
- [42] Hao Wei, Yanhui Zhou, Yiwen Jia, Chenyang Ge, Saeed Anwar, and Ajmal Mian. 2025. A lightweight model for perceptual image compression via implicit priors. Neural Networks (2025), 108279.
- [43]
- [44] Ruihan Yang and Stephan Mandt. 2023. Lossy image compression with conditional diffusion models. Advances in Neural Information Processing Systems 36 (2023), 64971–64995.
- [45] Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, and Yan Wang. 2025. MambaIC: State space models for high-performance learned image compression. In Proceedings of the Computer Vision and Pattern Recognition Conference. 18041–18050.
- [46] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 586–595.
- [47]
- [48] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. 2024. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024).
- [49] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. 2022. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17492–17501.
- [50] Renjie Zou, Chunfeng Song, and Zhaoxiang Zhang. 2022. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17492–17501.