A Compact Hybrid Convolution-Frequency State Space Network for Learned Image Compression
Pith reviewed 2026-05-17 04:47 UTC · model grok-4.3
The pith
A hybrid convolution and frequency state space network achieves competitive rate-distortion performance in learned image compression by preserving 2D neighborhood relations while modeling long-range dependencies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that HCFSSNet reaches rate-distortion performance competitive with recent learned image compression codecs on benchmark datasets, while addressing the quadratic complexity of Transformer backbones and the loss of 2D continuity in flattened state space scans. The network is built from convolutional layers and Vision Frequency State Space blocks, each containing an omni-directional neighborhood state space module and an adaptive frequency modulation module, together with a Frequency Swin Transformer Attention Module in the hyperprior.
What carries the argument
The Vision Frequency State Space block, which scans features in four directions to keep 2D neighborhood relations and applies discrete-cosine-transform-based adaptive reweighting to frequency components for long-range aggregation.
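The four-direction scan can be pictured as four different visiting orders over the same feature grid. The following is a minimal NumPy sketch, not the authors' implementation: the function names `scan_orders`, `scan`, and `unscan` are hypothetical, and the choice of rows, columns, diagonals, and anti-diagonals as the four directions is an assumption based on the abstract's "horizontal, vertical, and diagonal" wording.

```python
import numpy as np

def scan_orders(height, width):
    """Return four visiting orders over an H x W grid as flat index arrays.

    The direction set (rows, columns, diagonals, anti-diagonals) is an
    assumption for illustration; the paper's exact scan paths may differ.
    """
    idx = np.arange(height * width).reshape(height, width)
    rows, cols = np.indices((height, width))
    r, c, flat = rows.ravel(), cols.ravel(), idx.ravel()
    return {
        "horizontal": flat.copy(),                       # row by row
        "vertical": idx.T.ravel(),                       # column by column
        "diagonal": flat[np.lexsort((r, r + c))],        # grouped by row + col
        "anti_diagonal": flat[np.lexsort((r, r - c))],   # grouped by row - col
    }

def scan(feature, order):
    """Flatten an (H, W, C) feature map into an (H*W, C) sequence."""
    h, w, ch = feature.shape
    return feature.reshape(h * w, ch)[order]

def unscan(sequence, order, height, width):
    """Inverse of scan: put sequence elements back at their grid positions."""
    out = np.empty_like(sequence)
    out[order] = sequence
    return out.reshape(height, width, -1)
```

In a full block, each of the four sequences would pass through a sequence model before being mapped back with `unscan` and fused; that step is omitted here because the source does not specify it.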
If this is right
- Convolutional layers continue to handle fine local details while the state space path supplies complementary global context.
- Multi-directional scanning reduces the neighborhood discontinuity that occurs when 2D features are flattened to 1D sequences.
- Adaptive frequency reweighting improves the modeling of important spectral components without quadratic attention cost.
- The frequency-aware hyperprior module supplies better side information for entropy coding.
- The overall network reaches performance levels comparable to recent learned image compression methods on standard test sets.
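The frequency reweighting idea in the list above can be sketched as a 2D DCT, a per-frequency scaling, and an inverse DCT. This is an illustrative sketch only: `dct_matrix` and `reweight_frequencies` are hypothetical names, and the weights are passed in as a fixed array, whereas the paper describes them as adaptive (how they are predicted is not stated in the source).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are frequencies)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)      # DC row uses a different scale
    return mat

def reweight_frequencies(feature, weights):
    """Scale each 2D DCT coefficient of an (H, W) map by the matching weight.

    `weights` stands in for the learned, input-dependent modulation; here it
    is just an (H, W) array supplied by the caller (an assumption).
    """
    h, w = feature.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ feature @ dw.T      # forward 2D DCT
    coeffs = coeffs * weights         # per-frequency reweighting
    return dh.T @ coeffs @ dw         # inverse 2D DCT
```

With all weights equal to one the input is returned unchanged, and keeping only the DC weight collapses the map to its mean, which gives a quick sanity check on the transform pair.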
Where Pith is reading between the lines
- The four-direction scan pattern could be tested on other dense 2D tasks such as semantic segmentation where neighborhood continuity matters.
- Frequency reweighting might reduce ringing or blocking artifacts in regions with strong textures under heavy compression.
- The hybrid pattern may transfer to video codecs that need both spatial continuity and efficient temporal modeling.
- If the method proves stable across datasets, it could support lighter decoder implementations for mobile or edge devices.
Load-bearing premise
Scanning along four directions plus adaptive frequency reweighting is sufficient to restore 2D neighborhood continuity and long-range modeling without introducing artifacts or requiring undisclosed hyperparameter tuning.
What would settle it
If rate-distortion curves on Kodak or CLIC datasets place HCFSSNet below recent LIC codecs at multiple bit rates, or if visual inspection shows new artifacts traceable to the four-direction scan and frequency module, the competitive-performance claim would be refuted.
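One common way to condense such a curve comparison into a single number is the Bjøntegaard delta rate (BD-rate). The sketch below follows the usual cubic-fit recipe in NumPy; `bd_rate` is a hypothetical helper, it assumes at least four rate-PSNR points per codec with overlapping PSNR ranges, and it is not taken from the paper's evaluation code.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec versus the anchor at equal PSNR.

    Fits log-rate as a cubic polynomial of PSNR for each codec and compares
    the two fits over the PSNR interval both curves cover.
    """
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)
    fit_a = np.polyfit(pa, np.log(np.asarray(rate_anchor, dtype=float)), 3)
    fit_t = np.polyfit(pt, np.log(np.asarray(rate_test, dtype=float)), 3)

    lo = max(pa.min(), pt.min())      # shared PSNR interval
    hi = min(pa.max(), pt.max())

    # Integrate each fitted log-rate curve over the shared interval.
    int_a = np.polyval(np.polyint(fit_a), hi) - np.polyval(np.polyint(fit_a), lo)
    int_t = np.polyval(np.polyint(fit_t), hi) - np.polyval(np.polyint(fit_t), lo)

    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

A negative value means the test codec needs fewer bits than the anchor for the same PSNR; a positive value at several operating points would count against the competitive-performance claim.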
Original abstract
Learned image compression (LIC) has recently benefited from Transformer- and state space model (SSM)-based backbones for modeling long-range dependencies. However, the former typically incurs quadratic complexity, whereas the latter often disrupts neighborhood continuity by flattening 2D features into 1D sequences. To address these issues, we propose a compact Hybrid Convolution and Frequency State Space Network (HCFSSNet) for LIC. HCFSSNet combines convolutional layers for local detail modeling with a Vision Frequency State Space (VFSS) block for complementary long-range contextual aggregation. Specifically, the VFSS block consists of a Vision Omni-directional Neighborhood State Space (VONSS) module, which scans features along horizontal, vertical, and diagonal directions to better preserve 2D neighborhood relations, and an Adaptive Frequency Modulation Module (AFMM), which performs discrete cosine transform-based adaptive reweighting of frequency components. In addition, we introduce a Frequency Swin Transformer Attention Module (FSTAM) in the hyperprior path to enhance frequency-aware side information modeling. Experiments on the benchmark datasets show that the proposed HCFSSNet achieves competitive rate-distortion performance against recent LIC codecs. The source code and models will be made publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HCFSSNet, a compact hybrid network for learned image compression that integrates convolutional layers for local modeling with a Vision Frequency State Space (VFSS) block. The VFSS comprises a Vision Omni-directional Neighborhood State Space (VONSS) module that scans features along horizontal, vertical, and diagonal directions, paired with an Adaptive Frequency Modulation Module (AFMM) that applies DCT-based adaptive reweighting of frequency components. A Frequency Swin Transformer Attention Module (FSTAM) is added in the hyperprior path. The central claim is that this architecture delivers competitive rate-distortion performance against recent LIC codecs on benchmark datasets, with source code and models to be released publicly.
Significance. If the rate-distortion claims are substantiated with quantitative evidence, the work could advance efficient long-range modeling in LIC by addressing 2D neighborhood continuity in SSMs via multi-directional scanning and frequency modulation, offering a lower-complexity alternative to Transformer backbones. The explicit commitment to public code release strengthens reproducibility and allows direct verification of the hybrid design.
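The complexity argument behind "lower-complexity alternative" can be written out with the standard asymptotic costs. The symbols below are assumptions introduced for illustration and do not come from the manuscript: N = HW is the number of spatial positions in an H x W feature map, d the channel width, and s the state dimension of the state space model.

```latex
% N = H W positions, d = channel width, s = SSM state dimension
% (notation assumed for illustration, not taken from the paper)
\begin{align}
\text{global self-attention:} \quad & \mathcal{O}(N^{2} d) \\
\text{state space scan, one direction:} \quad & \mathcal{O}(N d s) \\
\text{four scan directions:} \quad & \mathcal{O}(4 N d s) = \mathcal{O}(N d s)
\end{align}
```

Under these assumptions, adding scan directions changes only the constant factor, so the cost stays linear in the number of positions. Window-based attention is also linear in N for a fixed window size, so the comparison that matters in practice is likely one of constants and measured latency rather than asymptotics.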
major comments (3)
- [Abstract and §4 (Experiments)] The central claim of competitive rate-distortion performance against recent LIC codecs is stated without any quantitative tables, BD-rate values, PSNR/bpp curves, error bars, or direct numerical comparisons to baselines. This leaves the primary empirical result, on which acceptance depends, unverified.
- [§3.2 (VONSS module)] The assertion that scanning along four directions (horizontal, vertical, diagonals) plus AFMM fusion sufficiently restores 2D neighborhood continuity lacks supporting analysis, such as feature visualizations, directional bias metrics, or an ablation isolating VONSS from AFMM. Without this, the design's ability to avoid artifacts or hidden tuning remains untested, which directly affects the performance claim.
- [§4.1 (Training and evaluation protocol)] No details are provided on dataset splits, training hyperparameters, optimization settings, or evaluation protocol (e.g., Kodak, CLIC, or Tecnick splits), which are required to assess whether the reported competitive results are reproducible or robust.
minor comments (2)
- Ensure consistent acronym usage and expansion on first mention for VFSS, VONSS, AFMM, and FSTAM throughout the text and figures.
- If architecture diagrams are present, add explicit labels for the fusion points between convolutional paths, VONSS scans, and AFMM reweighting to improve clarity.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each major comment point-by-point below and commit to incorporating the suggested improvements in the revised manuscript to enhance clarity and substantiation of our claims.
- Referee: [Abstract and §4 (Experiments)] The central claim of competitive rate-distortion performance against recent LIC codecs is stated without any quantitative tables, BD-rate values, PSNR/bpp curves, error bars, or direct numerical comparisons to baselines. This leaves the primary empirical result, on which acceptance depends, unverified.
Authors: We acknowledge that the quantitative evidence supporting the competitive rate-distortion performance needs to be more explicitly presented. In the original manuscript, the experiments section discusses the results qualitatively, but to address this concern, we will add detailed tables with BD-rate values relative to baselines such as VVC, Ballé et al., and recent SSM-based methods. We will also include rate-distortion curves and specify any error bars or multiple runs for robustness. This revision will make the empirical claims verifiable.
revision: yes
- Referee: [§3.2 (VONSS module)] The assertion that scanning along four directions (horizontal, vertical, diagonals) plus AFMM fusion sufficiently restores 2D neighborhood continuity lacks supporting analysis, such as feature visualizations, directional bias metrics, or an ablation isolating VONSS from AFMM. Without this, the design's ability to avoid artifacts or hidden tuning remains untested, which directly affects the performance claim.
Authors: We agree that empirical validation of the VONSS design choice is important. We will include an ablation study in the revised paper that isolates the effect of multi-directional scanning versus single-direction scanning and the contribution of AFMM. Additionally, we will add visualizations of feature maps or attention patterns to illustrate the preservation of 2D neighborhood continuity. While the current results demonstrate the overall effectiveness, these additions will provide direct support for the architectural decisions.
revision: yes
- Referee: [§4.1 (Training and evaluation protocol)] No details are provided on dataset splits, training hyperparameters, optimization settings, or evaluation protocol (e.g., Kodak, CLIC, or Tecnick splits), which are required to assess whether the reported competitive results are reproducible or robust.
Authors: This is a valid point, and we apologize for not including these details in the initial submission. The revised manuscript will expand the training and evaluation section to specify the dataset splits (e.g., training on 256x256 patches from ImageNet or similar, testing on Kodak, CLIC professional, Tecnick), hyperparameters including learning rate schedule, batch size, optimizer settings, loss function weights, and the exact evaluation protocol for computing PSNR and bpp. We will also mention the hardware used for training to aid reproducibility. The planned public release of code and models will further support verification.
revision: yes
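For reference, the two quantities named in the last response, PSNR and bits per pixel (bpp), are usually computed as in the following sketch. The helper names `psnr` and `bits_per_pixel` are hypothetical, and details such as color space, per-image versus dataset averaging, and whether header bytes are counted are exactly the protocol choices the referee asks the authors to state.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def bits_per_pixel(num_bytes, height, width):
    """Bitstream size in bits divided by the number of image pixels."""
    return 8.0 * num_bytes / (height * width)
```

Casting to float before subtracting avoids unsigned integer wraparound when the inputs are 8-bit arrays.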
Circularity Check
No circularity: empirical results from explicit architecture choices on standard benchmarks
The paper introduces HCFSSNet as a hybrid convolutional and frequency state-space architecture for learned image compression, with VFSS blocks containing VONSS (multi-directional scanning) and AFMM (DCT reweighting), plus FSTAM in the hyperprior. The central claim of competitive rate-distortion performance is supported by experiments on benchmark datasets rather than any mathematical derivation. No equations, predictions, or results reduce by construction to fitted inputs or self-citations; the design decisions are stated explicitly as module choices, and performance is measured externally against other codecs without self-referential fitting or renaming of known patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- network weights and hyperparameters
axioms (2)
- domain assumption: The discrete cosine transform provides a useful frequency basis for reweighting image features.
- domain assumption: Scanning along horizontal, vertical, and diagonal directions sufficiently preserves 2D spatial continuity.