A Compact Hybrid Convolution--Frequency State Space Network for Learned Image Compression

Caigui Jiang; Haodong Pan; Hao Wei; Nanning Zheng; Yusong Wang

arxiv: 2511.20151 · v2 · submitted 2025-11-25 · 💻 cs.CV

A Compact Hybrid Convolution--Frequency State Space Network for Learned Image Compression

Haodong Pan , Hao Wei , Yusong Wang , Nanning Zheng , Caigui Jiang This is my paper

Pith reviewed 2026-05-17 04:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords learned image compressionstate space modelhybrid convolutionfrequency modulationrate-distortion performancelong-range dependencieshyperprior modeling

0 comments

The pith

A hybrid convolution and frequency state space network achieves competitive rate-distortion performance in learned image compression by preserving 2D neighborhood relations while modeling long-range dependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HCFSSNet, a backbone for learned image compression that pairs convolutional layers for local pixel details with a Vision Frequency State Space block for distant context. Standard transformers carry quadratic cost for long-range links, while plain state space models flatten 2D features into sequences and lose neighborhood structure. The design adds a module that scans features along horizontal, vertical, and diagonal paths and another that reweights frequency components after a discrete cosine transform. A frequency-aware attention module is also placed in the hyperprior path. This combination matters because it aims to deliver smaller compressed files at the same visual quality on standard benchmarks.

Core claim

The central claim is that HCFSSNet, built from convolutional layers and Vision Frequency State Space blocks containing an omni-directional neighborhood state space module plus an adaptive frequency modulation module, together with a Frequency Swin Transformer Attention Module in the hyperprior, reaches competitive rate-distortion performance on benchmark datasets against recent learned image compression codecs while addressing quadratic complexity and loss of 2D continuity.

What carries the argument

The Vision Frequency State Space block, which scans features in four directions to keep 2D neighborhood relations and applies discrete-cosine-transform-based adaptive reweighting to frequency components for long-range aggregation.

If this is right

Convolutional layers continue to handle fine local details while the state space path supplies complementary global context.
Multi-directional scanning reduces the neighborhood discontinuity that occurs when 2D features are flattened to 1D sequences.
Adaptive frequency reweighting improves the modeling of important spectral components without quadratic attention cost.
The frequency-aware hyperprior module supplies better side information for entropy coding.
The overall network reaches performance levels comparable to recent learned image compression methods on standard test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The four-direction scan pattern could be tested on other dense 2D tasks such as semantic segmentation where neighborhood continuity matters.
Frequency reweighting might reduce ringing or blocking artifacts in regions with strong textures under heavy compression.
The hybrid pattern may transfer to video codecs that need both spatial continuity and efficient temporal modeling.
If the method proves stable across datasets, it could support lighter decoder implementations for mobile or edge devices.

Load-bearing premise

Scanning along four directions plus adaptive frequency reweighting is sufficient to restore 2D neighborhood continuity and long-range modeling without introducing artifacts or requiring undisclosed hyperparameter tuning.

What would settle it

If rate-distortion curves on Kodak or CLIC datasets place HCFSSNet below recent LIC codecs at multiple bit rates, or if visual inspection shows new artifacts traceable to the four-direction scan and frequency module, the competitive-performance claim would be refuted.

Figures

Figures reproduced from arXiv: 2511.20151 by Caigui Jiang, Haodong Pan, Hao Wei, Nanning Zheng, Yusong Wang.

**Figure 2.** Figure 2: Overall architecture of the proposed HCFSSNet. The HCFSS block denotes the [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Illustration of the proposed HCFSS block. LReLU denotes the Leaky ReLU [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: Details of the Vision Omni-directional Neighborhood State Space Module (VON [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Architectural details of the proposed channel-wise entropy model. (a) Overall [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Rate–distortion results (PSNR vs. bitrate, in bpp). From left to right: (a) [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 7.** Figure 7: Visual comparisons on the Tecnick dataset [57]. Compared with the other [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation studies on the Kodak dataset. (a) comparison between the Cross-Scan [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗

read the original abstract

Learned image compression (LIC) has recently benefited from Transformer- and state space models (SSM)- based backbones for modeling long-range dependencies. However, the former typically incurs quadratic complexity, whereas the latter often disrupts neighborhood continuity by flattening 2D features into 1D sequences. To address these issues, we propose a compact Hybrid Convolution and Frequency State Space Network (HCFSSNet) for LIC. HCFSSNet combines convolutional layers for local detail modeling with a Vision Frequency State Space (VFSS) block for complementary long-range contextual aggregation. Specifically, the VFSS block consists of a Vision Omni-directional Neighborhood State Space (VONSS) module, which scans features along horizontal, vertical, and diagonal directions to better preserve 2D neighborhood relations, and an Adaptive Frequency Modulation Module (AFMM), which performs discrete cosine transform-based adaptive reweighting of frequency components. In addition, we introduce a Frequency Swin Transformer Attention Module (FSTAM) in the hyperprior path to enhance frequency-aware side information modeling. Experiments on the benchmark datasets show that the proposed HCFSSNet achieves a competitive rate-distortion performance against recent LIC codecs. The source code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes HCFSSNet, a compact hybrid network for learned image compression that integrates convolutional layers for local modeling with a Vision Frequency State Space (VFSS) block. The VFSS comprises a Vision Omni-directional Neighborhood State Space (VONSS) module that scans features along horizontal, vertical, and diagonal directions, paired with an Adaptive Frequency Modulation Module (AFMM) that applies DCT-based adaptive reweighting of frequency components. A Frequency Swin Transformer Attention Module (FSTAM) is added in the hyperprior path. The central claim is that this architecture delivers competitive rate-distortion performance against recent LIC codecs on benchmark datasets, with source code and models to be released publicly.

Significance. If the rate-distortion claims are substantiated with quantitative evidence, the work could advance efficient long-range modeling in LIC by addressing 2D neighborhood continuity in SSMs via multi-directional scanning and frequency modulation, offering a lower-complexity alternative to Transformer backbones. The explicit commitment to public code release strengthens reproducibility and allows direct verification of the hybrid design.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): The central claim of competitive rate-distortion performance against recent LIC codecs is stated without any quantitative tables, BD-rate values, PSNR/BPP curves, error bars, or direct numerical comparisons to baselines; this leaves the primary empirical result unverified and load-bearing for acceptance.
[§3.2] §3.2 (VONSS module): The assertion that scanning along four directions (horizontal, vertical, diagonals) plus AFMM fusion sufficiently restores 2D neighborhood continuity lacks supporting analysis, such as feature visualizations, directional bias metrics, or ablation isolating VONSS from AFMM; without this, the design's ability to avoid artifacts or hidden tuning remains untested and directly affects the performance claim.
[§4.1] §4.1 (Training and evaluation protocol): No details are provided on dataset splits, training hyperparameters, optimization settings, or evaluation protocol (e.g., Kodak, CLIC, or Tecnick splits), which are required to assess whether the reported competitive results are reproducible or robust.

minor comments (2)

Ensure consistent acronym usage and expansion on first mention for VFSS, VONSS, AFMM, and FSTAM throughout the text and figures.
If architecture diagrams are present, add explicit labels for the fusion points between convolutional paths, VONSS scans, and AFMM reweighting to improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment point-by-point below and commit to incorporating the suggested improvements in the revised manuscript to enhance clarity and substantiation of our claims.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of competitive rate-distortion performance against recent LIC codecs is stated without any quantitative tables, BD-rate values, PSNR/BPP curves, error bars, or direct numerical comparisons to baselines; this leaves the primary empirical result unverified and load-bearing for acceptance.

Authors: We acknowledge that the quantitative evidence supporting the competitive rate-distortion performance needs to be more explicitly presented. In the original manuscript, the experiments section discusses the results qualitatively, but to address this concern, we will add detailed tables with BD-rate values relative to baselines such as VVC, Ballé et al., and recent SSM-based methods. We will also include rate-distortion curves and specify any error bars or multiple runs for robustness. This revision will make the empirical claims verifiable. revision: yes
Referee: [§3.2] §3.2 (VONSS module): The assertion that scanning along four directions (horizontal, vertical, diagonals) plus AFMM fusion sufficiently restores 2D neighborhood continuity lacks supporting analysis, such as feature visualizations, directional bias metrics, or ablation isolating VONSS from AFMM; without this, the design's ability to avoid artifacts or hidden tuning remains untested and directly affects the performance claim.

Authors: We agree that empirical validation of the VONSS design choice is important. We will include an ablation study in the revised paper that isolates the effect of multi-directional scanning versus single-direction and the contribution of AFMM. Additionally, we will add visualizations of feature maps or attention patterns to illustrate the preservation of 2D neighborhood continuity. While the current results demonstrate the overall effectiveness, these additions will provide direct support for the architectural decisions. revision: yes
Referee: [§4.1] §4.1 (Training and evaluation protocol): No details are provided on dataset splits, training hyperparameters, optimization settings, or evaluation protocol (e.g., Kodak, CLIC, or Tecnick splits), which are required to assess whether the reported competitive results are reproducible or robust.

Authors: This is a valid point, and we apologize for not including these details in the initial submission. The revised manuscript will expand the training and evaluation section to specify the dataset splits (e.g., training on 256x256 patches from ImageNet or similar, testing on Kodak, CLIC professional, Tecnick), hyperparameters including learning rate schedule, batch size, optimizer settings, loss function weights, and the exact evaluation protocol for computing PSNR and bpp. We will also mention the hardware used for training to aid reproducibility. The planned public release of code and models will further support verification. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from explicit architecture choices on standard benchmarks

full rationale

The paper introduces HCFSSNet as a hybrid convolutional and frequency state-space architecture for learned image compression, with VFSS blocks containing VONSS (multi-directional scanning) and AFMM (DCT reweighting), plus FSTAM in the hyperprior. The central claim of competitive rate-distortion performance is supported by experiments on benchmark datasets rather than any mathematical derivation. No equations, predictions, or results reduce by construction to fitted inputs or self-citations; the design decisions are stated explicitly as module choices, and performance is measured externally against other codecs without self-referential fitting or renaming of known patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central performance claim rests on the empirical behavior of a new neural architecture whose weights are fitted to standard image-compression training sets; no external physical law or closed-form derivation is invoked.

free parameters (1)

network weights and hyperparameters
All convolutional kernels, SSM parameters, and frequency-modulation coefficients are learned from data; their specific values are not derived from first principles.

axioms (2)

domain assumption Discrete cosine transform provides a useful frequency basis for reweighting image features
Invoked inside the Adaptive Frequency Modulation Module without further justification.
domain assumption Scanning along horizontal, vertical, and diagonal directions sufficiently preserves 2D spatial continuity
Core premise of the Vision Omni-directional Neighborhood State Space module.

pith-pipeline@v0.9.0 · 5525 in / 1363 out tokens · 41260 ms · 2026-05-17T04:47:20.496571+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 1 internal anchor

[1]

Wallace, The jpeg still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1) (1992) xviii–xxxiv

G. Wallace, The jpeg still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1) (1992) xviii–xxxiv. doi:10.1109/30.125072

work page doi:10.1109/30.125072 1992
[2]

Skodras, C

A. Skodras, C. Christopoulos, T. Ebrahimi, The jpeg 2000 still image compression standard, IEEE Signal Processing Magazine 18 (5) (2001) 36–58. doi:10.1109/79.952804

work page doi:10.1109/79.952804 2000
[3]

Ginesu, M

G. Ginesu, M. Pintus, D. D. Giusto, Objective assessment of the webp image coding algorithm, Signal processing: image communication 27 (8) (2012) 867–874

work page 2012
[4]

Bellard, Bpg image format,http://bellard.org/bpg/, accessed: Oct

F. Bellard, Bpg image format,http://bellard.org/bpg/, accessed: Oct. 30, 2018 (2014)

work page 2018
[5]

Bross, Y.-K

B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, J.-R. Ohm, Overview of the versatile video coding (vvc) standard and its applica- tions, IEEE Transactions on Circuits and Systems for Video Technology 31 (10) (2021) 3736–3764

work page 2021
[6]

Y. Xie, K. L. Cheng, Q. Chen, Enhanced invertible encoding for learned image compression, in: Proceedings of the ACM International Confer- ence on Multimedia, 2021, pp. 162–170

work page 2021
[7]

J. Liu, H. Sun, J. Katto, Learned image compression with mixed transformer-cnn architectures, in: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2023, pp. 14388– 14397. 29

work page 2023
[8]

H. Li, S. Li, W. Dai, C. Li, J. Zou, H. Xiong, Frequency-aware trans- former for learned image compression, in: The Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=HKGQDDTuvZ

work page 2024
[9]

C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal 27 (3) (1948) 379–423. doi:10.1002/j.1538- 7305.1948.tb01338.x

work page doi:10.1002/j.1538- 1948
[10]

Auto-Encoding Variational Bayes

D.P.Kingma, M.Welling, Auto-EncodingVariationalBayes, in: 2ndIn- ternational Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. arXiv:http://arxiv.org/abs/1312.6114v10

work page internal anchor Pith review Pith/arXiv arXiv 2014
[11]

Rissanen, G

J. Rissanen, G. Langdon, Universal modeling and coding, IEEE Transactions on Information Theory 27 (1) (1981) 12–23. doi:10.1109/TIT.1981.1056282

work page doi:10.1109/tit.1981.1056282 1981
[12]

G. N. N. Martin, Range encoding: an algorithm for removing redun- dancy from a digitised message, in: Proc. Institution of Electronic and Radio Engineers International Conference on Video and Data Record- ing, Vol. 2, 1979

work page 1979
[14]

Y. Zhu, Y. Yang, T. Cohen, Transformer-based transform coding, in: International Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=IDwN6xjHnK8

work page 2022
[15]

S. Qin, J. Wang, Y. Zhou, B. Chen, T. Luo, B. An, T. Dai, S. Xia, Y. Wang, Mambavc: Learned visual compression with selective state spaces, arXiv preprint arXiv:2405.15413 (2024)

work page arXiv 2024
[16]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 30

work page 2021
[17]

H. Wei, Y. Zhou, Y. Jia, C. Ge, S. Anwar, A. Mian, A lightweight model for perceptual image compression via implicit priors, Neural Networks (2025) 108279

work page 2025
[18]

L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, X. Wang, Vision mamba: Efficient visual representation learning with bidirectional state space model, in: Forty-first International Conference on Machine Learning, 2024

work page 2024
[19]

Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, Y. Liu, VMamba: Visual state space model, in: The Thirty-eighth Annual Con- ference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=ZgtLQQR1K7

work page 2024
[20]

F. Zeng, H. Tang, Y. Shao, S. Chen, L. Shao, Y. Wang, Mambaic: State space models for high-performance learned image compression, in: Pro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 18041–18050

work page 2025
[21]

D.Minnen, J.Ballé, G.D.Toderici, Jointautoregressiveandhierarchical priors for learned image compression, Advances in neural information processing systems 31 (2018)

work page 2018
[22]

J. Zhou, Multi-scale and context-adaptive entropy model for image com- pression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019

work page 2019
[23]

Z. Cui, J. Wang, S. Gao, T. Guo, Y. Feng, B. Bai, Asymmetric gained deep image compression with continuous rate adaptation, in: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10532–10541

work page 2021
[24]

A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu, Conditional image generation with pixelcnn decoders, in: Proceedings of the 30th International Conference on Neural Informa- tion Processing Systems, NIPS’16, Curran Associates Inc., Red Hook, NY, USA, 2016, p. 4797–4805

work page 2016
[25]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: 31 J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

work page doi:10.18653/v1/n19-1423 2019
[26]

C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, L. J. Guibas, Volumetric and multi-view cnns for object classification on 3d data, in: 2016 IEEE ConferenceonComputerVisionandPatternRecognition(CVPR),2016, pp. 5648–5656. doi:10.1109/CVPR.2016.609

work page doi:10.1109/cvpr.2016.609 2016
[27]

S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2012) 221–231

work page 2012
[28]

T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, Y. Wang, End-to-end learnt image compression via non-local attention optimization and improved context modeling, IEEE Transactions on Image Processing 30 (2021) 3179–3191. doi:10.1109/TIP.2021.3058615

work page doi:10.1109/tip.2021.3058615 2021
[29]

Mentzer, E

F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, L. V. Gool, Conditional probability models for deep image compression, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394–4402. doi:10.1109/CVPR.2018.00462

work page doi:10.1109/cvpr.2018.00462 2018
[30]

Z. Tang, H. Wang, X. Yi, Y. Zhang, S. Kwong, C.-C. J. Kuo, Joint graph attention and asymmetric convolutional neural network for deep im- age compression, IEEE Transactions on Circuits and Systems for Video Technology 33 (1) (2023) 421–433. doi:10.1109/TCSVT.2022.3199472

work page doi:10.1109/tcsvt.2022.3199472 2023
[31]

Minnen, S

D. Minnen, S. Singh, Channel-wise autoregressive entropy models for learned image compression, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020, pp. 3339–3343

work page 2020
[32]

D. He, Z. Yang, W. Peng, R. Ma, H. Qin, Y. Wang, Elic: Efficient learned image compression with unevenly grouped space-channel con- textual adaptive coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5718–5727. 32

work page 2022
[34]

J. Lu, L. Zhang, X. Zhou, M. Li, W. Li, S. Gu, Learned image com- pression with dictionary-based entropy model, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12850– 12859

work page 2025
[35]

D. Feng, Z. Cheng, S. Wang, R. Wu, H. Hu, G. Lu, L. Song, Linear attention modeling for learned image compression, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 7623–7632

work page 2025
[36]

Ballé, V

J. Ballé, V. Laparra, E. P. Simoncelli, End-to-end optimized image compression, in: International Conference on Learning Representations, 2017. URLhttps://openreview.net/forum?id=rJxdQ3jeg

work page 2017
[37]

Zhang, S

Z. Zhang, S. Esenlik, Y. Wu, M. Wang, K. Zhang, L. Zhang, End- to-end learning-based image compression with a decoupled framework, IEEE Transactions on Circuits and Systems for Video Technology 34 (5) (2024) 3067–3081. doi:10.1109/TCSVT.2023.3313974

work page doi:10.1109/tcsvt.2023.3313974 2024
[38]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010

work page 2017
[39]

R. Zou, C. Song, Z. Zhang, The devil is in the details: Window-based attention for image compression, in: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition(CVPR),2022, pp. 17492–17501

work page 2022
[40]

M. Lu, P. Guo, H. Shi, C. Cao, Z. Ma, Transformer-based image com- pression, in: 2022 Data Compression Conference (DCC), 2022, pp. 469–

work page 2022
[41]

doi:10.1109/DCC52660.2022.00080. 33

work page doi:10.1109/dcc52660.2022.00080 2022
[42]

J. Wang, Q. Ling, Fdnet: Frequency decomposition network for learned image compression, IEEE Transactions on Circuits and Systems for Video Technology 34 (11) (2024) 11241–11255. doi:10.1109/TCSVT.2024.3415823

work page doi:10.1109/tcsvt.2024.3415823 2024
[43]

Z. Ge, S. Ma, W. Gao, J. Pan, C. Jia, Nlic: Non-uniform quantization-based learned image compression, IEEE Transactions on Circuits and Systems for Video Technology 34 (10) (2024) 9647–9663. doi:10.1109/TCSVT.2024.3401872

work page doi:10.1109/tcsvt.2024.3401872 2024
[44]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: First Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=tEYskw1VY2

work page 2024
[45]

Y. Xiao, Y. Xia, Pixel adaptive deep unfolding network with state space model for image deraining, Neural Networks (2025) 107845

work page 2025
[46]

Ballé, D

J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, Variational image compression with a scale hyperprior, in: International Conference on Learning Representations, 2018. URLhttps://openreview.net/forum?id=rkcQFMZRb

work page 2018
[47]

D. He, Y. Zheng, B. Sun, Y. Wang, H. Qin, Checkerboard context model for efficient learned image compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14771–14780

work page 2021
[48]

Y. Wang, H. Fu, Q. Cao, S. Wang, Z. Chen, F. Liang, S2lic: Learned image compression with the swinv2 block, adaptive channel-wise and global-inter attention context, Neural Networks (2025) 107590

work page 2025
[49]

D. Li, Y. Bai, K. Wang, J. Jiang, X. Liu, W. Gao, Groupedmixer: An entropy model with group-wise token-mixers for learned image compres- sion, IEEE Transactions on Circuits and Systems for Video Technology 34 (10) (2024) 9606–9619. doi:10.1109/TCSVT.2024.3395481

work page doi:10.1109/tcsvt.2024.3395481 2024
[50]

T. Yao, Y. Pan, Y. Li, C.-W. Ngo, T. Mei, Wave-vit: Unifying wavelet and transformers for visual representation learning, in: European Con- ference on Computer Vision, Springer, 2022, pp. 328–345. 34

work page 2022
[51]

S. Tang, H. Zhang, X. Gao, S. Yang, J. Leng, Z. Pan, H. Tian, A spatial-frequency hybrid restoration network for jpeg compressed image deblurring, Neural Networks (2025) 108059

work page 2025
[52]

Density modeling of images using a generalized normalization transformation

J. Ballé, V. Laparra, E. P. Simoncelli, Density modeling of images us- ing a generalized normalization transformation, in: Y. Bengio, Y. LeCun (Eds.), 4thInternationalConferenceonLearning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceed- ings, 2016. URLhttp://arxiv.org/abs/1511.06281

work page arXiv 2016
[53]

Cheng, H

Z. Cheng, H. Sun, M. Takeuchi, J. Katto, Learned image compression with discretized gaussian mixture likelihoods and attention modules, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[54]

H. Fu, J. Liang, Z. Fang, J. Han, F. Liang, G. Zhang, Weconvene: Learned image compression with wavelet-domain convolution and en- tropy model (2024). arXiv:2407.09983. URLhttps://arxiv.org/abs/2407.09983

work page arXiv 2024
[55]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, Ieee, 2009, pp. 248–255

work page 2009
[56]

Y. Li, K. Zhang, J. Liang, J. Cao, C. Liu, R. Gong, Y. Zhang, H. Tang, Y. Liu, D. Demandolx, et al., Lsdir: A large scale dataset for image restoration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1775–1787

work page 2023
[57]

Kodak, Kodak lossless true color image suite (photocd pcd0992), URL http://r0k

E. Kodak, Kodak lossless true color image suite (photocd pcd0992), URL http://r0k. us/graphics/kodak 6 (1993) 2

work page 1993
[58]

Asuni, A

N. Asuni, A. Giachetti, et al., Testimages: a large-scale archive for test- ing visual devices and basic image processing algorithms., in: STAG, 2014, pp. 63–70

work page 2014
[59]

Toderici, W

G. Toderici, W. Shi, R. Timofte, L. Theis, J. Ballé, E. Agustsson, N. Johnston, F. Mentzer, Clic: Workshop and challenge on learned image compression, in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit, 2020. 35

work page 2020
[60]

Jiang, R

W. Jiang, R. Wang, Mlic++: Linear complexity multi-reference entropy modeling for learned image compression, in: ICML 2023 Workshop Neu- ral Compression: From Information Theory to Applications, 2023. URLhttps://openreview.net/forum?id=hxIpcSoz2t

work page 2023
[61]

M. Han, S. Jiang, S. Li, X. Deng, M. Xu, C. Zhu, S. Gu, Causal con- text adjustment loss for learned image compression, Advances in Neural Information Processing Systems 37 (2024) 133231–133253. 36

work page 2024

[1] [1]

Wallace, The jpeg still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1) (1992) xviii–xxxiv

G. Wallace, The jpeg still picture compression standard, IEEE Transactions on Consumer Electronics 38 (1) (1992) xviii–xxxiv. doi:10.1109/30.125072

work page doi:10.1109/30.125072 1992

[2] [2]

Skodras, C

A. Skodras, C. Christopoulos, T. Ebrahimi, The jpeg 2000 still image compression standard, IEEE Signal Processing Magazine 18 (5) (2001) 36–58. doi:10.1109/79.952804

work page doi:10.1109/79.952804 2000

[3] [3]

Ginesu, M

G. Ginesu, M. Pintus, D. D. Giusto, Objective assessment of the webp image coding algorithm, Signal processing: image communication 27 (8) (2012) 867–874

work page 2012

[4] [4]

Bellard, Bpg image format,http://bellard.org/bpg/, accessed: Oct

F. Bellard, Bpg image format,http://bellard.org/bpg/, accessed: Oct. 30, 2018 (2014)

work page 2018

[5] [5]

Bross, Y.-K

B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, J.-R. Ohm, Overview of the versatile video coding (vvc) standard and its applica- tions, IEEE Transactions on Circuits and Systems for Video Technology 31 (10) (2021) 3736–3764

work page 2021

[6] [6]

Y. Xie, K. L. Cheng, Q. Chen, Enhanced invertible encoding for learned image compression, in: Proceedings of the ACM International Confer- ence on Multimedia, 2021, pp. 162–170

work page 2021

[7] [7]

J. Liu, H. Sun, J. Katto, Learned image compression with mixed transformer-cnn architectures, in: Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, 2023, pp. 14388– 14397. 29

work page 2023

[8] [8]

H. Li, S. Li, W. Dai, C. Li, J. Zou, H. Xiong, Frequency-aware trans- former for learned image compression, in: The Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=HKGQDDTuvZ

work page 2024

[9] [9]

C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal 27 (3) (1948) 379–423. doi:10.1002/j.1538- 7305.1948.tb01338.x

work page doi:10.1002/j.1538- 1948

[10] [10]

Auto-Encoding Variational Bayes

D.P.Kingma, M.Welling, Auto-EncodingVariationalBayes, in: 2ndIn- ternational Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. arXiv:http://arxiv.org/abs/1312.6114v10

work page internal anchor Pith review Pith/arXiv arXiv 2014

[11] [11]

Rissanen, G

J. Rissanen, G. Langdon, Universal modeling and coding, IEEE Transactions on Information Theory 27 (1) (1981) 12–23. doi:10.1109/TIT.1981.1056282

work page doi:10.1109/tit.1981.1056282 1981

[12] [12]

G. N. N. Martin, Range encoding: an algorithm for removing redun- dancy from a digitised message, in: Proc. Institution of Electronic and Radio Engineers International Conference on Video and Data Record- ing, Vol. 2, 1979

work page 1979

[13] [14]

Y. Zhu, Y. Yang, T. Cohen, Transformer-based transform coding, in: International Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=IDwN6xjHnK8

work page 2022

[14] [15]

S. Qin, J. Wang, Y. Zhou, B. Chen, T. Luo, B. An, T. Dai, S. Xia, Y. Wang, Mambavc: Learned visual compression with selective state spaces, arXiv preprint arXiv:2405.15413 (2024)

work page arXiv 2024

[15] [16]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 30

work page 2021

[16] [17]

H. Wei, Y. Zhou, Y. Jia, C. Ge, S. Anwar, A. Mian, A lightweight model for perceptual image compression via implicit priors, Neural Networks (2025) 108279

work page 2025

[17] [18]

L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, X. Wang, Vision mamba: Efficient visual representation learning with bidirectional state space model, in: Forty-first International Conference on Machine Learning, 2024

work page 2024

[18] [19]

Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, Y. Liu, VMamba: Visual state space model, in: The Thirty-eighth Annual Con- ference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=ZgtLQQR1K7

work page 2024

[19] [20]

F. Zeng, H. Tang, Y. Shao, S. Chen, L. Shao, Y. Wang, Mambaic: State space models for high-performance learned image compression, in: Pro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 18041–18050

work page 2025

[20] [21]

D.Minnen, J.Ballé, G.D.Toderici, Jointautoregressiveandhierarchical priors for learned image compression, Advances in neural information processing systems 31 (2018)

work page 2018

[21] [22]

J. Zhou, Multi-scale and context-adaptive entropy model for image com- pression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019

work page 2019

[22] [23]

Z. Cui, J. Wang, S. Gao, T. Guo, Y. Feng, B. Bai, Asymmetric gained deep image compression with continuous rate adaptation, in: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10532–10541

work page 2021

[23] [24]

A. v. d. Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, K. Kavukcuoglu, Conditional image generation with pixelcnn decoders, in: Proceedings of the 30th International Conference on Neural Informa- tion Processing Systems, NIPS’16, Curran Associates Inc., Red Hook, NY, USA, 2016, p. 4797–4805

work page 2016

[24] [25]

BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: 31 J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)...

work page doi:10.18653/v1/n19-1423 2019

[25] [26]

C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, L. J. Guibas, Volumetric and multi-view cnns for object classification on 3d data, in: 2016 IEEE ConferenceonComputerVisionandPatternRecognition(CVPR),2016, pp. 5648–5656. doi:10.1109/CVPR.2016.609

work page doi:10.1109/cvpr.2016.609 2016

[26] [27]

S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2012) 221–231

work page 2012

[27] [28]

T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, Y. Wang, End-to-end learnt image compression via non-local attention optimization and improved context modeling, IEEE Transactions on Image Processing 30 (2021) 3179–3191. doi:10.1109/TIP.2021.3058615

work page doi:10.1109/tip.2021.3058615 2021

[28] [29]

Mentzer, E

F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, L. V. Gool, Conditional probability models for deep image compression, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4394–4402. doi:10.1109/CVPR.2018.00462

work page doi:10.1109/cvpr.2018.00462 2018

[29] [30]

Z. Tang, H. Wang, X. Yi, Y. Zhang, S. Kwong, C.-C. J. Kuo, Joint graph attention and asymmetric convolutional neural network for deep im- age compression, IEEE Transactions on Circuits and Systems for Video Technology 33 (1) (2023) 421–433. doi:10.1109/TCSVT.2022.3199472

work page doi:10.1109/tcsvt.2022.3199472 2023

[30] [31]

Minnen, S

D. Minnen, S. Singh, Channel-wise autoregressive entropy models for learned image compression, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020, pp. 3339–3343

work page 2020

[31] [32]

D. He, Z. Yang, W. Peng, R. Ma, H. Qin, Y. Wang, Elic: Efficient learned image compression with unevenly grouped space-channel con- textual adaptive coding, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5718–5727. 32

work page 2022

[32] [34]

J. Lu, L. Zhang, X. Zhou, M. Li, W. Li, S. Gu, Learned image com- pression with dictionary-based entropy model, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12850– 12859

work page 2025

[33] [35]

D. Feng, Z. Cheng, S. Wang, R. Wu, H. Hu, G. Lu, L. Song, Linear attention modeling for learned image compression, in: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 7623–7632

work page 2025

[34] [36]

Ballé, V

J. Ballé, V. Laparra, E. P. Simoncelli, End-to-end optimized image compression, in: International Conference on Learning Representations, 2017. URLhttps://openreview.net/forum?id=rJxdQ3jeg

work page 2017

[35] [37]

Zhang, S

Z. Zhang, S. Esenlik, Y. Wu, M. Wang, K. Zhang, L. Zhang, End- to-end learning-based image compression with a decoupled framework, IEEE Transactions on Circuits and Systems for Video Technology 34 (5) (2024) 3067–3081. doi:10.1109/TCSVT.2023.3313974

work page doi:10.1109/tcsvt.2023.3313974 2024

[36] [38]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010

work page 2017

[37] [39]

R. Zou, C. Song, Z. Zhang, The devil is in the details: Window-based attention for image compression, in: Proceedings of the IEEE/CVF ConferenceonComputerVisionandPatternRecognition(CVPR),2022, pp. 17492–17501

work page 2022

[38] [40]

M. Lu, P. Guo, H. Shi, C. Cao, Z. Ma, Transformer-based image com- pression, in: 2022 Data Compression Conference (DCC), 2022, pp. 469–

work page 2022

[39] [41]

doi:10.1109/DCC52660.2022.00080. 33

work page doi:10.1109/dcc52660.2022.00080 2022

[40] [42]

J. Wang, Q. Ling, Fdnet: Frequency decomposition network for learned image compression, IEEE Transactions on Circuits and Systems for Video Technology 34 (11) (2024) 11241–11255. doi:10.1109/TCSVT.2024.3415823

work page doi:10.1109/tcsvt.2024.3415823 2024

[41] [43]

Z. Ge, S. Ma, W. Gao, J. Pan, C. Jia, Nlic: Non-uniform quantization-based learned image compression, IEEE Transactions on Circuits and Systems for Video Technology 34 (10) (2024) 9647–9663. doi:10.1109/TCSVT.2024.3401872

work page doi:10.1109/tcsvt.2024.3401872 2024

[42] [44]

A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, in: First Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=tEYskw1VY2

work page 2024

[43] [45]

Y. Xiao, Y. Xia, Pixel adaptive deep unfolding network with state space model for image deraining, Neural Networks (2025) 107845

work page 2025

[44] [46]

Ballé, D

J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston, Variational image compression with a scale hyperprior, in: International Conference on Learning Representations, 2018. URLhttps://openreview.net/forum?id=rkcQFMZRb

work page 2018

[45] [47]

D. He, Y. Zheng, B. Sun, Y. Wang, H. Qin, Checkerboard context model for efficient learned image compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14771–14780

work page 2021

[46] [48]

Y. Wang, H. Fu, Q. Cao, S. Wang, Z. Chen, F. Liang, S2lic: Learned image compression with the swinv2 block, adaptive channel-wise and global-inter attention context, Neural Networks (2025) 107590

work page 2025

[47] [49]

D. Li, Y. Bai, K. Wang, J. Jiang, X. Liu, W. Gao, Groupedmixer: An entropy model with group-wise token-mixers for learned image compres- sion, IEEE Transactions on Circuits and Systems for Video Technology 34 (10) (2024) 9606–9619. doi:10.1109/TCSVT.2024.3395481

work page doi:10.1109/tcsvt.2024.3395481 2024

[48] [50]

T. Yao, Y. Pan, Y. Li, C.-W. Ngo, T. Mei, Wave-vit: Unifying wavelet and transformers for visual representation learning, in: European Con- ference on Computer Vision, Springer, 2022, pp. 328–345. 34

work page 2022

[49] [51]

S. Tang, H. Zhang, X. Gao, S. Yang, J. Leng, Z. Pan, H. Tian, A spatial-frequency hybrid restoration network for jpeg compressed image deblurring, Neural Networks (2025) 108059

work page 2025

[50] [52]

Density modeling of images using a generalized normalization transformation

J. Ballé, V. Laparra, E. P. Simoncelli, Density modeling of images us- ing a generalized normalization transformation, in: Y. Bengio, Y. LeCun (Eds.), 4thInternationalConferenceonLearning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceed- ings, 2016. URLhttp://arxiv.org/abs/1511.06281

work page arXiv 2016

[51] [53]

Cheng, H

Z. Cheng, H. Sun, M. Takeuchi, J. Katto, Learned image compression with discretized gaussian mixture likelihoods and attention modules, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020

[52] [54]

H. Fu, J. Liang, Z. Fang, J. Han, F. Liang, G. Zhang, Weconvene: Learned image compression with wavelet-domain convolution and en- tropy model (2024). arXiv:2407.09983. URLhttps://arxiv.org/abs/2407.09983

work page arXiv 2024

[53] [55]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, Ieee, 2009, pp. 248–255

work page 2009

[54] [56]

Y. Li, K. Zhang, J. Liang, J. Cao, C. Liu, R. Gong, Y. Zhang, H. Tang, Y. Liu, D. Demandolx, et al., Lsdir: A large scale dataset for image restoration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1775–1787

work page 2023

[55] [57]

Kodak, Kodak lossless true color image suite (photocd pcd0992), URL http://r0k

E. Kodak, Kodak lossless true color image suite (photocd pcd0992), URL http://r0k. us/graphics/kodak 6 (1993) 2

work page 1993

[56] [58]

Asuni, A

N. Asuni, A. Giachetti, et al., Testimages: a large-scale archive for test- ing visual devices and basic image processing algorithms., in: STAG, 2014, pp. 63–70

work page 2014

[57] [59]

Toderici, W

G. Toderici, W. Shi, R. Timofte, L. Theis, J. Ballé, E. Agustsson, N. Johnston, F. Mentzer, Clic: Workshop and challenge on learned image compression, in: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit, 2020. 35

work page 2020

[58] [60]

Jiang, R

W. Jiang, R. Wang, Mlic++: Linear complexity multi-reference entropy modeling for learned image compression, in: ICML 2023 Workshop Neu- ral Compression: From Information Theory to Applications, 2023. URLhttps://openreview.net/forum?id=hxIpcSoz2t

work page 2023

[59] [61]

M. Han, S. Jiang, S. Li, X. Deng, M. Xu, C. Zhu, S. Gu, Causal con- text adjustment loss for learned image compression, Advances in Neural Information Processing Systems 37 (2024) 133231–133253. 36

work page 2024