ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression
Pith reviewed 2026-06-29 09:53 UTC · model grok-4.3
The pith
Channel-wise wavelet transforms in transformer attention and entropy coding improve learned image compression performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed Channel-wise Wavelet-Domain Transformer Attention (ChWDTA) computes Q/K/V on channel-wise wavelet-transformed features while preserving windowed attention, and the channel-wise wavelet packet (ChWP) decomposition enables eight-slice autoregressive entropy modeling, resulting in BD-rate reductions of -17.82%, -19.15%, and -22.56% on Kodak, CLIC, and Tecnick datasets respectively.
What carries the argument
Channel-wise Wavelet-Domain Transformer Block (ChWDTB) that applies wavelet transform before Q/K/V projections and inverse after attention output.
If this is right
- The scheme achieves significant rate-distortion improvements on multiple test sets.
- Using single slice per subband retains most gains with lower complexity.
- Wavelet transforms provide an advantage in CNN-transformer based learned image compression.
- Channel-wise wavelet packet decomposition fits well with slice-based autoregressive entropy models.
Where Pith is reading between the lines
- Similar wavelet integrations could be tested in video compression frameworks.
- The sparsification of channel covariance might reduce attention computation costs in other domains.
- Further splits or different wavelet types could be explored for additional gains.
Load-bearing premise
That the channel-wise wavelet transform applied before Q/K/V projections and the four-subband wavelet-packet split will improve rate-distortion performance on the chosen test sets without introducing reconstruction artifacts or requiring extensive hyper-parameter retuning.
What would settle it
Removing the wavelet transforms from the ChWDTA and ChWP modules and re-measuring BD-rate on the Kodak, CLIC, and Tecnick sets; loss of the reported savings would support the claim's dependence on the wavelet components.
Figures
read the original abstract
State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Channel-wise Wavelet-Domain Transformer Attention (ChWDTA) and Channel-wise Wavelet Packet (ChWP) decomposition for learned image compression. ChWDTA applies a channel-wise wavelet transform before Q/K/V projections in windowed spatial self-attention while preserving the spatial tokenization pattern; ChWP splits channels into four subbands for improved slice-based autoregressive entropy modeling (yielding eight slices). The central claim is that these yield BD-rate reductions of -17.82% on Kodak, -19.15% on CLIC Professional Validation, and -22.56% on Tecnick, confirming the advantage of wavelet transforms in CNN-transformer LIC backbones.
Significance. If the reported gains are shown to arise specifically from the wavelet mechanisms rather than capacity or training differences, the work would provide a concrete demonstration of how wavelet bases can be inserted into transformer attention and entropy modules to improve rate-distortion performance while retaining efficient windowed spatial attention.
major comments (2)
- [Abstract] Abstract: The BD-rate reductions (-17.82%, -19.15%, -22.56%) are stated without any experimental protocol, baseline references, ablation studies, training details, or error bars. This renders the central empirical claim unverifiable and prevents assessment of whether the gains are attributable to ChWDTA/ChWP.
- [Abstract] Abstract: The claim that the channel-wise wavelet transform sparsifies channel covariance (and thereby improves attention) is presented without supporting analysis; no covariance visualizations, comparisons against random orthogonal transforms or 1x1 convolutions, or ablations isolating the wavelet from the eight-slice change are supplied. Because learned channels lack inherent spatial locality, the a-priori advantage of a wavelet basis remains unestablished.
minor comments (1)
- [Abstract] Abstract: The transition from four subbands to eight slices for entropy coding would benefit from an explicit diagram or equation clarifying the slicing procedure.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below, clarifying that experimental details appear in the main text while agreeing that additional supporting analyses would strengthen the claims.
read point-by-point responses
-
Referee: [Abstract] Abstract: The BD-rate reductions (-17.82%, -19.15%, -22.56%) are stated without any experimental protocol, baseline references, ablation studies, training details, or error bars. This renders the central empirical claim unverifiable and prevents assessment of whether the gains are attributable to ChWDTA/ChWP.
Authors: The abstract is kept concise per standard practice. Full experimental protocol (training on Vimeo-90K, Adam optimizer, etc.), baselines (Cheng et al. 2020, Minnen et al. 2020, and recent transformer LIC methods), ablation studies (including the single-slice ChWP variant that retains most gains), and training details are provided in Sections 3 and 4. Results are reported as BD-rate averages; error bars can be added in revision. The single-slice experiment helps isolate contributions from the entropy modeling component. revision: partial
-
Referee: [Abstract] Abstract: The claim that the channel-wise wavelet transform sparsifies channel covariance (and thereby improves attention) is presented without supporting analysis; no covariance visualizations, comparisons against random orthogonal transforms or 1x1 convolutions, or ablations isolating the wavelet from the eight-slice change are supplied. Because learned channels lack inherent spatial locality, the a-priori advantage of a wavelet basis remains unestablished.
Authors: The design draws on the established decorrelating property of wavelets, applied channel-wise before Q/K/V to reduce redundancy in the projections while preserving windowed spatial attention. We acknowledge that the manuscript does not include covariance visualizations, comparisons to random orthogonal bases or 1x1 convolutions, or an ablation that fully isolates the wavelet from the slice count change. We will add these analyses (covariance heatmaps and controlled ablations) in the revised version to directly substantiate the claim. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper introduces ChWDTA and ChWP architectural components and reports empirical BD-rate gains on external standard test sets (Kodak, CLIC Professional Validation, Tecnick). No equations, self-citations, or claims reduce the reported gains to quantities fitted on the same data, self-defined quantities, or load-bearing prior self-work. The derivation consists of proposed transforms evaluated against independent benchmarks, making the result self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Learned image com- pression with discretized gaussian mixture likelihoods and attention modules,
Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image com- pression with discretized gaussian mixture likelihoods and attention modules,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7939–7948
2020
-
[2]
Learned image compression with gaussian- laplacian-logistic mixture model and concatenated residual modules,
H. Fu, F. Liang, J. Lin, B. Li, M. Akbari, J. Liang, G. Zhang, D. Liu, C. Tu, and J. Han, “Learned image compression with gaussian- laplacian-logistic mixture model and concatenated residual modules,” IEEE Transactions on Image Processing, vol. 32, pp. 2063–2076, 2023
2063
-
[3]
Learned image compression with mixed transformer-cnn architectures,
J. Liu, H. Sun, and J. Katto, “Learned image compression with mixed transformer-cnn architectures,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 14 388–14 397
2023
-
[4]
Frequency- aware transformer for learned image compression,
H. Li, S. Li, W. Dai, C. Li, J. Zou, and H. Xiong, “Frequency- aware transformer for learned image compression,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=HKGQDDTuvZ
2024
-
[5]
Lin- ear attention modeling for learned image compression,
D. Feng, Z. Cheng, S. Wang, R. Wu, H. Hu, G. Lu, and L. Song, “Lin- ear attention modeling for learned image compression,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1–10
2025
-
[6]
Learned image compression with dictionary-based entropy model,
J. Lu, L. Zhang, X. Zhou, M. Li, W. Li, and S. Gu, “Learned image compression with dictionary-based entropy model,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 12 850–12 859
2025
-
[7]
Learned image compression with hierarchical progressive context modeling,
Y . Li, H. Zhang, L. Li, and D. Liu, “Learned image compression with hierarchical progressive context modeling,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 18 834–18 843
2025
-
[8]
Variational image compression with a scale hyperprior,
J. Ball ´e, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” inInternational Conference on Learning Representations, 2018, pp. 1–23
2018
-
[9]
Joint autoregressive and hierarchical priors for learned image compression,
D. Minnen, J. Ball ´e, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” inAdvances in Neural Information Processing Systems, 2018, pp. 10 794–10 803
2018
-
[10]
Mlic: Multi- reference entropy model for learned image compression,
W. Jiang, J. Yang, Y . Zhai, P. Ning, F. Gao, and R. Wang, “Mlic: Multi- reference entropy model for learned image compression,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7618–7627
2023
-
[11]
Elic: Efficient learned image compression with unevenly grouped space- channel contextual adaptive coding,
D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y . Wang, “Elic: Efficient learned image compression with unevenly grouped space- channel contextual adaptive coding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 5718–5727
2022
-
[12]
Mambavc: Learned visual compression with selective state spaces,
S. Qin, J. Wang, Y . Zhou, B. Chen, T. Luo, B. An, T. Dai, S. Xia, and Y . Wang, “Mambavc: Learned visual compression with selective state spaces,” 2024. [Online]. Available: https://arxiv.org/abs/2405.15413
-
[13]
Checkerboard context model for efficient learned image compression,
D. He, Y . Zheng, B. Sun, Y . Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 14 771–14 780
2021
-
[14]
Entroformer: A transformer-based entropy model for learned image compression,
Y . Qian, M. Lin, X. Sun, Z. Tan, and R. Jin, “Entroformer: A transformer-based entropy model for learned image compression,” in International Conference on Learning Representations, May 2022
2022
-
[15]
End-to-end optimized versatile image compression with wavelet-like transform,
H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1247– 1263, 2022. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13
2022
-
[16]
Image compression with learned lifting- based dwt and learned tree-based entropy models,
U. B. Sahin and F. Kamisli, “Image compression with learned lifting- based dwt and learned tree-based entropy models,”Multimedia Systems, vol. 29, no. 6, pp. 3369–3384, 2023
2023
-
[17]
aiwave: V olumetric image compression with 3-d trained affine wavelet-like transform,
D. Xue, H. Ma, L. Li, D. Liu, and Z. Xiong, “aiwave: V olumetric image compression with 3-d trained affine wavelet-like transform,”IEEE Transactions on Medical Imaging, vol. 42, no. 3, pp. 606–618, 2023
2023
-
[18]
Weconvene: Learned image compression with wavelet-domain convolution and en- tropy model,
H. Fu, J. Liang, Z. Fang, J. Han, F. Liang, and G. Zhang, “Weconvene: Learned image compression with wavelet-domain convolution and en- tropy model,” inEuropean Conference on Computer Vision (ECCV), 2024, pp. 37–53
2024
-
[19]
Channel-wise autoregressive entropy models for learned image compression,
D. Minnen and S. Singh, “Channel-wise autoregressive entropy models for learned image compression,” in2020 IEEE International Conference on Image Processing (ICIP), 2020, pp. 3339–3343
2020
-
[20]
End-to-end optimized image compression,
J. Ball ´e, V . Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” inInternational Conference on Learning Representations, 2017
2017
-
[21]
Asymmetric learned image compression with multi-scale residual block, importance scaling, and post-quantization filtering,
H. Fu, F. Liang, J. Liang, B. Li, G. Zhang, and J. Han, “Asymmetric learned image compression with multi-scale residual block, importance scaling, and post-quantization filtering,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 4309–4321, 2023
2023
-
[22]
Learned multi-resolution variable-rate image compression with octave-based residual blocks,
M. Akbari, J. Liang, J. Han, and C. Tu, “Learned multi-resolution variable-rate image compression with octave-based residual blocks,” IEEE Transactions on Multimedia, vol. 23, pp. 3013–3021, Mar. 2021
2021
-
[23]
Learning context-based nonlocal entropy modeling for image compression,
M. Li, K. Zhang, J. Li, W. Zuo, R. Timofte, and D. Zhang, “Learning context-based nonlocal entropy modeling for image compression,”IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 3, pp. 1132–1145, 2023
2023
-
[24]
End-to-end optimized versatile image compression with wavelet-like transform,
H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1247– 1263, 2022
2022
-
[25]
Cassic: Towards content-adaptive state-space models for learned image compression,
S. Qin, J. Wang, Y . Zhou, B. Chen, T. Luo, B. An, T. Dai, S.-T. Xia, and Y . Wang, “Cassic: Towards content-adaptive state-space models for learned image compression,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 15 727–15 736
2025
-
[26]
MLIC++: Linear complexity multi-reference entropy modeling for learned image compression,
W. Jiang, J. Yang, Y . Zhai, F. Gao, and R. Wang, “MLIC++: Linear complexity multi-reference entropy modeling for learned image compression,” 2023, accepted to the ICML 2023 Neural Compression Workshop and ACM TOMM 2025. [Online]. Available: https://arxiv.org/abs/2307.15421
-
[27]
MLICv2: Enhanced multi-reference entropy modeling for learned image compression,
W. Jiang, Y . Zhai, J. Yang, F. Gao, and R. Wang, “MLICv2: Enhanced multi-reference entropy modeling for learned image compression,”
-
[28]
Available: https://arxiv.org/abs/2504.19119
[Online]. Available: https://arxiv.org/abs/2504.19119
-
[29]
What Matters in Practical Learned Image Compression
K. Tatwawadi, P. Rahimzadeh, Z. Sun, Z. Chen, Z. Yang, S. Nair, D. Hasteer, and O. Rippel, “What matters in practical learned image compression,” 2026. [Online]. Available: https://arxiv.org/abs/ 2605.05148
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
Spatial competition for low-complexity learned image compression,
T. Blard, P. Philippe, T. Ladune, X. Jiang, and O. D ´eforges, “Spatial competition for low-complexity learned image compression,” 2026, accepted to ICIP 2026. [Online]. Available: https://arxiv.org/abs/2605. 13243
2026
-
[31]
Cool-chic 5.0: Faster Encoding and Inter-Feature Entropy Modeling for Overfitted Image Compression
T. Ladune, P. Philippe, P. Jaffuer, T. Blard, S. Kervadec, F. Henry, and G. Clare, “Cool-chic 5.0: Faster encoding and inter-feature entropy modeling for overfitted image compression,” 2026. [Online]. Available: https://arxiv.org/abs/2605.02726
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
The jpeg 2000 still im- age compression standard,
A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still im- age compression standard,”IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36–58, 2001
2000
-
[33]
D. S. Taubman and M. W. Marcellin,JPEG2000: Image Compression Fundamentals, Standards and Practice, ser. The Springer International Series in Engineering and Computer Science. New York, NY: Springer, 2002, vol. 642
2002
-
[34]
Frequency disentangled features in neural image compression,
A. Zafari, A. Khoshkhahtinat, P. Mehta, M. S. Ebrahimi Saadabadi, M. Akyash, and N. M. Nasrabadi, “Frequency disentangled features in neural image compression,” in2023 IEEE International Conference on Image Processing (ICIP), 2023, pp. 2815–2819
2023
-
[35]
Channel-wise feature decorrelation for enhanced learned image compression,
F. Pakdaman and M. Gabbouj, “Channel-wise feature decorrelation for enhanced learned image compression,” 2024. [Online]. Available: https://arxiv.org/abs/2403.10936
-
[36]
On disentangled training for nonlinear transform in learned image compression,
H. Li, S. Li, W. Dai, M. Cao, N. Kan, C. Li, J. Zou, and H. Xiong, “On disentangled training for nonlinear transform in learned image compression,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=U67J0QNtzo
2025
-
[37]
Bi-level spatial and channel-aware transformer for learned image compression,
H. Soltani and E. Ghasemi, “Bi-level spatial and channel-aware transformer for learned image compression,” 2024. [Online]. Available: https://arxiv.org/abs/2408.03842
-
[38]
Window-based channel attention for wavelet-enhanced learned image compression,
H. Xu, B. Hai, Y . Tang, and Z. He, “Window-based channel attention for wavelet-enhanced learned image compression,” 2024. [Online]. Available: https://arxiv.org/abs/2409.14090
-
[39]
H. Fu, J. Liang, F. Liang, Z. Fang, G. Zhang, and J. Han, “3dm-weconvene: Learned image compression with 3d multi-level wavelet-domain convolution and entropy model,” 2025. [Online]. Available: https://arxiv.org/abs/2504.04658
-
[40]
Entropy-based algorithms for best basis selection,
R. Coifman and M. Wickerhauser, “Entropy-based algorithms for best basis selection,”IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 713–718, 1992
1992
-
[41]
The lifting scheme: A custom-design construction of biorthogonal wavelets,
W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,”Applied and Computational Harmonic Analysis, vol. 3, no. 2, pp. 186–200, 1996
1996
-
[42]
The devil is in the details: Window- based attention for image compression,
R. Zou, C. Song, and Z. Zhang, “The devil is in the details: Window- based attention for image compression,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 17 492–17 501
2022
-
[43]
Openimages: A public dataset for large-scale multi-label and multi-class image classification
T. D. Ivan Krasin, N. Alldrin, A. Veit, S. Abu-El-Haija, S. Belongie, D. Cai, Z. Feng, V . Ferrari, V . Gomes, A. Gupta, D. Narayanan, C. Sun, G. Chechik, and K. Murphy., “Openimages: A public dataset for large-scale multi-label and multi-class image classification.” https: //github.com/openimages, 2016. [43]Kodak PhotoCD dataset, http://r0k.us/graphics/k...
2016
-
[44]
TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms
N. Asuni and A. Giachetti, “TESTIMAGES: a Large-scale Archive for Testing Visual Devices and Basic Image Processing Algorithms.” The Eurographics Association, 2014
2014
-
[45]
2021 workshop and challenge on learned image compression (clic)
G. Toderici, R. Timofte, J. Balle, E. Agustsson, N. Johnston, and F. Mentzer, “2021 workshop and challenge on learned image compression (clic).” [Online]. Available: http://www.compression.cc
2021
-
[46]
Multiscale structural similarity for image quality assessment,
Z. Wang, E. Simoncelli, and A. Bovik, “Multiscale structural similarity for image quality assessment,” inThe Thirty-Seventh Asilomar Confer- ence on Signals, Systems Computers, 2003, vol. 2, 2003, pp. 1398–1402
2003
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.