
arxiv: 2605.22061 · v1 · pith:D5GKGTUPnew · submitted 2026-05-21 · 💻 cs.CV

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

Pith reviewed 2026-05-22 08:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords distributed image compression · multimodal side information · low bitrate · diffusion decoder · feature mask generator · perceptual quality · stereo reconstruction · VQ-VAE

The pith

Multimodal side information from correlated images enables high perceptual quality in distributed image compression below 0.1 bpp.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that existing distributed image compression methods lose fine local details and global context at extremely low bitrates because they underuse side information. By extracting text from correlated views to condition a diffusion decoder and training a feature-mask generator through multimodal alignment, the approach pulls precise details from lossless side information while restoring category cues in the quantized primary stream. A sympathetic reader would care because bandwidth-limited multi-view systems, such as vehicle cameras, could then transmit far less data yet still produce usable reconstructions for perception tasks.

Core claim

The MDIC framework, for the first time, incorporates side information multimodally into the DIC paradigm: a text-to-image diffusion decoder conditioned on textual descriptions captures shared global semantics, while a feature-mask generator supervised by a multimodal fine-grained alignment task extracts fine details from losslessly transmitted visual side information and regulates clustered features from VQ-VAE embeddings to compensate for information lost under extreme primary-image compression.

What carries the argument

Feature-mask generator supervised by multimodal fine-grained alignment, which both guides detail extraction from visual side information and regulates feature clustering from quantized embeddings.
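
As a rough illustration of what a feature mask can do, the sketch below blends two feature streams with a soft mask. The blending rule, the `fuse` helper, and the toy values are all assumptions made for illustration; the page does not describe the paper's actual masking operation.

```python
def fuse(side_features, primary_features, mask):
    """Blend two feature streams element-wise.

    Hypothetical rule (not from the paper): a mask value of 1 takes the
    feature from the side view, 0 takes it from the primary stream, and
    values in between interpolate linearly.
    """
    return [m * s + (1.0 - m) * p
            for s, p, m in zip(side_features, primary_features, mask)]

# Toy feature vectors and mask, chosen only to show the three regimes.
side = [0.8, 0.1, 0.5]
primary = [0.2, 0.9, 0.5]
mask = [1.0, 0.0, 0.5]

fused = fuse(side, primary, mask)
print(fused)
```

In this toy setup the first element comes entirely from the side view, the second entirely from the primary stream, and the third is an even mix of the two.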

If this is right

  • Reconstructed images maintain semantic consistency in local details drawn from side views.
  • Global perceptual quality rises through diffusion decoding guided by shared textual semantics.
  • Category information missing from extreme quantization of the primary image is restored via regulated clustered features.
  • State-of-the-art perceptual results are reported on KITTI Stereo and Cityscapes at bitrates under 0.1 bpp.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multimodal conditioning could be tested on video sequences where temporal side information is available at negligible extra cost.
  • Sensor networks with limited uplink capacity might adopt similar text-plus-mask side channels to keep reconstruction fidelity while cutting transmitted bits.
  • End-to-end training that jointly optimizes text extraction and mask generation could be explored to reduce reliance on separate pretrained modules.

Load-bearing premise

Textual side information extracted from correlated images accurately captures shared global semantics, and the multimodal alignment task successfully supervises the mask generator without introducing inconsistencies between text-guided global content and mask-guided local detail.

What would settle it

On the KITTI Stereo dataset at 0.05 bpp, measure perceptual metrics such as FID or LPIPS (lower is better for both) for MDIC versus prior DIC baselines; the absence of a clear improvement would undercut the claimed benefit of the multimodal side-information scheme.
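
To get a feel for how tight a 0.05 bpp budget is, the sketch below computes the rate of a VQ index map sent with fixed-length codes and no entropy coding. The image size, downsampling factor, and codebook size are hypothetical and not taken from the paper.

```python
import math

def vq_bits_per_pixel(height, width, downsample, codebook_size):
    """Bits per pixel of a fixed-length-coded VQ index map (no entropy coding)."""
    tokens = (height // downsample) * (width // downsample)
    bits = tokens * math.log2(codebook_size)
    return bits / (height * width)

# Hypothetical settings: a 320x1216 crop, 16x spatial downsampling,
# and a 1024-entry codebook (10 bits per index).
bpp = vq_bits_per_pixel(height=320, width=1216, downsample=16, codebook_size=1024)
print(f"{bpp:.4f} bpp")  # prints 0.0391 bpp
```

Under these assumptions each index covers a 16x16 patch and costs 10 bits, so the stream already uses most of a 0.05 bpp budget before anything else is sent.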

Figures

Figures reproduced from arXiv: 2605.22061 by Cheng Tan, Guojun Xu, Jianwen Xiang, Junwei Zhou, Mingyang Zhang, Yanchao Yang.

Figure 1: (a) The existing VAE-based distributed coding frame…
Figure 2: Comparison of reconstruction performance by joint coding method BiSIC […
Figure 3: Different compressors for multi-view images.
Figure 4: Overview of the proposed MDIC framework. The input view…
Figure 5: Visual Mask Generation Module in Text Supervision
Figure 6: Perception evaluation of MDIC and other DIC, SIC, and LIC methods on KITTI Stereo and Cityscapes datasets.
Figure 7: Distortion evaluation of MDIC and other DIC, SIC, and LIC methods on KITTI Stereo and Cityscapes datasets.
Figure 8: Visualization results on KITTI Stereo and Cityscapes datasets. Each row compares two methods (three images per method).
Figure 9: The visual comparison results after ablating the…

Original abstract

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.
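
The quantized VQ-VAE embeddings mentioned in the abstract rest on nearest-neighbour codebook lookup. The sketch below shows that lookup in plain Python with a toy codebook; the `quantize` helper and all values are illustrative assumptions and do not reflect the paper's encoder or codebook.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

# Toy 2-D codebook and latent vectors (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]

indices = quantize(latents, codebook)          # what would be transmitted
reconstructed = [codebook[k] for k in indices]  # what the decoder looks up
print(indices, reconstructed)
```

Only the integer indices need to be sent; the decoder recovers the quantized vectors from its own copy of the codebook, which is why the detail lost at this step has to come from somewhere else, such as side information.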

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MDIC, a multimodal distributed image compression framework for extremely low bitrates (<0.1 bpp). It extracts textual side information from correlated images to condition a text-to-image diffusion decoder for global semantics, and introduces a feature-mask generator supervised by a multimodal fine-grained alignment task. The mask guides extraction of fine-grained details from lossless visual side information and regulates clustered VQ-VAE embeddings from the quantized primary image to preserve semantic consistency and compensate for lost category information. Experiments on KITTI Stereo and Cityscapes are claimed to demonstrate state-of-the-art perceptual quality.

Significance. If validated, the work could meaningfully advance distributed image compression by incorporating multimodal (textual and visual) side information, addressing limitations in exploiting global context and local details at very low bitrates relevant to multi-view scenarios such as stereo vision or autonomous driving. The approach of using diffusion-based decoding and alignment-supervised masking offers a plausible path to better perceptual reconstructions, though its impact hinges on rigorous demonstration that the alignment mechanism reliably enforces consistency rather than introducing new artifacts.

major comments (2)
  1. [Method (feature-mask generator and alignment supervision)] Method description of the feature-mask generator and multimodal fine-grained alignment task: The central claim requires that this supervision correctly strengthens exploitation of visual side information while preserving semantic consistency. At <0.1 bpp the primary image is reduced to clustered VQ-VAE embeddings; any mismatch between text-conditioned global semantics and mask-guided local details risks hallucinated textures or lost object boundaries. The manuscript provides no quantitative evidence (e.g., consistency metrics, error-mode analysis, or ablation removing the alignment loss) that the supervision enforces consistency rather than trading one error mode for another.
  2. [Experiments] Experiments section: The abstract asserts SOTA perceptual quality on KITTI Stereo and Cityscapes, yet the provided description contains no quantitative metrics, error bars, ablation details on the alignment task, dataset splits, or comparisons isolating the contribution of textual vs. visual side information. This absence prevents verification that the multimodal components deliver the claimed improvements in fine-grained detail preservation and global quality.
minor comments (2)
  1. [Method] Clarify the exact bitrate range and VQ-VAE clustering details used in the primary image path to allow reproduction of the <0.1 bpp regime.
  2. [Method] Specify how textual side information is extracted (e.g., model, prompt strategy) and transmitted, including any bitrate overhead.
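
To put the bitrate-overhead question in numbers, the sketch below computes the cost of sending a short caption as raw UTF-8 alongside an image. It assumes the text would be transmitted at all, which the page does not establish; the caption and image size are hypothetical.

```python
def text_overhead_bpp(caption, height, width):
    """Bits per pixel added by sending a caption as raw UTF-8 bytes."""
    return 8 * len(caption.encode("utf-8")) / (height * width)

# Hypothetical caption and image size, not taken from the paper.
caption = "a street with parked cars and a cyclist"
overhead = text_overhead_bpp(caption, height=320, width=1216)
print(f"{overhead:.6f} bpp")
```

For a caption of this length on an image of this size the overhead is well below 0.01 bpp, so the answer to the referee's question matters mostly when captions are long or images are small.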

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and revised the manuscript to provide additional evidence and experimental details as requested.

Point-by-point responses
  1. Referee: [Method (feature-mask generator and alignment supervision)] Method description of the feature-mask generator and multimodal fine-grained alignment task: The central claim requires that this supervision correctly strengthens exploitation of visual side information while preserving semantic consistency. At <0.1 bpp the primary image is reduced to clustered VQ-VAE embeddings; any mismatch between text-conditioned global semantics and mask-guided local details risks hallucinated textures or lost object boundaries. The manuscript provides no quantitative evidence (e.g., consistency metrics, error-mode analysis, or ablation removing the alignment loss) that the supervision enforces consistency rather than trading one error mode for another.

    Authors: We acknowledge that the original manuscript did not include explicit quantitative consistency metrics, error-mode analysis, or an ablation study isolating the alignment loss. The feature-mask generator is supervised by the multimodal fine-grained alignment task specifically to enforce semantic consistency between the text-conditioned global semantics and the mask-guided local details extracted from visual side information. To directly address this concern, the revised manuscript now includes an ablation study removing the alignment supervision, along with qualitative analysis of error modes (e.g., hallucinated textures and boundary preservation) and a consistency metric based on feature alignment scores between reconstructed and side-information regions. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts SOTA perceptual quality on KITTI Stereo and Cityscapes, yet the provided description contains no quantitative metrics, error bars, ablation details on the alignment task, dataset splits, or comparisons isolating the contribution of textual vs. visual side information. This absence prevents verification that the multimodal components deliver the claimed improvements in fine-grained detail preservation and global quality.

    Authors: We agree that the experiments section required more comprehensive quantitative reporting. The revised manuscript now includes tables with perceptual metrics (FID, LPIPS, and user-study scores) reported with standard deviations across multiple runs, explicit dataset splits for KITTI Stereo and Cityscapes, ablation results on the alignment task, and direct comparisons isolating the contributions of textual side information (via the diffusion decoder) versus visual side information (via the feature-mask generator). These additions confirm the multimodal components' role in the observed improvements. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework validated on external datasets

full rationale

The paper introduces the MDIC framework with components such as a text-to-image diffusion decoder and a feature-mask generator supervised by multimodal alignment, but the provided text contains no equations, derivations, or first-principles results. All performance claims are presented as outcomes of experiments on the independent KITTI Stereo and Cityscapes datasets rather than any self-referential fitting, parameter prediction, or self-citation chain. No load-bearing step reduces by construction to its own inputs, satisfying the criteria for a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities with independent evidence; the MDIC framework and its components are presented as novel but without detailed mathematical grounding or external validation.

pith-pipeline@v0.9.0 · 5794 in / 1060 out tokens · 38763 ms · 2026-05-22T08:01:53.181681+00:00 · methodology

