
arxiv: 2412.18158 · v2 · pith:44ETUYDNnew · submitted 2024-12-24 · 💻 cs.CV · eess.IV

Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

Pith reviewed 2026-05-23 06:38 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords universal image compression · semantic disentanglement · LLM codebooks · generative diffusion · task-aware coding · human-centric compression · machine vision compression

The pith

A universal image codec disentangles task-specific semantics with LLM-generated codebooks and reconstructs via diffusion to serve both human perception and machine tasks without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current learned image compression methods specialize either for human viewing or for narrow machine tasks, forcing separate models and expensive retraining whenever a new application appears. The paper introduces UniCodec to solve this by performing semantic disentanglement at the encoder: a grounding model uses pre-generated task-specific label codebooks from an LLM to compress only the image regions relevant to the current task. At the decoder, compositional generation combines the compact components with priors from a generative diffusion model to produce a single reconstruction that supplies both rich visual detail and task-precise features. Switching tasks requires only loading a different codebook, removing the need for retraining. A sympathetic reader would care because the approach promises one efficient system that can handle the expanding range of human and machine uses for the same image data.

Core claim

UniCodec establishes that semantic disentanglement at the encoder, driven by pre-generated task-specific label codebooks from an LLM and applied via a grounding model, combined with compositional generation at the decoder using generative diffusion priors, produces a universal codec that delivers high-quality reconstructions optimized for both human perception and machine vision tasks across arbitrary applications without any task-specific retraining.

What carries the argument

LLM-generated task-specific label codebooks used by a grounding model for task-aware disentanglement at the encoder, paired with generative diffusion for compositional reconstruction at the decoder.
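
A toy sketch of how codebook switching could drive region selection, assuming a grounding model that returns labeled boxes. Everything here is hypothetical: the `CODEBOOKS` contents, the `Region` type, and the `ground` stand-in (a real grounding model would take the image and the codebook labels as text prompts rather than pre-labeled detections). The paper's actual interfaces are not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    label: str
    box: tuple  # (x0, y0, x1, y1) in pixels


# Hypothetical codebooks. The paper generates these offline with an LLM;
# the task names and labels below are made up for illustration.
CODEBOOKS = {
    "traffic_monitoring": {"car", "bus", "traffic light"},
    "wildlife_survey": {"bird", "deer"},
}


def ground(detections, codebook):
    """Stand-in for a grounding model: keep regions whose label is in the codebook."""
    return [region for region in detections if region.label in codebook]


def select_regions(detections, task):
    """Pick the regions to compress for a task by looking up its codebook."""
    return ground(detections, CODEBOOKS[task])


detections = [
    Region("car", (10, 20, 60, 50)),
    Region("bird", (70, 5, 80, 12)),
    Region("tree", (0, 0, 30, 90)),
]

traffic = select_regions(detections, "traffic_monitoring")
wildlife = select_regions(detections, "wildlife_survey")
print([r.label for r in traffic], [r.label for r in wildlife])
```

Changing the task changes only the dictionary lookup; no model weights are touched, which is the property the zero-retraining claim rests on.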

If this is right

  • Compressing only task-relevant regions saves significant bits compared to encoding entire images.
  • Switching tasks is achieved simply by selecting a new codebook, enabling zero-retraining adaptation.
  • The same compressed representation supports both high perceptual quality for humans and precise features for machines.
  • Extensive experiments show consistent outperformance over existing specialized compression methods.
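
A back-of-the-envelope sketch of the first bullet, under the simplifying assumption that bit cost scales with coded pixel area. The function name, the example boxes, and the 0.5 bpp rate are illustrative and not taken from the paper; the estimate also ignores side information such as box coordinates and labels.

```python
def covered_fraction(boxes, width, height):
    """Fraction of the image covered by the union of (x0, y0, x1, y1) boxes."""
    covered = set()
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image and mark its pixels; the set removes overlaps.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                covered.add((x, y))
    return len(covered) / (width * height)


width, height = 100, 100
boxes = [(0, 0, 50, 50), (25, 25, 75, 75)]  # two overlapping task-relevant regions
fraction = covered_fraction(boxes, width, height)

full_image_bpp = 0.5  # assumed rate for coding the whole image
full_bits = full_image_bpp * width * height
region_bits = full_bits * fraction  # assumes bits are proportional to coded area
print(f"coded area: {fraction:.2%} of the image")
```

In practice the saving depends on how the codec spends bits across regions, so this proportional model is only a first approximation.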

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disentanglement mechanism could support video sequences if the codebooks are extended to handle temporal consistency across frames.
  • One compressed bitstream might serve multiple downstream tasks simultaneously if several codebooks are applied in parallel.
  • Replacing the diffusion decoder with other generative models could further improve speed or quality for specific domains.

Load-bearing premise

Pre-generated task-specific label codebooks from an LLM combined with a grounding model can perform reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information.

What would settle it

An experiment introducing a completely novel machine vision task where the grounding model using an existing codebook fails to extract sufficient task-critical regions or the diffusion reconstruction falls below the accuracy of a task-specific codec on that task.
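
The experiment above has two separate failure conditions. A minimal sketch of the decision rule, assuming the three quantities have already been measured on the novel task; the `NovelTaskResult` fields and the `min_recall` threshold are hypothetical, since the page does not quantify what counts as "sufficient" region extraction.

```python
from dataclasses import dataclass


@dataclass
class NovelTaskResult:
    region_recall: float       # share of task-critical regions the grounding model extracted
    universal_accuracy: float  # task accuracy on the universal codec's reconstructions
    specific_accuracy: float   # task accuracy with a task-specific codec at a matched bitrate


def premise_fails(result, min_recall=0.9):
    """Return True if either failure condition from the experiment is met."""
    insufficient_regions = result.region_recall < min_recall
    worse_than_specific = result.universal_accuracy < result.specific_accuracy
    return insufficient_regions or worse_than_specific


# Made-up numbers, for illustration only.
print(premise_fails(NovelTaskResult(0.95, 0.80, 0.78)))
print(premise_fails(NovelTaskResult(0.50, 0.90, 0.80)))
```

Keeping the two conditions separate matters for diagnosis: low region recall points at the codebook or grounding model, while an accuracy gap with good recall points at the diffusion reconstruction.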

Figures

Figures reproduced from arXiv: 2412.18158 by Heming Sun, Jinming Liu, Junyan Lin, Shengyang Zhao, Wenjun Zeng, Xin Jin, Yuntao Wei, Zhibo Chen.

Figure 1: The codec paradigm of (a) human perception (b) ma…
Figure 2: The framework of DISCOVER: (1) First, we use MLLM and grounding model to perform…
Figure 3: The labels and localization generation process of…
Figure 4: Visualization of the intermediate process in semantics…
Figure 5: Machine vision tasks performance comparison. We use…
Figure 6: Human perception comparison. “Ours(Full)” means that all bitstreams are transmitted for reconstruction, while…
Figure 7: The visualization results. DISCOVER preserves information in task-related regions for machine vision tasks while effectively…
Figure 8: Ablation studies on COCO dataset. “w/o composition”…
Original abstract

Learned image compression methods have shown impressive performance but are often highly specialized for either human perception or specific machine vision tasks. This specialization limits their versatility and requires costly retraining for new applications. To address this, we introduce UniCodec, a universal codec built on a novel paradigm of semantic disentanglement at the encoder and compositional generation at the decoder. This framework is designed to simultaneously serve both human and machine needs, eliminating the need for task-specific retraining. At the encoder, UniCodec leverages pre-generated, task-specific label codebooks created by a Large Language Model (LLM). For any given task, a grounding model uses the corresponding codebook to perform task-aware disentanglement, compressing only the most relevant image regions. This mechanism not only saves significant bits but is also the key to our system's rapid, zero-retraining adaptation: switching to a new task is as simple as selecting a new codebook. The decoder then performs compositional generation: it combines the compact, disentangled components with powerful priors from a generative diffusion model. This process reconstructs a high-quality, complete image optimized with rich detail for human perception and precise features for machine vision tasks. Extensive experiments demonstrate that UniCodec consistently outperforms existing methods, effectively bridging the gap between human-centric and machine-centric compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UniCodec, a universal image codec using LLM-generated task-specific label codebooks for semantic disentanglement via a grounding model at the encoder, with compositional reconstruction via generative diffusion at the decoder. The framework claims to serve both human perception and machine vision tasks simultaneously, enable zero-retraining adaptation to new tasks via codebook switching, and outperform existing methods based on extensive experiments.

Significance. If the empirical claims hold, the work could meaningfully advance universal learned compression by reducing specialization and retraining costs through modular codebook-based adaptation and diffusion-based composition. The design choice to leverage external pre-trained LLM and diffusion components for task-aware encoding without parameter updates is a coherent contribution to bridging human- and machine-centric codecs.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'Extensive experiments demonstrate that UniCodec consistently outperforms existing methods' is presented without any metrics, baselines, ablation studies, or implementation details, leaving the primary empirical assertion without visible support in the manuscript.
  2. [Abstract] Abstract: The weakest assumption—that pre-generated LLM task-specific label codebooks combined with a grounding model enable reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information, and that the diffusion decoder can simultaneously optimize perceptual quality and machine-task features—is load-bearing for the zero-retraining and universal claims but receives no validation, edge-case analysis, or failure-mode discussion.
minor comments (1)
  1. [Title] Title: 'Efficiently LLM Reasoning' appears to contain a grammatical or phrasing error and should be revised for clarity (e.g., 'Efficient LLM Reasoning').

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'Extensive experiments demonstrate that UniCodec consistently outperforms existing methods' is presented without any metrics, baselines, ablation studies, or implementation details, leaving the primary empirical assertion without visible support in the manuscript.

    Authors: We agree that the abstract should not assert empirical superiority without visible support. The full manuscript (Sections 4 and 5) contains the requested details: quantitative metrics on perceptual quality (PSNR, LPIPS, FID) and machine-task accuracy, comparisons against learned codecs (e.g., Cheng2020, ELIC) and task-specific methods, ablation studies on codebook usage and diffusion components, and implementation details. To resolve the abstract-level concern we will revise the abstract to either include a concise set of key results or qualify the claim as 'as shown in our experiments'. revision: yes

  2. Referee: [Abstract] Abstract: The weakest assumption—that pre-generated LLM task-specific label codebooks combined with a grounding model enable reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information, and that the diffusion decoder can simultaneously optimize perceptual quality and machine-task features—is load-bearing for the zero-retraining and universal claims but receives no validation, edge-case analysis, or failure-mode discussion.

    Authors: The manuscript validates the core mechanism through experiments on multiple distinct tasks (object detection, segmentation, classification) demonstrating zero-retraining adaptation via codebook switching and joint optimization of perceptual and task metrics. However, we acknowledge that explicit discussion of edge cases (e.g., ambiguous scenes, out-of-distribution tasks) and failure modes is limited. We will add a dedicated limitations subsection with failure-case analysis and a brief discussion of the scope of 'arbitrary' tasks supported by the current codebook generation process. revision: partial

Circularity Check

0 steps flagged

No significant circularity; method is empirical and relies on external pre-trained components

Full rationale

The paper presents UniCodec as an engineering framework that combines pre-existing LLM-generated codebooks, a grounding model, and a generative diffusion decoder. No equations, parameter-fitting procedures, or first-principles derivations are described in the provided text. The central claim is empirical outperformance on human and machine tasks via codebook swapping, with no load-bearing step that reduces by construction to a fitted input or self-citation chain. The approach is self-contained as an experimental proposal whose validity rests on external benchmarks rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5785 in / 1151 out tokens · 54252 ms · 2026-05-23T06:38:15.986178+00:00 · methodology

