Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

Heming Sun; Jinming Liu; Junyan Lin; Shengyang Zhao; Wenjun Zeng; Xin Jin; Yuntao Wei; Zhibo Chen

arxiv: 2412.18158 · v2 · pith:44ETUYDN · submitted 2024-12-24 · cs.CV · eess.IV


Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

Pith reviewed 2026-05-23 06:38 UTC · model grok-4.3


keywords: universal image compression · semantic disentanglement · LLM codebooks · generative diffusion · task-aware coding · human-centric compression · machine vision compression


The pith

A universal image codec disentangles task-specific semantics with LLM-generated codebooks and reconstructs via diffusion to serve both human perception and machine tasks without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current learned image compression methods specialize either for human viewing or for narrow machine tasks, forcing separate models and expensive retraining whenever a new application appears. The paper introduces UniCodec to solve this by performing semantic disentanglement at the encoder: a grounding model uses pre-generated task-specific label codebooks from an LLM to compress only the image regions relevant to the current task. At the decoder, compositional generation combines the compact components with priors from a generative diffusion model to produce a single reconstruction that supplies both rich visual detail and task-precise features. Switching tasks requires only loading a different codebook, removing the need for retraining. A sympathetic reader would care because the approach promises one efficient system that can handle the expanding range of human and machine uses for the same image data.

Core claim

UniCodec establishes that semantic disentanglement at the encoder, driven by pre-generated task-specific label codebooks from an LLM and applied via a grounding model, combined with compositional generation at the decoder using generative diffusion priors, produces a universal codec that delivers high-quality reconstructions optimized for both human perception and machine vision tasks across arbitrary applications without any task-specific retraining.

What carries the argument

LLM-generated task-specific label codebooks used by a grounding model for task-aware disentanglement at the encoder, paired with generative diffusion for compositional reconstruction at the decoder.
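
The encoder-side control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the contents of `CODEBOOKS`, the `toy_grounder` stand-in (which looks labels up in pre-annotated boxes instead of running a real grounding model), and every function name are assumptions introduced here. The sketch shows only the idea of selecting a codebook by task name, grounding its labels, and deriving the mask of regions that would be compressed.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel box
Grounder = Callable[[Dict[str, List[Box]], List[str]], List[Box]]

# Hypothetical codebooks. In the paper these are pre-generated by an LLM;
# the labels below are made up for illustration.
CODEBOOKS: Dict[str, List[str]] = {
    "detection": ["person", "car"],
    "face_analysis": ["face"],
}


def toy_grounder(annotations: Dict[str, List[Box]], labels: List[str]) -> List[Box]:
    """Stand-in for a grounding model: return the known boxes for each label."""
    return [box for label in labels for box in annotations.get(label, [])]


def boxes_to_mask(width: int, height: int, boxes: List[Box]) -> List[List[bool]]:
    """Rasterize boxes into a boolean mask, clipping to the image bounds."""
    mask = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = True
    return mask


def select_regions(
    task: str,
    annotations: Dict[str, List[Box]],
    width: int,
    height: int,
    grounder: Grounder = toy_grounder,
) -> List[List[bool]]:
    """Pick the task's codebook, ground its labels, and return the region mask."""
    labels = CODEBOOKS[task]  # switching tasks only changes this lookup
    return boxes_to_mask(width, height, grounder(annotations, labels))
```

In this sketch, moving from `"detection"` to `"face_analysis"` changes only the dictionary key; no model parameters are touched, which is the property the page attributes to codebook switching.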

If this is right

Compressing only task-relevant regions saves significant bits compared to encoding entire images.
Switching tasks is achieved simply by selecting a new codebook, enabling zero-retraining adaptation.
The same compressed representation supports both high perceptual quality for humans and precise features for machines.
Extensive experiments show consistent outperformance over existing specialized compression methods.
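
The first consequence above is simple arithmetic, and a rough sketch can make it concrete. The model below assumes a uniform bits-per-pixel cost inside the task-relevant region and a (possibly zero) lower cost elsewhere; both the model and the function names are assumptions made here for illustration, and the numbers are not taken from the paper.

```python
def estimated_bits(total_pixels: int, roi_pixels: int,
                   bpp_roi: float, bpp_rest: float = 0.0) -> float:
    """Bits spent when the task-relevant region and the rest use different rates."""
    if not 0 <= roi_pixels <= total_pixels:
        raise ValueError("roi_pixels must lie between 0 and total_pixels")
    return roi_pixels * bpp_roi + (total_pixels - roi_pixels) * bpp_rest


def saving_ratio(total_pixels: int, roi_pixels: int,
                 bpp_roi: float, bpp_rest: float = 0.0) -> float:
    """Fraction of bits saved relative to coding every pixel at bpp_roi."""
    full = total_pixels * bpp_roi
    if full <= 0:
        raise ValueError("total_pixels and bpp_roi must be positive")
    return 1.0 - estimated_bits(total_pixels, roi_pixels, bpp_roi, bpp_rest) / full
```

Under this toy model, the saving grows as the task-relevant region shrinks and as the rate spent outside it approaches zero. A real codec would also pay for side information such as region locations, which the sketch ignores.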

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same disentanglement mechanism could support video sequences if the codebooks are extended to handle temporal consistency across frames.
One compressed bitstream might serve multiple downstream tasks simultaneously if several codebooks are applied in parallel.
Replacing the diffusion decoder with other generative models could further improve speed or quality for specific domains.
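
The second extension above, applying several codebooks in parallel, could be prototyped by merging their labels before grounding. The helper below is a hypothetical sketch of that merge only: the function name, the case-insensitive deduplication rule, and the example codebooks are assumptions, not something the paper describes.

```python
from typing import Dict, Iterable, List


def merge_codebooks(codebooks: Dict[str, List[str]], tasks: Iterable[str]) -> List[str]:
    """Union the labels of several task codebooks, keeping first-seen order."""
    seen = set()
    merged: List[str] = []
    for task in tasks:
        for label in codebooks[task]:
            key = label.strip().lower()  # treat "Car" and "car" as the same label
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged
```

The merged list could then be handed to a grounding model in a single pass, so that one set of selected regions serves every requested task. Whether that preserves enough information per task is exactly the open question the extension raises.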

Load-bearing premise

Pre-generated task-specific label codebooks from an LLM combined with a grounding model can perform reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information.

What would settle it

An experiment introducing a completely novel machine vision task where the grounding model using an existing codebook fails to extract sufficient task-critical regions or the diffusion reconstruction falls below the accuracy of a task-specific codec on that task.

Figures

Figures reproduced from arXiv: 2412.18158 by Heming Sun, Jinming Liu, Junyan Lin, Shengyang Zhao, Wenjun Zeng, Xin Jin, Yuntao Wei, Zhibo Chen.

**Figure 2.** The framework of DISCOVER: (1) First, we use MLLM and grounding model to perform …

**Figure 3.** The labels and localization generation process of …

**Figure 4.** Visualization of the intermediate process in semantics …

**Figure 5.** Machine vision tasks performance comparison. We use …

**Figure 6.** Human perception comparison. “Ours(Full)” means that all bitstreams are transmitted for reconstruction, while …

**Figure 7.** The visualization results. DISCOVER preserves information in task-related regions for machine vision tasks while effectively …

**Figure 8.** Ablation studies on COCO dataset. “w/o composition” …

Original abstract

Learned image compression methods have shown impressive performance but are often highly specialized for either human perception or specific machine vision tasks. This specialization limits their versatility and requires costly retraining for new applications. To address this, we introduce UniCodec, a universal codec built on a novel paradigm of semantic disentanglement at the encoder and compositional generation at the decoder. This framework is designed to simultaneously serve both human and machine needs, eliminating the need for task-specific retraining. At the encoder, UniCodec leverages pre-generated, task-specific label codebooks created by a Large Language Model (LLM). For any given task, a grounding model uses the corresponding codebook to perform task-aware disentanglement, compressing only the most relevant image regions. This mechanism not only saves significant bits but is also the key to our system's rapid, zero-retraining adaptation: switching to a new task is as simple as selecting a new codebook. The decoder then performs compositional generation: it combines the compact, disentangled components with powerful priors from a generative diffusion model. This process reconstructs a high-quality, complete image optimized with rich detail for human perception and precise features for machine vision tasks. Extensive experiments demonstrate that UniCodec consistently outperforms existing methods, effectively bridging the gap between human-centric and machine-centric compression.
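
The decoder-side idea of combining transmitted components with generated content can be pictured with a plain mask-based composite. This is a deliberately simplified stand-in: the abstract describes composition through a generative diffusion model, whereas the sketch below just pastes decoded pixels over a generated image in the pixel domain. The function name and the nested-list image representation are assumptions made for illustration.

```python
from typing import List, TypeVar

Pixel = TypeVar("Pixel")


def compose(decoded: List[List[Pixel]],
            generated: List[List[Pixel]],
            mask: List[List[bool]]) -> List[List[Pixel]]:
    """Keep decoded pixels where mask is True and generated pixels elsewhere."""
    if not (len(decoded) == len(generated) == len(mask)):
        raise ValueError("all inputs must have the same number of rows")
    out = []
    for d_row, g_row, m_row in zip(decoded, generated, mask):
        if not (len(d_row) == len(g_row) == len(m_row)):
            raise ValueError("all rows must have the same width")
        out.append([d if m else g for d, g, m in zip(d_row, g_row, m_row)])
    return out
```

The point of the sketch is the division of labour: pixels inside the task mask come from the bitstream and stay faithful for machine tasks, while everything else is supplied by a generative prior for perceptual completeness.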

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The LLM codebook disentanglement plus diffusion composition idea is a coherent way to aim for zero-retraining universal coding, but the abstract supplies zero metrics or ablations to show it actually works.


The paper's main contribution is a framework called UniCodec that uses pre-generated task-specific codebooks from an LLM at the encoder for semantic disentanglement via a grounding model, then relies on compositional generation with a diffusion model at the decoder to produce images that try to serve both human perception and machine tasks. The zero-retraining adaptation by swapping codebooks is the practical hook, and it avoids the usual need for separate models or retraining when the task changes. That design choice is internally consistent and does not rely on self-referential fitting or invented parameters in the description given. The abstract frames this as distinct from standard learned codecs that specialize in one objective, which is a fair characterization of the intended novelty.

What the work does well is spell out a clear pipeline for task-aware compression that saves bits by focusing only on relevant regions, then uses strong generative priors to reconstruct. If the grounding and diffusion stages function as stated, this could reduce the number of task-specific codecs in pipelines like surveillance.

The soft spot is straightforward: the abstract asserts that extensive experiments show consistent outperformance and that it bridges human-centric and machine-centric compression, yet it includes no numbers, no baselines, no ablation results, and no implementation details. Without those, it is impossible to assess whether the disentanglement preserves necessary information for arbitrary tasks or whether the diffusion output actually optimizes both perceptual quality and machine features at once. The central assumption, that LLM codebooks plus grounding can reliably handle new tasks without loss, is strong and needs direct evidence.

This is the kind of paper that belongs in a reading group for people working on multi-objective learned compression, since the framework raises useful questions even if the results are not yet visible. It deserves peer review so referees can examine the full experiments and check whether the empirical support matches the claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces UniCodec, a universal image codec using LLM-generated task-specific label codebooks for semantic disentanglement via a grounding model at the encoder, with compositional reconstruction via generative diffusion at the decoder. The framework claims to serve both human perception and machine vision tasks simultaneously, enable zero-retraining adaptation to new tasks via codebook switching, and outperform existing methods based on extensive experiments.

Significance. If the empirical claims hold, the work could meaningfully advance universal learned compression by reducing specialization and retraining costs through modular codebook-based adaptation and diffusion-based composition. The design choice to leverage external pre-trained LLM and diffusion components for task-aware encoding without parameter updates is a coherent contribution to bridging human- and machine-centric codecs.

major comments (2)

[Abstract] The central claim that 'Extensive experiments demonstrate that UniCodec consistently outperforms existing methods' is presented without any metrics, baselines, ablation studies, or implementation details, leaving the primary empirical assertion without visible support in the manuscript.
[Abstract] The weakest assumption (that pre-generated LLM task-specific label codebooks combined with a grounding model enable reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information, and that the diffusion decoder can simultaneously optimize perceptual quality and machine-task features) is load-bearing for the zero-retraining and universal claims but receives no validation, edge-case analysis, or failure-mode discussion.

minor comments (1)

[Title] 'Efficiently LLM Reasoning' appears to contain a grammatical or phrasing error and should be revised for clarity (e.g., 'Efficient LLM Reasoning').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point-by-point below and outline the revisions we will make.


Referee: [Abstract] The central claim that 'Extensive experiments demonstrate that UniCodec consistently outperforms existing methods' is presented without any metrics, baselines, ablation studies, or implementation details, leaving the primary empirical assertion without visible support in the manuscript.

Authors: We agree that the abstract should not assert empirical superiority without visible support. The full manuscript (Sections 4 and 5) contains the requested details: quantitative metrics on perceptual quality (PSNR, LPIPS, FID) and machine-task accuracy, comparisons against learned codecs (e.g., Cheng2020, ELIC) and task-specific methods, ablation studies on codebook usage and diffusion components, and implementation details. To resolve the abstract-level concern we will revise the abstract to either include a concise set of key results or qualify the claim as 'as shown in our experiments'.

Revision: yes

Referee: [Abstract] The weakest assumption (that pre-generated LLM task-specific label codebooks combined with a grounding model enable reliable task-aware disentanglement for arbitrary new tasks while preserving all necessary information, and that the diffusion decoder can simultaneously optimize perceptual quality and machine-task features) is load-bearing for the zero-retraining and universal claims but receives no validation, edge-case analysis, or failure-mode discussion.

Authors: The manuscript validates the core mechanism through experiments on multiple distinct tasks (object detection, segmentation, classification) demonstrating zero-retraining adaptation via codebook switching and joint optimization of perceptual and task metrics. However, we acknowledge that explicit discussion of edge cases (e.g., ambiguous scenes, out-of-distribution tasks) and failure modes is limited. We will add a dedicated limitations subsection with failure-case analysis and a brief discussion of the scope of 'arbitrary' tasks supported by the current codebook generation process.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; method is empirical and relies on external pre-trained components


The paper presents UniCodec as an engineering framework that combines pre-existing LLM-generated codebooks, a grounding model, and a generative diffusion decoder. No equations, parameter-fitting procedures, or first-principles derivations are described in the provided text. The central claim is empirical outperformance on human and machine tasks via codebook swapping, with no load-bearing step that reduces by construction to a fitted input or self-citation chain. The approach is self-contained as an experimental proposal whose validity rests on external benchmarks rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 11 internal anchors

[1]

The jpeg still picture compression standard,

G. K. Wallace, “The jpeg still picture compression standard,” IEEE transactions on consumer electron- ics, vol. 38, no. 1, pp. xviii–xxxiv, 1992. 1, 2

work page 1992
[2]

Overview of the high efficiency video coding (hevc) standard,

G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (hevc) standard,” TCSVT, vol. 22, no. 12, pp. 1649–1668,

work page
[3]

Overview of the versatile video coding (vvc) standard and its applications,

B. Bross, Y .-K. Wang, Y . Ye, S. Liu, J. Chen, G. J. Sul- livan, and J.-R. Ohm, “Overview of the versatile video coding (vvc) standard and its applications,” TCSVT,

work page
[4]

Learned image com- pression with mixed transformer-cnn architectures,

J. Liu, H. Sun, and J. Katto, “Learned image com- pression with mixed transformer-cnn architectures,” in Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR) , June 2023, pp. 14 388–14 397. 1, 2, 3, 5

work page 2023
[5]

Frequency-aware transformer for learned image compression,

H. Li, S. Li, W. Dai, C. Li, J. Zou, and H. Xiong, “Frequency-aware transformer for learned image compression,” International Conference on Learning Representations, 2024. 1, 3, 7, 8

work page 2024
[6]

Learned image compression with discretized gaus- sian mixture likelihoods and attention modules,

Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaus- sian mixture likelihoods and attention modules,” in CVPR, 2020, pp. 7939–7948. 1, 2

work page 2020
[7]

‘misc: Ultra-low bitrate image semantic compression driven by large multimodal model,

C. Li, G. Lu, D. Feng, H. Wu, Z. Zhang, X. Liu, G. Zhai, W. Lin, and W. Zhang, “‘misc: Ultra-low bitrate image semantic compression driven by large multimodal model,”arXiv preprint arXiv:2402.16749,

work page arXiv
[8]

Multi-realism image compression with a conditional generator,

E. Agustsson, D. Minnen, G. Toderici, and F. Mentzer, “Multi-realism image compression with a conditional generator,” in Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , 2023, pp. 22 324–22 333. 3

work page 2023
[9]

Improving statistical fidelity for neu- ral image compression with implicit local likelihood models,

M. J. Muckley, A. El-Nouby, K. Ullrich, H. J ´egou, and J. Verbeek, “Improving statistical fidelity for neu- ral image compression with implicit local likelihood models,” in International Conference on Machine Learning. PMLR, 2023, pp. 25 426–25 443. 7, 8

work page 2023
[10]

Towards image compression with perfect re- alism at ultra-low bitrates,

M. Careil, M. J. Muckley, J. Verbeek, and S. Lath- uili`ere, “Towards image compression with perfect re- alism at ultra-low bitrates,” in The Twelfth Interna- tional Conference on Learning Representations, 2024. 1, 3, 7, 8

work page 2024
[11]

End-to-end optimized image compression,

J. Ball ´e, V . Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” inICLR, 2017. 1, 2, 3

work page 2017
[12]

Variational image compression with a scale hyperprior,

J. Ball ´e, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” inICLR, 2018. 1, 3, 5

work page 2018
[13]

Faster r-cnn: Towards real-time object detection with region pro- posal networks,

S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region pro- posal networks,” NeurIPS, vol. 28, pp. 91–99, 2015. 1, 6

work page 2015
[14]

Segment Anything

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rol- land, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763. 1

work page 2021
[16]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

B. Li, R. Wang, G. Wang, Y . Ge, Y . Ge, and Y . Shan, “Seed-bench: Benchmarking multimodal llms with generative comprehension,” arXiv preprint arXiv:2307.16125, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

One at a time: Multi- step volumetric probability distribution diffusion for depth estimation,

B. Li, J. Dong, Y . Wang, J. Liu, L. Yin, W. Zhao, Z. Zhu, X. Jin, and W. Zeng, “One at a time: Multi- step volumetric probability distribution diffusion for depth estimation,” arXiv preprint arXiv:2306.12681 ,

work page arXiv
[18]

Image coding for ma- chines with omnipotent feature learning,

R. Feng, X. Jin, Z. Guo, R. Feng, Y . Gao, T. He, Z. Zhang, S. Sun, and Z. Chen, “Image coding for ma- chines with omnipotent feature learning,” in ECCV. Springer, 2022, pp. 510–528. 2, 3

work page 2022
[19]

Rate-distortion-cognition controllable versa- tile neural image compression,

J. Liu, R. Feng, Y . Qi, Q. Chen, Z. Chen, W. Zeng, and X. Jin, “Rate-distortion-cognition controllable versa- tile neural image compression,” in European Confer- ence on Computer Vision. Springer, 2025, pp. 329–

work page 2025
[20]

Bridging compressed image latents and multimodal large language models.arXiv preprint arXiv:2407.19651,

C.-H. Kao, C. Chien, Y .-J. Tseng, Y .-H. Chen, A. Gnutti, S.-Y . Lo, W.-H. Peng, and R. Leonardi, “Comneck: Bridging compressed image latents and multimodal llms via universal transform-neck,” arXiv preprint arXiv:2407.19651, 2024

work page arXiv 2024
[21]

Image compression for machine and human vision with spatial-frequency adaptation,

H. Li, S. Li, S. Ding, W. Dai, M. Cao, C. Li, J. Zou, and H. Xiong, “Image compression for machine and human vision with spatial-frequency adaptation,” Eu- ropean Conference on Computer Vision, 2024. 2

work page 2024
[22]

Transtic: Transferring transformer-based image compression from human perception to machine perception,

Y .-H. Chen, Y .-C. Weng, C.-H. Kao, C. Chien, W.- C. Chiu, and W.-H. Peng, “Transtic: Transferring transformer-based image compression from human perception to machine perception,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23 297–23 307. 2, 3, 7

work page 2023
[23]

End-to-end optimized image com- pression for machines, a study,

L. D. Chamain, F. Racap ´e, J. B ´egaint, A. Pushparaja, and S. Feltman, “End-to-end optimized image com- pression for machines, a study,” in 2021 Data Com- pression Conference (DCC). IEEE, 2021, pp. 163–

work page 2021
[24]

Gpt-4o version,

ChatGPT, “Gpt-4o version,” https://chat.openai.com/ chat, 2024, accessed: June 14, 2024. 2, 3, 6

work page 2024
[25]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” European Conference on Computer Vision, 2024. 2, 3, 4, 5, 6

work page 2024
[26]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in in Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2022. 2, 3, 5, 6

work page 2022
[27]

Open- sora: Democratizing efficient video production for all,

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y . Zhou, T. Li, and Y . You, “Open- sora: Democratizing efficient video production for all,” March 2024. [Online]. Available: https: //github.com/hpcaitech/Open-Sora 2

work page 2024
[28]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Y . Liu, K. Zhang, Y . Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y . Huang, H. Sun, J. Gao et al., “Sora: A review on background, technology, limitations, and opportunities of large vision models,” arXiv preprint arXiv:2402.17177, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Photorealis- tic text-to-image diffusion models with deep language understanding,

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al. , “Photorealis- tic text-to-image diffusion models with deep language understanding,” Advances in neural information pro- cessing systems, vol. 35, pp. 36 479–36 494, 2022. 2

work page 2022
[30]

Calculation of average psnr differ- ences between rd-curves,

G. Bjøntegaard, “Calculation of average psnr differ- ences between rd-curves,” 2001. [Online]. Available: https://api.semanticscholar.org/CorpusID:61598325 2, 7

work page 2001
[31]

Condi- tional perceptual quality preserving image compres- sion,

T. Xu, Q. Zhang, Y . Li, D. He, Z. Wang, Y . Wang, H. Qin, Y . Wang, J. Liu, and Y .-Q. Zhang, “Condi- tional perceptual quality preserving image compres- sion,”arXiv preprint arXiv:2308.08154, 2023. 2, 8

work page arXiv 2023
[32]

An overview of the jpeg 2000 still image compression standard,

M. Rabbani and R. Joshi, “An overview of the jpeg 2000 still image compression standard,” Signal pro- cessing: Image communication , vol. 17, no. 1, pp. 3– 48, 2002. 2

work page 2000
[33]

Overview of the h. 264/avc video coding standard,

T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h. 264/avc video coding standard,” TCSVT, vol. 13, no. 7, pp. 560–576, 2003. 2

work page 2003
[34]

Joint autore- gressive and hierarchical priors for learned image compression,

D. Minnen, J. Ball ´e, and G. Toderici, “Joint autore- gressive and hierarchical priors for learned image compression,” inNeurIPS, 2018. 2

work page 2018
[35]

Conditional probability models for deep image compression,

F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in CVPR, 2018, pp. 4394– 4402

work page 2018
[36]

High-fidelity generative image compression,

F. Mentzer, G. Toderici, M. Tschannen, and E. Agusts- son, “High-fidelity generative image compression,” Advances in neural information processing systems ,

work page
[37]

Deep sets

R. Zhang, R. Fang, P. Gao, W. Zhang, K. Li, J. Dai, Y . Qiao, and H. Li, “Tip-adapter: Training-free clip- adapter for better vision-language modeling,” arXiv preprint arXiv:2111.03930, 2021. 2

work page arXiv 2021
[38]

Beyond cod- ing: Detection-driven image compression with seman- tically structured bit-stream,

T. He, S. Sun, Z. Guo, and Z. Chen, “Beyond cod- ing: Detection-driven image compression with seman- tically structured bit-stream,” in 2019 Picture Coding Symposium (PCS). IEEE, 2019, pp. 1–5. 2

work page 2019
[39]

Video coding for machines: A paradigm of collab- orative compression and intelligent analytics,

L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collab- orative compression and intelligent analytics,” TIP, vol. 29, pp. 8680–8695, 2020

work page 2020
[40]

Semantic structured im- age coding framework for multiple intelligent applica- tions,

S. Sun, T. He, and Z. Chen, “Semantic structured im- age coding framework for multiple intelligent applica- tions,”TCSVT, 2020

work page 2020
[41]

Semantical video coding: Instill static- dynamic clues into structured bitstream for ai tasks,

X. Jin, R. Feng, S. Sun, R. Feng, T. He, and Z. Chen, “Semantical video coding: Instill static- dynamic clues into structured bitstream for ai tasks,” Journal of Visual Communication and Image Repre- sentation, vol. 93, p. 103816, 2023

work page 2023
[42]

Semantic segmentation in learned compressed domain,

J. Liu, H. Sun, and J. Katto, “Semantic segmentation in learned compressed domain,” in 2022 Picture Cod- ing Symposium (PCS). IEEE, 2022, pp. 181–185

work page 2022
[43]

Com- posable image coding for machine via task-oriented internal adaptor and external prior,

J. Liu, X. Jin, R. Feng, Z. Chen, and W. Zeng, “Com- posable image coding for machine via task-oriented internal adaptor and external prior,” in 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2023, pp. 1–5. 2

work page 2023
[44]

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, “The dawn of lmms: Prelimi- nary explorations with gpt-4v(ision),” arXiv preprint arXiv:2309.17421, vol. 9, no. 1, p. 1, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Visual instruc- tion tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruc- tion tuning,” Advances in neural information process- ing systems, vol. 36, 2024. 3

work page 2024
[46]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y . Yang, “Ferret: Refer and ground anything anywhere at any granularity,” arXiv preprint arXiv:2310.07704, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

D. Zhu, J. Chen, X. Shen, X. Li, and M. Elho- seiny, “Minigpt-4: Enhancing vision-language under- standing with advanced large language models,”arXiv preprint arXiv:2304.10592, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang et al. , “Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Gemini: A Family of Highly Capable Multimodal Models

R. Anil, S. Borgeaud, Y . Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al. , “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, vol. 1, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Deco: Decoupling token compression from semantic abstraction in multimodal large language models,

L. Yao, L. Li, S. Ren, L. Wang, Y . Liu, X. Sun, and L. Hou, “Deco: Decoupling token compression from semantic abstraction in multimodal large language models,”arXiv preprint arXiv:2405.20985, 2024. 3

work page arXiv 2024
[51]

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

B. McKinzie, Z. Gan, J.-P. Fauconnier, S. Dodge, B. Zhang, P. Dufter, D. Shah, X. Du, F. Peng, F. Weers et al., “Mm1: Methods, analysis & insights from multimodal llm pre-training,” arXiv preprint arXiv:2403.09611, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

A fast and accurate one-stage approach to vi- sual grounding,

Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo, “A fast and accurate one-stage approach to vi- sual grounding,” in Proceedings of the IEEE/CVF in- ternational conference on computer vision , 2019, pp. 4683–4693. 3

work page 2019
[53]

A real-time cross-modality correlation filtering method for referring expression comprehen- sion,

Y . Liao, S. Liu, G. Li, F. Wang, Y . Chen, C. Qian, and B. Li, “A real-time cross-modality correlation filtering method for referring expression comprehen- sion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2020, pp. 10 880–10 889

work page 2020
[54]

Improv- ing one-stage visual grounding by recursive sub-query construction,

Z. Yang, T. Chen, L. Wang, and J. Luo, “Improv- ing one-stage visual grounding by recursive sub-query construction,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer, 2020, pp. 387–404. 3

work page 2020
[55]

Blended diffusion for text-driven editing of natural images,

O. Avrahami, D. Lischinski, and O. Fried, “Blended diffusion for text-driven editing of natural images,” in Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , 2022, pp. 18 208–18 218. 3

work page 2022
[56]

Diffedit: Diffusion-based semantic image editing with mask guidance,

G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “Diffedit: Diffusion-based semantic image editing with mask guidance,” arXiv preprint arXiv:2210.11427, 2022

work page arXiv 2022
[57]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

C. Meng, Y . He, Y . Song, J. Song, J. Wu, J.-Y . Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” arXiv preprint arXiv:2108.01073, 2021

[58]

G. Parmar, K. Kumar Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu, “Zero-shot image-to-image translation,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11. 3

[59]

M. Akbari, J. Liang, and J. Han, “Dsslic: Deep semantic segmentation-based layered image compression,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2042–2046. 3

[60]

J. Liu, H. Sun, and J. Katto, “Learning in compressed domain for faster machine vision tasks,” in 2021 International Conference on Visual Communications and Image Processing (VCIP). IEEE, 2021, pp. 01–05. 3

[61]

N. Körber, E. Kromer, A. Siebert, S. Hauke, D. Mueller-Gritschneder, and B. Schuller, “Egic: Enhanced low-bit-rate generative image compression guided by semantic segmentation,” in European Conference on Computer Vision. Springer, 2025, pp. 202–220. 3

[62]

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, vol. 1. Minneapolis, Minnesota, 2019, p. 2. 4

[63]

Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022. 4

[64]

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847. 5

[65]

D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, “Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5718–5727. 6, 7

[66]

A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, 2020. 6

[67]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV. Springer, 2014, pp. 740–755. 6

[68]

CLIC, “Workshop and challenge on learned image compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

[69]

B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July

[70]

E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 6

[71]

K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017, pp. 2961–2969. 6

[72]

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778. 6

[73]

J. Deng, “A large-scale hierarchical image database,” Proceedings of IEEE/CVF conference on Computer Vision and Pattern Recognition, 2009. 6

[74]

E. Kodak, “Kodak lossless true color image suite (photocd pcd0992),” 1993. 6

[75]

R. Yang and S. Mandt, “Lossy image compression with conditional diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024. 6

[76]

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017. 6

[77]

M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying mmd gans,” arXiv preprint arXiv:1801.01401, 2018. 6

[78]

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2567–2581,

[79]

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595. 6

[80]

J. V. E. Team, “Vvc official test model vtm.” 2021. 7


