pith. machine review for the scientific record. sign in

arxiv: 2604.05347 · v1 · submitted 2026-04-07 · 📡 eess.IV · cs.CV· cs.MM

Recognition: 2 theorem links

· Lean Theorem

CI-ICM: Channel Importance-driven Learned Image Coding for Machines

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3

classification 📡 eess.IV cs.CVcs.MM
keywords learned image compressionmachine visionchannel importanceobject detectioninstance segmentationfeature channel groupingbitrate allocationtask adaptation
0
0 comments X

The pith

A learned image codec for machines scores feature channel importance to allocate bits preferentially and raise task accuracy at fixed bitrates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional image codecs optimized for human eyes waste bits on details irrelevant to machines and discard features machines require. The paper introduces CI-ICM to generate importance scores for every feature channel, group and scale channels accordingly, and apply context modeling that protects high-value channels while adapting the output to multiple downstream tasks. Experiments on COCO2017 demonstrate clear gains in object detection and instance segmentation over a baseline learned codec. Readers should care because machine perception pipelines now process the majority of images; shifting compression toward task-critical features can cut transmission costs while preserving or improving AI accuracy.

Core claim

The authors propose Channel Importance-driven learned Image Coding for Machines (CI-ICM). A Channel Importance Generation module produces and ranks channel importance scores via a channel order loss. These scores feed a Feature Channel Grouping and Scaling module that non-uniformly groups channels and adjusts their dynamic ranges, plus a Channel Importance-based Context module that allocates bits to preserve fidelity in critical channels. A Task-Specific Channel Adaptation module further enhances features for multiple machine tasks. On COCO2017 the method delivers BD-mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the baseline codec.

What carries the argument

The Channel Importance Generation (CIG) module that quantifies and ranks feature-channel importance for machine tasks, enabling the Feature Channel Grouping and Scaling (FCGS) and Channel Importance-based Context (CI-CTX) modules to perform non-uniform bitrate allocation.

If this is right

  • Machine vision tasks obtain higher mean average precision at the same bitrate constraint.
  • Bitrate is allocated non-uniformly to preserve higher fidelity in channels ranked as task-critical.
  • A single codec supports multiple downstream tasks through the task-specific adaptation module.
  • Ablation studies confirm that each of the four proposed modules contributes to the measured gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same importance-driven grouping could be applied to compress video streams for surveillance or autonomous driving pipelines.
  • If the importance scores generalize beyond the tested models, pre-computed channel rankings might enable faster real-time encoding.
  • The work suggests compression loops that incorporate feedback from the downstream machine task could outperform purely reconstruction-focused codecs.

Load-bearing premise

The channel importance scores produced by the CIG module accurately reflect task-critical information across varied machine vision models and datasets.

What would settle it

Apply CI-ICM-compressed images to an object-detection or segmentation model whose architecture was not used when training the channel importance scores and check whether the BD-mAP gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2604.05347 by Gangyi Jiang, Huan Zhang, Junle Liu, Weisi Lin, Yun Zhang, Zhaoqing Pan.

Figure 1
Figure 1. Figure 1: Feature channel importance analysis by adding distortions, where [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Framework of the proposed CI-ICM. (a) Network architecture of the proposed CI-ICM, (b) The reordering, grouping, and scaling feature flow of [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Relationship between the number of removed channels and the [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Decoding process of the CI-CTX module. the weights Wc, more features of higher importance can be reserved in the feature removal experiment, which validates the effectiveness of CIG and its generated importance weights. Based on these weights Wc, we propose a channel order loss LCO to guide the extraction process of the feature representation module ga. In this case, the output feature y from retrained ga … view at source ↗
Figure 5
Figure 5. Figure 5: MSE(Φi ch-org, Φi ch) with different si. Note that si, i ∈ [1, 2, 3] are plotted in 1/si, and s4 are plotted in log10(s4) for better observation. (a) s1, (b) s2, (c) s3, (d) s4. We established baseline priors Φi ch-org generated by the prior extraction network g i ch without any scaling. The scale table {si} is initialized as [1, 1, 2, 10, 1 × 104 ], complying with the principles si−1 ≤ si . Then, the scal… view at source ↗
Figure 7
Figure 7. Figure 7: Structure of the TSCA module. to 0.099, which are small. The Correlation Coefficients (CC) of the fitting curves are 0.955 to 0.995, which is high. The high CC and small RMSE show that the quadratic function is accurate. Therefore, based on these fitted curves, the optimal si is obtained to achieve the highest mAP@50:95, which is [1, 1.85, 2.27, 3.71, 1 × 104.38]. Note that these optimal values are obtaine… view at source ↗
Figure 8
Figure 8. Figure 8: Three stages of training for the proposed CI-ICM, where dark green [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Trained Wc of an image in COCO2017 before and after training using channel order loss. (a) Before training, (b) After training. where Ryˆi denote the bitrate of the i-th feature subset, n the the total number of subsets, Φ0 z and Φi ch denotes the hyper￾prior parameters of i-th feature channels. To enforce ordered feature extraction and proper feature separation, channel order loss is added to formulate th… view at source ↗
Figure 10
Figure 10. Figure 10: Rate-accuracy curves of the proposed CI-ICM and benchmark schemes on object detection task, (a) mAP@50:95, (b) mAP@50, (c) mAP@75. [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of object detection results using different codecs, where bpp values are also presented. (a) and (f): Ground truth, (b) and (g): ELIC, (c) [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Rate-accuracy curves of the proposed CI-ICM and baseline schemes on instance segmentation task, (a) mAP@50:95, (b) mAP@50, (c) mAP@75. [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Visualization of instance segmentation results from coded images, where bpp values also presented. (a) and (f): Ground truth, (b) and (g): ELIC, (c) [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Rate-accuracy curves of ablation studies on object detection task and instance segmentation task, where the task accuracy is measured with [PITH_FULL_IMAGE:figures/full_fig_p011_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Rate-accuracy curves of generalization studies, where the task accuracy is measured with mAP@50:95. (a) Analysis on COCO 2017 dataset, Faster [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗
read the original abstract

Traditional human vision-centric image compression methods are suboptimal for machine vision centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance-driven learned Image Coding for Machines (CI-ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non-uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance-based Context (CI-CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task-Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI-ICM achieves BD-mAP@50:95 gains of 16.25$\%$ in object detection and 13.72$\%$ in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI-ICM. This work establishes feature channel optimization for machine vision-centric compression, bridging the gap between image coding and machine perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CI-ICM, a learned image codec for machine vision tasks that introduces a Channel Importance Generation (CIG) module with channel order loss, a Feature Channel Grouping and Scaling (FCGS) module, a Channel Importance-based Context (CI-CTX) module, and a Task-Specific Channel Adaptation (TSCA) module. On COCO2017, it reports BD-mAP@50:95 gains of 16.25% for object detection and 13.72% for instance segmentation over a baseline codec, supported by ablations and complexity analysis.

Significance. If reproducible and generalizable, the work could advance machine-centric compression by demonstrating that non-uniform bit allocation based on learned channel importance improves downstream task performance at fixed rates. The explicit ablation studies and complexity analysis are strengths that support practical claims; however, the absence of baseline specifications and cross-task validation limits the assessed impact.

major comments (3)
  1. [Abstract] Abstract: The central performance claim (BD-mAP gains of 16.25% detection / 13.72% segmentation) is presented without naming the baseline codec, its rate points, or any statistical significance tests, preventing verification of the reported improvements.
  2. [Experimental Results] Experimental section (implied by results on COCO2017): The TSCA module is described as enabling adaptation to multiple tasks, yet only results for the two COCO tasks are shown; without cross-model or cross-task transfer experiments, it remains unclear whether the CIG-derived importance scores capture general machine-critical features or merely overfit to the specific detection/segmentation heads used in training.
  3. [Method] Method description (CIG and FCGS modules): The channel order loss and subsequent non-uniform grouping/scaling assume that importance scores derived from gradients or activations generalize across varied machine vision models, but no evidence is provided that the added modules avoid introducing distribution shifts harmful to unseen downstream models.
minor comments (2)
  1. [Abstract] The abstract states that 'computation complexity analysis reveals the practicability' but does not quantify the overhead of the CIG/FCGS/CI-CTX/TSCA modules relative to the baseline.
  2. [Method] Notation for channel importance scores and grouping is introduced without an explicit equation or diagram reference in the provided summary, which could be clarified for reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback and positive recognition of the ablation studies and complexity analysis. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim (BD-mAP gains of 16.25% detection / 13.72% segmentation) is presented without naming the baseline codec, its rate points, or any statistical significance tests, preventing verification of the reported improvements.

    Authors: We agree that the abstract should enable immediate verification. The baseline is the standard learned image codec (without CIG, FCGS, CI-CTX, or TSCA modules) as defined in Section III and used for all rate-distortion curves in Section IV. The BD-mAP@50:95 values are computed over the same set of rate points shown in Figures 3 and 4 (approximately 0.1–0.8 bpp). While statistical significance tests are not standard in learned compression literature, we will add a sentence to the abstract naming the baseline explicitly and referencing the rate points and evaluation protocol used in the experimental section. revision: yes

  2. Referee: [Experimental Results] Experimental section (implied by results on COCO2017): The TSCA module is described as enabling adaptation to multiple tasks, yet only results for the two COCO tasks are shown; without cross-model or cross-task transfer experiments, it remains unclear whether the CIG-derived importance scores capture general machine-critical features or merely overfit to the specific detection/segmentation heads used in training.

    Authors: The TSCA module is trained jointly with the two COCO tasks (detection and instance segmentation) that employ distinct heads, and the reported gains demonstrate that the same channel importance scores can be adapted to both. We acknowledge that this does not constitute full cross-model transfer (e.g., to classification or different backbones). We will revise the experimental section to explicitly state the scope of the current validation, add a limitations paragraph discussing potential task-specific overfitting, and note that TSCA fine-tuning would be required for new heads. revision: partial

  3. Referee: [Method] Method description (CIG and FCGS modules): The channel order loss and subsequent non-uniform grouping/scaling assume that importance scores derived from gradients or activations generalize across varied machine vision models, but no evidence is provided that the added modules avoid introducing distribution shifts harmful to unseen downstream models.

    Authors: The channel importance is computed from task-specific gradients and activations, and the channel order loss enforces a stable ranking that prioritizes task-critical channels. Ablation results (Table II) show consistent gains when CIG/FCGS are included, indicating that the non-uniform allocation improves rather than harms the tested tasks. We do not claim zero distribution shift for arbitrary unseen models; TSCA is designed precisely to mitigate task-specific shifts via adaptation. We will add a short discussion in Section III clarifying this scope and the role of TSCA for new models. revision: partial

standing simulated objections not resolved
  • Comprehensive experiments on completely unseen downstream models (different architectures or tasks without any fine-tuning) to quantify potential distribution shifts introduced by CIG/FCGS.

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper proposes a set of architectural modules (CIG for channel importance, FCGS for grouping/scaling, CI-CTX for context allocation, and TSCA for task adaptation) within a learned image codec and reports empirical BD-mAP gains on COCO2017 for detection and segmentation. No mathematical derivation, first-principles prediction, or fitted parameter is presented as a 'result' that reduces to its own inputs by construction. The central claims are performance measurements from training and evaluation, not self-referential definitions or renamed known patterns. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that would force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

The central claim rests on the domain assumption that feature channels can be meaningfully ranked for machine tasks and that the proposed modules can be trained end-to-end without harming rate-distortion behavior. No numerical free parameters are stated. The four modules are new invented components whose independent evidence is limited to the reported experiments.

axioms (1)
  • domain assumption Feature channels in learned codecs carry unequal importance for downstream machine vision tasks.
    Invoked to justify the CIG module and subsequent grouping.
invented entities (4)
  • Channel Importance Generation (CIG) module no independent evidence
    purpose: Quantify and rank channel importance for machine vision
    New component introduced to generate importance scores.
  • Feature Channel Grouping and Scaling (FCGS) module no independent evidence
    purpose: Non-uniform grouping and dynamic-range adjustment of channels
    New component for bitrate allocation.
  • Channel Importance-based Context (CI-CTX) module no independent evidence
    purpose: Context modeling that preserves fidelity in critical channels
    New component for entropy coding.
  • Task-Specific Channel Adaptation (TSCA) module no independent evidence
    purpose: Adapt features for multiple downstream machine tasks
    New component for multi-task support.

pith-pipeline@v0.9.0 · 5590 in / 1388 out tokens · 56155 ms · 2026-05-10T19:35:30.982826+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1]

    Towards efficient front-end visual sensing for digital retina: A model- centric paradigm,

    Y . Lou, L.-Y . Duan, Y . Luo, Z. Chen, T. Liu, S. Wang, and W. Gao, “Towards efficient front-end visual sensing for digital retina: A model- centric paradigm,”IEEE Trans. Multimedia, vol. 22, no. 11, pp. 3002– 3013, Nov. 2020

  2. [2]

    Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things,

    J. Zhang and D. Tao, “Empowering things with intelligence: A survey of the progress, challenges, and opportunities in artificial intelligence of things,”IEEE Internet Things J., vol. 8, no. 10, pp. 7789–7817, May 2021

  3. [3]

    Device- edge-cloud collaborative acceleration method towards occluded face recognition in high-traffic areas,

    P. Zhang, F. Huang, D. Wu, B. Yang, Z. Yang, and L. Tan, “Device- edge-cloud collaborative acceleration method towards occluded face recognition in high-traffic areas,”IEEE Trans. Multimedia, vol. 25, pp. 1513–1520, Mar. 2023

  4. [4]

    Overview of the versatile video coding (vvc) standard and its applications,

    B. Bross, Y .-K. Wang, Y . Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the versatile video coding (vvc) standard and its applications,”IEEE Trans. Circuit Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, Oct. 2021

  5. [5]

    Hnr-isc: Hybrid neural representation for image set compression,

    P. Zhang, S. Wang, M. Wang, P. Chen, W. Wu, X. Wang, and S. Kwong, “Hnr-isc: Hybrid neural representation for image set compression,”IEEE Trans. Multimedia, vol. 27, pp. 28–40, Dec. 2025

  6. [6]

    Recent advances in end-to-end learned image and video compression,

    W.-H. Peng and H.-M. Hang, “Recent advances in end-to-end learned image and video compression,” inIEEE Int. Conf. Vis. Commun. Image Process., Macau, China, Dec. 2020, pp. 1–2

  7. [7]

    Video coding for machines: A paradigm of collaborative compression and intelligent analytics,

    L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,”IEEE Trans. Image Process., vol. 29, pp. 8680–8695, Aug. 2020

  8. [8]

    Task-driven video compression for humans and machines: Framework design and opti- mization,

    X. Yi, H. Wang, S. Kwong, and C.-C. Jay Kuo, “Task-driven video compression for humans and machines: Framework design and opti- mization,”IEEE Trans. Multimedia, vol. 25, pp. 8091–8102, Dec. 2023

  9. [9]

    Just noticeable difference for deep machine vision,

    J. Jin, X. Zhang, X. Fu, H. Zhang, W. Lin, J. Lou, and Y . Zhao, “Just noticeable difference for deep machine vision,”IEEE Trans. Circuit Syst. Video Technol., vol. 32, no. 6, pp. 3452–3461, Jun. 2022

  10. [10]

    Task-aware quantization network for jpeg image compression,

    J. Choi and B. Han, “Task-aware quantization network for jpeg image compression,” inEur. Conf. Comput. Vis., Nov. 2020, pp. 309–324

  11. [11]

    Visual analysis motivated rate- distortion model for image coding,

    Z. Huang, C. Jia, S. Wang, and S. Ma, “Visual analysis motivated rate- distortion model for image coding,” inIEEE Int. Conf. Multimedia Expo, Shenzhen, China, Jun. 2021, pp. 1–6

  12. [12]

    Saliency segmentation oriented deep image compression with novel bit allocation,

    Y . Li, W. Gao, G. Li, and S. Ma, “Saliency segmentation oriented deep image compression with novel bit allocation,”IEEE Trans. Image Process., vol. 34, pp. 16–29, Nov. 2025

  13. [13]

    Learning to predict object-wise just recognizable distortion for image and video compression,

    Y . Zhang, H. Lin, J. Sun, L. Zhu, and S. Kwong, “Learning to predict object-wise just recognizable distortion for image and video compression,”IEEE Trans. Multimedia, vol. 26, pp. 5925–5938, Dec. 2024

  14. [14]

    Towards coding for human and machine vision: Scalable face image coding,

    S. Yang, Y . Hu, W. Yang, L.-Y . Duan, and J. Liu, “Towards coding for human and machine vision: Scalable face image coding,”IEEE Trans. Multimedia, vol. 23, pp. 2957–2971, Mar. 2021

  15. [15]

    Image coding for machines with edge information learning using segment anything,

    T. Shindo, K. Yamada, T. Watanabe, and H. Watanabe, “Image coding for machines with edge information learning using segment anything,” inIEEE Int. Conf. Image Process., Abu Dhabi, UAE, Oct. 2024, pp. 3702–3708

  16. [16]

    Rethink- ing semantic image compression: Scalable representation with cross- modality transfer,

    P. Zhang, S. Wang, M. Wang, J. Li, X. Wang, and S. Kwong, “Rethink- ing semantic image compression: Scalable representation with cross- modality transfer,”IEEE Trans. Circuit Syst. Video Technol., vol. 33, no. 8, pp. 4441–4445, Aug. 2023

  17. [17]

    End-to-end compression towards machine vision: Network architecture design and optimization,

    S. Wang, Z. Wang, S. Wang, and Y . Ye, “End-to-end compression towards machine vision: Network architecture design and optimization,” IEEE Open J. Circuits Syst., vol. 2, pp. 675–685, Nov. 2021

  18. [18]

    Learned image coding for machines: A content-adaptive approach,

    N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, H. R. Tavakoli, and E. Rahtu, “Learned image coding for machines: A content-adaptive approach,” inIEEE Int. Conf. Multimedia Expo, Shenzhen, China, Nov. 2021, pp. 1–6

  19. [19]

    Image coding for machines: an end-to-end learned approach,

    N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, and E. Rahtu, “Image coding for machines: an end-to-end learned approach,” inICASSP IEEE Int. Conf. Acoust. Speech Signal Process, Jun. 2021, pp. 1590–1594

  20. [20]

    End-to-end learning of compressible features,

    S. Singh, S. Abu-El-Haija, N. Johnston, J. Ball ´e, A. Shrivastava, and G. Toderici, “End-to-end learning of compressible features,” inIEEE Int. Conf. Image Process., Abu Dhabi, UAE, Oct. 2020, pp. 3349–3353

  21. [21]

    SC2 benchmark: Supervised compression for split computing,

    Y . Matsubara, R. Yang, M. Levorato, and S. Mandt, “SC2 benchmark: Supervised compression for split computing,”Trans. Mach. Learn. Res., pp. 1–20, Jun. 2023

  22. [22]

    Multiscale feature importance-based bit allocation for end-to-end feature coding for ma- chines,

    J. Liu, Y . Zhang, Z. Guo, X. Huang, and G. Jiang, “Multiscale feature importance-based bit allocation for end-to-end feature coding for ma- chines,”ACM Trans. Multimed. Comput. Commun. Appl., vol. 21, no. 9, pp. 1–19, Sep. 2025

  23. [23]

    End-to-end optimized image compression for machines, a study,

    L. D. Chamain, F. Racap ´e, J. B ´egaint, A. Pushparaja, and S. Feltman, “End-to-end optimized image compression for machines, a study,” in Data Compression Conf., Snowbird, UT, USA, Mar. 2021, pp. 163–172

  24. [24]

    Rate-distortion theory in coding for machines and its applications,

    A. Harell, Y . Foroutan, N. Ahuja, P. Datta, B. Kanzariya, V . S. So- mayazulu, O. Tickoo, A. de Andrade, and I. V . Baji ´c, “Rate-distortion theory in coding for machines and its applications,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 7, pp. 5501–5519, Jul. 2025

  25. [25]

    Improving multiple machine vision tasks in the compressed domain,

    J. Liu, H. Sun, and J. Katto, “Improving multiple machine vision tasks in the compressed domain,” inInt. Conf. Pattern Recog., Montreal, QC, Canada, Aug. 2022, pp. 331–337

  26. [26]

    Scalable image coding for humans and machines,

    H. Choi and I. V . Baji ´c, “Scalable image coding for humans and machines,”IEEE Trans. Image Process., vol. 31, pp. 2739–2754, Mar. 2022

  27. [27]

    Latent-space scalability for multi-task collabo- rative intelligence,

    H. Choi and I. V . Baji ´c, “Latent-space scalability for multi-task collabo- rative intelligence,” inIEEE Int. Conf. Image Process., Anchorage, AK, USA, Sep. 2021, pp. 3562–3566

  28. [28]

    Unified and scalable deep image compression framework for human and machine,

    G. Zhang, X. Zhang, and L. Tang, “Unified and scalable deep image compression framework for human and machine,”ACM Trans. Multi- media Comput. Commun. Appl., vol. 20, no. 10, pp. 1–22, Oct. 2024

  29. [29]

    Learned disentangled latent representations for scalable image coding for humans and ma- chines,

    E. ¨Ozyılkan, M. Ulhaq, H. Choi, and F. Racap ´e, “Learned disentangled latent representations for scalable image coding for humans and ma- chines,” inData Compression Conf., Snowbird, UT, USA, Mar. 2023, pp. 42–51

  30. [30]

    Learnt mutual feature compression for machine vision,

    T. Liu, M. Xu, S. Li, C. Chen, L. Yang, and Z. Lv, “Learnt mutual feature compression for machine vision,” inIEEE Int. Conf. Acoust. Speech Signal Process., Rhodes Island, Greece, Jun. 2023, pp. 1–5

  31. [31]

    Semantically scalable image coding with compression of feature maps,

    N. Yan, D. Liu, H. Li, and F. Wu, “Semantically scalable image coding with compression of feature maps,” inIEEE Int. Conf. Image Process., Abu Dhabi, UAE, Oct. 2020, pp. 3114–3118

  32. [32]

    Semantic and saliency-aware scalable image coding towards human-machine collaboration,

    T. Cui, Y . Wang, Y . Wang, and Z. Fang, “Semantic and saliency-aware scalable image coding towards human-machine collaboration,”IEEE Trans. Circuit Syst. Video Technol., pp. 1–1, May 2025

  33. [33]

    Transtic: Transferring transformer-based image compression from hu- man perception to machine perception,

    Y .-H. Chen, Y . Weng, C.-H. Kao, C. Chien, W.-C. Chiu, and W. Peng, “Transtic: Transferring transformer-based image compression from hu- man perception to machine perception,” inIEEE/CVF Int. Conf. Comput. Vis., Paris, France, Oct. 2023, pp. 23 240–23 250

  34. [34]

    Im- age compression for machine and human vision with spatial-frequency adaptation,

    H. Li, S. Li, S. Ding, W. Dai, M. Cao, C. Li, J. Zou, and H. Xiong, “Im- age compression for machine and human vision with spatial-frequency adaptation,” inEur. Conf. Comput. Vis., Milan, Italy, Oct. 2024, pp. 382–399

  35. [35]

    Faster r-cnn: Towards real-time object detection with region proposal networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017

  36. [36]

    Elic: Efficient learned image compression with unevenly grouped space- channel contextual adaptive coding,

    D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y . Wang, “Elic: Efficient learned image compression with unevenly grouped space- channel contextual adaptive coding,” inIEEE/CVF Conf. Comput. Vis. Pattern Recog., New Orleans, LA, USA, Jun. 2022, pp. 5708–5717

  37. [37]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in IEEE/CVF Conf. Comput. Vis. Pattern Recog., Salt Lake City, UT, USA, Jun. 2018, pp. 7132–7141

  38. [38]

    Mlic: Multi- reference entropy model for learned image compression,

    W. Jiang, J. Yang, Y . Zhai, P. Ning, F. Gao, and R. Wang, “Mlic: Multi- reference entropy model for learned image compression,” inACM Int. Conf. Multimedia, New York, NY , USA, Oct. 2023, pp. 7618 – 7627

  39. [39]

    Real- time evaluation of object detection models across open world scenarios,

    P. Goswami, L. Aggarwal, A. Kumar, R. Kanwar, and U. Vasisht, “Real- time evaluation of object detection models across open world scenarios,” Appl. Soft Comput., vol. 163, p. 111921, Sep. 2024

  40. [40]

    On biasing transformer attention towards monotonicity,

    A. R. Gonzales, C. Amrhein, N. Aepli, and R. Sennrich, “On biasing transformer attention towards monotonicity,” inN. Am. Chapter Assoc. Comput. Linguist.Online: Association for Computational Linguistics, Jun. 2021, pp. 4474–4488

  41. [41]

    Mask r-cnn,

    K. He, G. Gkioxari, P. Doll ´ar, and R. Girshick, “Mask r-cnn,” inIEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 2980–2988

  42. [42]

    The jpeg 2000 still image compression standard,

    A. Skodras, C. Christopoulos, and T. Ebrahimi, “The jpeg 2000 still image compression standard,”IEEE Signal Process. Mag., vol. 18, no. 5, pp. 36–58, Sep. 2001

  43. [43]

    The pascal visual object classes (voc) challenge,

    M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisser- man, “The pascal visual object classes (voc) challenge,”Int. J. Comput. Vision, vol. 88, no. 2, pp. 303–338, Sep. 2010. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY , VOL.XX, NO.XX, 2025 15 Yun Zhang(Senior Member, IEEE) received the B.S. and M.S. degrees in electrical e...