
arxiv: 2402.08267 · v3 · submitted 2024-02-13 · 💻 cs.CV · cs.AI

Improving Image Coding for Machines through Optimizing Encoder via Auxiliary Loss

Pith reviewed 2026-05-24 03:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image coding for machines · auxiliary loss · encoder optimization · object detection · semantic segmentation · rate-distortion performance · learned image compression

The pith

Applying auxiliary loss to the encoder in learned image coding for machines improves recognition capability and yields large BD-rate savings for object detection and semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a training method for image coding for machines that adds an auxiliary loss term directly on the encoder outputs to strengthen the encoder's ability to retain task-relevant visual information. This sidesteps the optimization problems that arise when backpropagating a full deep recognition loss through the compressor and avoids the extra runtime cost of region-of-interest bit allocation. The resulting models deliver better rate-distortion curves when the compressed bitstream is fed to downstream machine tasks. A reader cares because the method offers a lightweight way to make learned codecs task-aware without the usual training or evaluation penalties.

Core claim

The central claim is that adding an auxiliary loss to the encoder during training of learned ICM models supplies effective recognition supervision, thereby improving both the encoder's recognition capability and the overall rate-distortion performance; the method records Bjontegaard Delta rate gains of 27.7 percent on object detection and 20.3 percent on semantic segmentation relative to conventional task-loss training.
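The headline numbers are Bjontegaard Delta rates, which summarize the average bitrate difference between two rate-quality curves at equal task quality. The sketch below is a simplified illustration, not the paper's evaluation code: it assumes piecewise-linear interpolation of log-bitrate instead of the cubic fit commonly used for BD-rate, and the function names and sample points are hypothetical.

```python
import math

def _interp(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the curve")

def _mean_log_rate(points, lo, hi, steps):
    """Trapezoidal average of log(bitrate) over the quality range [lo, hi]."""
    pts = sorted(points, key=lambda p: p[1])
    quality = [q for _, q in pts]
    log_rate = [math.log(r) for r, _ in pts]
    total = 0.0
    for i in range(steps + 1):
        q = min(hi, lo + (hi - lo) * i / steps)  # clamp against rounding
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * _interp(quality, log_rate, q)
    return total / steps

def bd_rate(anchor, test, steps=1000):
    """Average bitrate change (%) of `test` relative to `anchor`.

    Each curve is a list of (bitrate, quality) pairs, where quality is a
    task metric such as mAP or mIoU. Negative values mean `test` needs
    fewer bits for the same quality.
    """
    lo = max(min(q for _, q in anchor), min(q for _, q in test))
    hi = min(max(q for _, q in anchor), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")
    diff = _mean_log_rate(test, lo, hi, steps) - _mean_log_rate(anchor, lo, hi, steps)
    return (math.exp(diff) - 1.0) * 100.0

# Hypothetical curves: the test codec reaches each quality at half the bitrate.
anchor_curve = [(0.2, 30.0), (0.4, 35.0), (0.8, 38.0)]
test_curve = [(0.1, 30.0), (0.2, 35.0), (0.4, 38.0)]
print(bd_rate(anchor_curve, test_curve))  # close to -50
```

Under this sign convention a saving shows up as a negative number, so a reported improvement of 27.7 percent would correspond to a BD-rate near -27.7 if the paper follows the same convention.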

What carries the argument

Auxiliary loss applied to the encoder, which injects recognition supervision into the compression model to guide bit allocation toward task-salient features.

If this is right

  • The encoder preserves more task-relevant information at any given bitrate, directly lowering the bits needed for equivalent machine-task performance.
  • Training remains stable even when the downstream recognition model is deep, because the auxiliary loss bypasses full task-loss back-propagation.
  • No additional overhead appears at inference time, unlike ROI-based allocation schemes.
  • The same auxiliary-loss recipe can be applied to existing learned compression architectures without architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may transfer to other machine-vision tasks such as instance segmentation or depth estimation if suitable auxiliary heads are defined.
  • Combining the auxiliary loss with modern entropy models or attention-based compressors could produce further additive gains.
  • The method invites direct comparison against end-to-end task-loss training on identical backbone networks to isolate the contribution of the auxiliary term.

Load-bearing premise

The auxiliary loss can deliver useful recognition supervision to the encoder without introducing new optimization instability or requiring extra evaluation-time computation.

What would settle it

A controlled experiment in which the auxiliary-loss encoder produces no measurable improvement in downstream task accuracy or BD-rate on standard detection and segmentation benchmarks would falsify the central claim.

Original abstract

Image coding for machines (ICM) aims to compress images for machine analysis using recognition models rather than human vision. Hence, in ICM, it is important for the encoder to recognize and compress the information necessary for the machine recognition task. There are two main approaches in learned ICM: optimization of the compression model based on task loss, and Region of Interest (ROI) based bit allocation. These approaches provide the encoder with the recognition capability. However, optimization with task loss becomes difficult when the recognition model is deep, and ROI-based methods often involve extra overhead during evaluation. In this study, we propose a novel training method for learned ICM models that applies auxiliary loss to the encoder to improve its recognition capability and rate-distortion performance. Our method achieves Bjontegaard Delta rate improvements of 27.7% and 20.3% in object detection and semantic segmentation tasks, compared to the conventional training method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes applying an auxiliary loss directly to the encoder in learned image coding for machines (ICM) models to supply recognition supervision, avoiding the optimization difficulties of full task-loss training and the evaluation overhead of ROI-based methods. It reports Bjontegaard Delta rate improvements of 27.7% on object detection and 20.3% on semantic segmentation relative to a conventional training baseline.

Significance. If reproducible, the auxiliary-loss approach offers a lightweight way to inject task awareness into the encoder, which could simplify ICM pipelines and improve rate-distortion performance for downstream machine vision without requiring deep back-propagation through the task model or extra inference-time mechanisms.

minor comments (3)
  1. The abstract and introduction cite specific BD-rate numbers, but the experimental section should explicitly list the datasets (e.g., COCO, Cityscapes), task models, codec architecture, and training hyperparameters so that the 27.7% and 20.3% figures can be independently verified.
  2. Figure captions and axis labels in the rate-distortion curves should state the exact metric (mAP, mIoU) and the anchor codec used for the BD-rate calculation.
  3. The weighting hyperparameter of the auxiliary loss is listed as a free parameter; the paper should report a sensitivity analysis or at least the value chosen for the reported experiments.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. The report raises no major comments, so we have no specific points requiring rebuttal or clarification at this stage. We will incorporate any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity; empirical method comparison is self-contained

full rationale

The paper proposes an auxiliary-loss training technique for learned image coding for machines and reports empirical BD-rate gains versus a conventional baseline. No derivation chain, first-principles prediction, or fitted parameter is presented as a 'result'; the central claims are direct experimental outcomes from controlled training runs. No self-citation is invoked as load-bearing justification, and the method description does not reduce to a renaming or self-definition of its inputs. The performance numbers are presented as measured improvements, not as outputs forced by construction from the training procedure itself.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Review is abstract-only; no explicit free parameters, axioms, or invented entities are described. The method implicitly relies on standard deep-learning training assumptions such as the existence of a differentiable encoder and a compatible recognition model.

free parameters (1)
  • auxiliary loss weighting hyperparameter
    Typical balancing term between auxiliary loss and rate-distortion loss; not specified in abstract.
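The paper excerpt quoted in the reference graph gives the segmentation training loss as L = R + λ{E(y, ŷ) + αE(y, ŷ_aux) + MSE(x, x̂)}. The sketch below shows how the weighting hyperparameter α enters that expression. It is a scalar illustration only: reading R as the rate term, E as the task error on the main and auxiliary outputs, and λ as the rate-distortion trade-off is an assumption drawn from context, and the function name and sample values are hypothetical.

```python
def icm_training_loss(rate, task_err, aux_task_err, recon_mse, lam, alpha):
    """Combine the loss terms as in the quoted Eq. (6).

    rate         -- R, estimated bitrate of the latent (assumed meaning)
    task_err     -- E(y, y_hat), task error of the main recognition output
    aux_task_err -- E(y, y_hat_aux), task error of the auxiliary output
    recon_mse    -- MSE(x, x_hat), image reconstruction error
    lam          -- lambda, trade-off between rate and the bracketed terms
    alpha        -- weight of the auxiliary term (the free parameter)
    """
    return rate + lam * (task_err + alpha * aux_task_err + recon_mse)

# Hypothetical values: alpha = 0 switches the auxiliary term off entirely.
without_aux = icm_training_loss(1.0, 2.0, 4.0, 0.5, lam=0.1, alpha=0.0)
with_aux = icm_training_loss(1.0, 2.0, 4.0, 0.5, lam=0.1, alpha=0.5)
print(without_aux, with_aux)
```

Setting α to zero removes the auxiliary term while leaving the rest of the expression unchanged, which is why the value of α matters when comparing against a baseline.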



Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

  1. [1]

    The scenarios in which machine analysis systems are utilized are generally classified into edge computing and cloud computing

    INTRODUCTION In recent years, the performance of deep neural networks (DNNs) has seen remarkable improvements, leading to their widespread use in various machine analysis systems such as video surveillance systems and speech recognition systems. The scenarios in which machine analysis systems are utilized are generally classified into edge computing and c...

  2. [2]

    RELATED WORK 2.1. Learned image compression Learned image compression has attracted much attention in recent years because its compression performance has been greatly improved and even exceeds the performance of conventional hand-crafted codecs such as HEVC and VVC [16,17]. The compression model typically consists of encoder, entropy model and decoder, a...

  3. [3]

    For this reason, many studies take task loss-based optimization [3-5] or ROI-based bit allocation approaches [8,9]

    PROPOSED METHOD In ICM, it is ideal to identify and extract only the information needed for the recognition task, and to compress it. For this reason, many studies take task loss-based optimization [3-5] or ROI-based bit allocation approaches [8,9]. However, it has been reported that training using task loss can make it difficult to optimize the ICM model...

  4. [4]

    EXPERIMENTS We evaluated the proposed method on the object detection and semantic segmentation tasks. As a baseline model, we used a compression model trained without auxiliary loss, and for object detection, we also used the compression model with the ROI-based method [8] applied to the baseline. 4.1 Main experiments Experimental setup: For the object de...

  5. [5]

    As in the object detection task, we followed the training manner [5]

    as the recognition model. As in the object detection task, we followed the training manner [5]. Since training with the loss function in Eq. (5) could not reduce the validation loss of the baseline model, we added the image reconstruction error to stabilize the training: $L = R + \lambda\{E(y, \hat{y}) + \alpha E(y, \hat{y}_{aux}) + \mathrm{MSE}(x, \hat{x})\}$. (6) As the dataset, we used Pasca...

  6. [6]

    Our proposed method imposes the auxiliary loss on the encoder of a compression model via a lightweight recognition model during training

    CONCLUSION In this study, we propose a novel training method for ICM models using auxiliary loss. Our proposed method imposes the auxiliary loss on the encoder of a compression model via a lightweight recognition model during training. This approach improves the encoder's recognition capability and R-D performance without any additional overhead during in...

  7. [7]

    High efficiency video coding,

    “High efficiency video coding,” Rec. ITU-T H.265 and ISO/IEC 23008-2, 2019, Int. Telecomm. Union-Telecomm. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC JTC 1)

  8. [8]

    Versatile video coding,

    “Versatile video coding,” Rec. ITU-T H.266 and ISO/IEC 23090-3, 2020, Int. Telecomm. Union-Telecomm. (ITU-T) and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC JTC 1)

  9. [9]

    Image coding for machines: an end-to-end learned approach,

    N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, and E. Rahtu, “Image coding for machines: an end-to-end learned approach,” 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1590–1594, Jun. 2021

  10. [10]

    Rate-distortion in image coding for machines,

    A. Harell, A. De Andrade, and I. V. Bajić, “Rate-distortion in image coding for machines,” 2022 Picture Coding Symposium (PCS), pp. 199–203, Dec. 2022

  11. [11]

    Deep Feature Compression using Rate-Distortion Optimization Guided Autoencoder,

    M. Yamazaki, Y. Kora, T. Nakao, X. Lei, and K. Yokoo, “Deep Feature Compression using Rate-Distortion Optimization Guided Autoencoder,” 2022 IEEE International Conference on Image Processing (ICIP), pp. 1216–1220, Oct. 2022

  12. [12]

    Visual analysis motivated rate-distortion model for image coding,

    Z. Huang, C. Jia, S. Wang, and S. Ma, “Visual analysis motivated rate-distortion model for image coding,” 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, Jul. 2021

  13. [13]

    High Efficiency Compression for Object Detection,

    H. Choi and I. V. Bajić, “High Efficiency Compression for Object Detection,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1792–1796, Apr. 2018

  14. [14]

    Region of Interest Enabled Learned Image Coding for Machines,

    J. I. Ahonen, N. Le, H. Zhang, F. Cricri, and E. Rahtu, “Region of Interest Enabled Learned Image Coding for Machines,” 2023 IEEE International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6, Sep. 2023

  15. [15]

    Region-of-interest and channel attention-based joint optimization of image compression and computer vision,

    B. Li, L. Ye, J. Liang, Y. Wang, and J. Han, “Region-of-interest and channel attention-based joint optimization of image compression and computer vision,” Neurocomputing, vol. 500, pp. 13–25, Aug. 2022

  16. [16]

    [VCM] On VCM reporting template,

    C. Hollmann, S. Liu, W. Gao, and X. Xu, “[VCM] On VCM reporting template,” ISO/IEC JTC 1/SC 29/WG 2, m56185, Jan. 2021

  17. [17]

    SC2 benchmark: Supervised compression for split computing,

    Y. Matsubara, R. Yang, M. Levorato, and S. Mandt, “SC2 benchmark: Supervised compression for split computing,” Transactions on Machine Learning Research, ISSN 2835-8856, Jun. 2023

  18. [18]

    Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems,

    Y. Matsubara, D. Callegaro, S. Baidya, M. Levorato, and S. Singh, “Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems,” IEEE Access, vol. 8, pp. 212177–212193, Nov. 2020

  19. [19]

    BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing,

    Y. Matsubara, D. Callegaro, S. Singh, M. Levorato, and F. Restuccia, “BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing,” 2022 IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), pp. 337–346, Jun. 2022

  20. [20]

    Relay backpropagation for effective learning of deep convolutional neural networks,

    Q. Huang, L. Shen, and Z. Lin, “Relay backpropagation for effective learning of deep convolutional neural networks,” 2016 European Conference on Computer Vision (ECCV), pp. 467–482, Oct. 2016

  21. [21]

    Deeply supervised nets,

    C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply supervised nets,” 2015 International Conference on Artificial Intelligence and Statistics (AISTATS), May 2015

  22. [22]

    Learned Image Compression with Mixed Transformer-CNN Architectures,

    J. Liu, H. Sun, and J. Katto, “Learned Image Compression with Mixed Transformer-CNN Architectures,” 2023 The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14388–14397, Jun. 2023

  23. [23]

    Learned image compression with discretized gaussian mixture likelihoods and attention modules,

    Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized gaussian mixture likelihoods and attention modules,” 2020 The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7939–7948, Jun. 2020

  24. [24]

    Noise or signal: The role of image backgrounds in object recognition,

    K. Y. Xiao, L. Engstrom, A. Ilyas, and A. Madry, “Noise or signal: The role of image backgrounds in object recognition,” 2020 International Conference on Learning Representations (ICLR), Apr. 2020

  25. [25]

    CompressAI: a pytorch library and evaluation platform for end-to-end compression research,

    J. Begaint, F. Racape, S. Feltman, and A. Pushparaja, “CompressAI: a pytorch library and evaluation platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029, Nov. 2020

  26. [26]

    Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,

    S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” 2015 Advances in Neural Information Processing Systems 28 (NIPS), Vol. 1, pp. 91–99, Dec. 2015

  27. [27]

    Detectron2,

    Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,” https://github.com/facebookresearch/detectron2, 2019

  28. [28]

    Aggregated residual transformations for deep neural networks,

    S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” 2017 The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017

  29. [29]

    L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” The European Conference on Computer Vision (ECCV), Sep. 2018

  30. [30]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016

  31. [31]

    Microsoft COCO: Common Objects in Context,

    T. Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, L. Zitnick, and P. Dollar, “Microsoft COCO: Common Objects in Context,” 2014 European Conference on Computer Vision (ECCV), pp. 740–755, Sep. 2014

  32. [32]

    OpenMMLab Semantic Segmentation Toolbox and Benchmark,

    MMSegmentation Contributors, “OpenMMLab Semantic Segmentation Toolbox and Benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020

  33. [33]

    The Role of Context for Object Detection and Semantic Segmentation in the Wild,

    R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille, “The Role of Context for Object Detection and Semantic Segmentation in the Wild,” 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 891–898, Jun. 2014

  34. [34]

    Adapting auxiliary losses using gradient similarity,

    Y. Du, W. M. Czarnecki, S. M. Jayakumar, R. Pascanu, and B. Lakshminarayanan, “Adapting auxiliary losses using gradient similarity,” arXiv preprint arXiv:1812.02224, 2018