
arXiv: 2604.09421 · v1 · submitted 2026-04-10 · eess.IV · cs.CV · cs.MM

Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application

Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3

classification: eess.IV · cs.CV · cs.MM
keywords: just recognizable difference · video coding for machines · multi-task learning · object detection · instance segmentation · keypoint detection · perceptual modeling · attribute fusion

The pith

An attribute-assisted multi-task model predicts just recognizable differences across three machine vision tasks to support efficient video coding for machines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a dataset of 27,264 machine-generated JRD annotations for object detection, instance segmentation, and keypoint detection. It introduces the AMT-JRD model, which combines generalized feature extraction, task-specific feature extraction, and fusion of object attributes such as size and location to predict JRD thresholds for all three tasks jointly. Joint learning across tasks addresses the single-task limitation of prior JRD methods, while attribute fusion compensates for the shortcomings of image features alone. A reader would care because accurate multi-task JRD prediction lets a codec discard detail that machine vision models do not rely on, cutting bit rate while preserving task performance in areas like surveillance or autonomous systems. The reported experiments show lower prediction errors than the single-task baseline and BD-mAP gains when the predictions guide coding.

Core claim

The authors show that the AMT-JRD model, trained on the new MT-JRD dataset, achieves a mean absolute error of 3.781 and error variance of 5.332 across the three tasks by integrating GFEM, SFEM, and AFFM modules, outperforming state-of-the-art single-task prediction by 6.7% and 6.3% respectively, and delivering average BD-mAP improvements of 3.861% over VVC and 7.886% over JPEG when applied to VCM.
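To make the two error metrics concrete, the following is a minimal Python sketch of how mean absolute error and error variance could be computed over paired JRD values. The function name `jrd_error_stats`, the sample values, and the reading of "error variance" as the population variance of the absolute errors are assumptions made for illustration; this summary does not state the paper's exact definition.

```python
from statistics import fmean, pvariance

def jrd_error_stats(predicted, annotated):
    """Return (MAE, variance of absolute errors) for paired JRD values."""
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - a) for p, a in zip(predicted, annotated)]
    # Assumption: "error variance" is the population variance of |error|.
    return fmean(abs_errors), pvariance(abs_errors)

# Hypothetical JRD levels, not data from the paper.
pred = [30, 34, 41, 27]
true = [32, 34, 38, 30]
mae, var = jrd_error_stats(pred, true)
```

Under this reading, a lower MAE means predictions sit closer to the annotated thresholds on average, and a lower variance means the errors are more uniform across objects.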

What carries the argument

The Attribute-assisted Multi-Task JRD (AMT-JRD) prediction model carries the argument. It uses a Generalized Feature Extraction Module (GFEM), a Specialized Feature Extraction Module (SFEM), and an Attribute Feature Fusion Module (AFFM) to jointly estimate object-wise JRDs, incorporating prior knowledge of object size and location.
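As a rough illustration of what size and location priors can look like as model inputs, the sketch below encodes a bounding box as three normalized attributes and appends them to an image feature vector. The `(x, y, w, h)` box format, the choice of attributes, and fusion by plain concatenation are all assumptions; the actual AFFM design is not described in this summary and may differ substantially.

```python
def object_attributes(bbox, image_width, image_height):
    """Encode size and location of a box (x, y, w, h in pixels) in [0, 1]."""
    x, y, w, h = bbox
    rel_area = (w * h) / (image_width * image_height)  # object size prior
    center_x = (x + w / 2) / image_width               # horizontal location
    center_y = (y + h / 2) / image_height              # vertical location
    return [rel_area, center_x, center_y]

def fuse(image_features, attributes):
    """Hypothetical fusion: append the attributes to the feature vector."""
    return list(image_features) + list(attributes)

# Illustrative values only.
attrs = object_attributes((100, 50, 200, 100), image_width=800, image_height=400)
fused = fuse([0.12, -0.40, 0.88], attrs)
```

Normalizing by the image dimensions keeps the attributes comparable across resolutions, which is one plausible reason to feed them explicitly rather than rely on the feature extractor to recover them.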

If this is right

  • The predicted JRDs can be used to reduce coding bit rates in VCM pipelines while keeping accuracy high across multiple tasks simultaneously.
  • Object attribute fusion compensates for limitations of image features alone, leading to more robust threshold estimates.
  • The same model architecture supports joint optimization for object detection, instance segmentation, and keypoint detection without separate predictors.
  • Applied on top of VVC and JPEG, the predictions yield average BD-mAP gains of 3.861% and 7.886% over the respective baseline codecs.
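One simple way to picture how per-task predictions could steer a shared bitstream is to take, for each object, the most conservative threshold across tasks. The sketch below assumes that a JRD is expressed as a compression level where a higher value means stronger compression, and that the task keys are `od`, `is`, and `kpd`; both are illustrative assumptions, and the paper's actual coding workflow may combine thresholds differently.

```python
def safe_level(task_jrds):
    """Pick the strongest compression level within every task's JRD."""
    if not task_jrds:
        raise ValueError("need at least one task threshold")
    # Assumption: compressing beyond a task's JRD degrades that task,
    # so the minimum across tasks is the shared safe level.
    return min(task_jrds.values())

def plan_levels(objects):
    """Map object id -> shared compression level for all tasks."""
    return {obj_id: safe_level(jrds) for obj_id, jrds in objects.items()}

# Hypothetical predicted thresholds, not values from the paper.
predicted = {
    "person_0": {"od": 37, "is": 34, "kpd": 32},
    "car_1": {"od": 40, "is": 39},
}
levels = plan_levels(predicted)
```

Taking the minimum trades some bit-rate saving for the guarantee that no task is pushed past its own predicted threshold.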

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attribute-fusion idea could be tested on video sequences with motion to see if temporal object attributes further improve JRD accuracy.
  • The dataset construction method might scale to collect labels for additional machine tasks such as action recognition or tracking.
  • If the multi-task predictions generalize, they could serve as a starting point for standardized perceptual models in future machine-oriented compression standards.

Load-bearing premise

The 27,264 machine-generated JRD annotations collected for the three tasks represent the perceptual behavior of real-world machine vision systems on unseen content and additional tasks.

What would settle it

A direct test applying the AMT-JRD predictions to new video sequences or a fourth machine task and finding no BD-mAP gain or a drop in task accuracy would show the claim does not hold.
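For readers unfamiliar with the metric, a Bjontegaard-style delta summarizes the average vertical gap between two rate-accuracy curves over their shared bit-rate range. The sketch below is a simplified stand-in: it uses piecewise-linear interpolation over log10 bit rate instead of the polynomial fit used in the standard Bjontegaard procedure, and the function names and sample curves are hypothetical.

```python
import math

def _interp(x, xs, ys):
    """Piecewise-linear interpolation at x; xs must be strictly increasing."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x outside curve range")

def _area(xs, ys, lo, hi, steps=1000):
    """Trapezoidal integral of the interpolated curve over [lo, hi]."""
    width = (hi - lo) / steps
    vals = [_interp(min(lo + i * width, hi), xs, ys) for i in range(steps + 1)]
    return width * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def bd_map(anchor, test):
    """Average mAP gap (test minus anchor) over the shared log-bitrate range.

    Each curve is a list of (bitrate, mAP) points. Simplification: linear
    interpolation replaces the usual polynomial fit.
    """
    ax, ay = zip(*sorted((math.log10(r), m) for r, m in anchor))
    tx, ty = zip(*sorted((math.log10(r), m) for r, m in test))
    lo, hi = max(ax[0], tx[0]), min(ax[-1], tx[-1])
    if lo >= hi:
        raise ValueError("curves do not overlap in bitrate")
    return (_area(tx, ty, lo, hi) - _area(ax, ay, lo, hi)) / (hi - lo)

# Hypothetical curves: the test codec is 0.02 mAP better at every rate.
anchor_curve = [(100, 0.30), (1000, 0.40), (10000, 0.50)]
test_curve = [(100, 0.32), (1000, 0.42), (10000, 0.52)]
gain = bd_map(anchor_curve, test_curve)
```

The result is in the same unit as the mAP inputs. A value at or below zero on new sequences or a new task would be the kind of outcome that undercuts the claim.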

Figures

Figures reproduced from arXiv: 2604.09421 by Junqi Liu, Long Xu, Weisi Lin, Xiaoxia Huang, Yun Zhang.

Figure 1: Visualized examples of MT-JRD. (a) Performance degradation in multi-task ...
Figure 2: Paradigm of MT-JRD including database, model, and coding application.
Figure 3: The pipeline for constructing the MT-JRD dataset.
Figure 4: MT-JRD quantity distribution under different ...
Figure 5: Distortion distribution of the MT-JRD dataset.
Figure 6: Relationship between object size and JRD.
Figure 8: Architectures of MT-JRD prediction models. (a) Independent single-task archi...
Figure 9: The proposed AMT-JRD model, which consists of GFEM, SFEM, AFFM, and multi-task classification heads.
Figure 10: The processing workflow for JRD-based VCM optimization. (a) JRD-based ...
Figure 11: Coding gain comparison among JRD models in VVC. (a) OD, (b) IS, (c) KPD.
Figure 12: Coding gain comparison among JRD models in JPEG. (a) OD, (b) IS, (c) KPD.
Figure 13: Complexity and accuracy of JRD models on the MT-JRD dataset.
Figure 14: Visualized machine analytical results of compressed images from different JRD-based coding methods. The first, second, and third lines correspond to OD, IS, and KPD, ...
Original abstract

Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Multi-Task Just Recognizable Difference (MT-JRD) dataset containing 27,264 machine-generated annotations across object detection, instance segmentation, and keypoint detection tasks. It proposes the Attribute-assisted MT-JRD (AMT-JRD) model that combines Generalized Feature Extraction Module (GFEM), Specialized Feature Extraction Module (SFEM), and Attribute Feature Fusion Module (AFFM) to jointly predict JRD thresholds for multiple tasks by incorporating object size and location priors. The model is then integrated into a Video Coding for Machines (VCM) pipeline to reduce bitrate while preserving task accuracy, with reported results of MAE 3.781 and error variance 5.332, 6.7%/6.3% gains over single-task baselines, and average BD-mAP improvements of 3.861% over VVC and 7.886% over JPEG.

Significance. If the empirical results hold under proper validation, the work provides a concrete step toward multi-task perceptual modeling for VCM, extending single-task JRD approaches with a new dataset and an architecture that fuses task-specific and attribute-based features. The downstream coding gains demonstrate a practical application, and the explicit use of machine-generated annotations as training targets is a clear methodological choice that could be reproduced if the annotation protocol is fully documented.

major comments (3)
  1. [Experimental Results / Dataset Construction] The central performance claims (MAE 3.781, 6.7%/6.3% gains, BD-mAP improvements) rest on the 27,264 annotations being faithful proxies for machine vision thresholds, yet the manuscript provides no information on the specific detectors/segmentors/keypoint models used to generate the labels, the video corpus selection criteria, or any cross-validation across different machine vision backbones. This directly affects whether the GFEM+SFEM+AFFM architecture learns intrinsic perceptual limits or dataset-specific correlations (see results tables and experimental setup).
  2. [Experimental Results] No details are given on train/validation/test splits, whether the held-out evaluation uses the same machine vision models as annotation generation, or any statistical significance testing for the reported error metrics and BD-mAP deltas. Without these, it is impossible to assess whether the multi-task gains are robust or influenced by post-hoc choices (see all quantitative tables and the VCM coding experiments).
  3. [Model Architecture / Ablation Studies] The AFFM module claims to compensate for limitations of image features by injecting object size and location priors, but the paper does not quantify the contribution of this module via ablation (e.g., AMT-JRD without AFFM) or show that the priors are not already implicitly captured by the feature extractors on the collected data.
minor comments (2)
  1. [Abstract / Dataset] The abstract and results sections should explicitly state the number of videos/frames per task and the range of JRD values to allow readers to judge the scale of the reported MAE 3.781.
  2. [Introduction / Method] Notation for the three tasks and the JRD definition should be introduced consistently in the introduction or method section rather than only in the abstract.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of the work. We agree that the manuscript requires additional details on dataset construction, experimental protocols, and ablation studies to strengthen the validation of the reported results. We will revise the paper accordingly.

Point-by-point responses
  1. Referee: [Experimental Results / Dataset Construction] The central performance claims (MAE 3.781, 6.7%/6.3% gains, BD-mAP improvements) rest on the 27,264 annotations being faithful proxies for machine vision thresholds, yet the manuscript provides no information on the specific detectors/segmentors/keypoint models used to generate the labels, the video corpus selection criteria, or any cross-validation across different machine vision backbones. This directly affects whether the GFEM+SFEM+AFFM architecture learns intrinsic perceptual limits or dataset-specific correlations (see results tables and experimental setup).

    Authors: We acknowledge that the current manuscript does not provide sufficient detail on these aspects of dataset construction. In the revised version, we will expand the dataset section to explicitly document the machine vision models used to generate the annotations, the criteria and sources for selecting the video corpus, and any cross-validation experiments performed across alternative backbones. This will allow readers to better evaluate whether the model captures general perceptual thresholds. revision: yes

  2. Referee: [Experimental Results] No details are given on train/validation/test splits, whether the held-out evaluation uses the same machine vision models as annotation generation, or any statistical significance testing for the reported error metrics and BD-mAP deltas. Without these, it is impossible to assess whether the multi-task gains are robust or influenced by post-hoc choices (see all quantitative tables and the VCM coding experiments).

    Authors: We agree that these experimental details are necessary for assessing robustness. The revised manuscript will include the train/validation/test split information, confirmation that held-out evaluation employs the same models as annotation generation, and results from statistical significance testing (such as paired t-tests with reported p-values) on the MAE, variance, and BD-mAP metrics. Updated tables and text will incorporate these elements. revision: yes

  3. Referee: [Model Architecture / Ablation Studies] The AFFM module claims to compensate for limitations of image features by injecting object size and location priors, but the paper does not quantify the contribution of this module via ablation (e.g., AMT-JRD without AFFM) or show that the priors are not already implicitly captured by the feature extractors on the collected data.

    Authors: We recognize the importance of ablation studies to isolate the AFFM's contribution. In the revision, we will add a dedicated ablation analysis comparing the full AMT-JRD model to a variant without the AFFM (and without attribute priors). This will quantify the impact on prediction error and demonstrate whether the size and location priors provide benefits beyond what is implicitly learned by the GFEM and SFEM on the dataset. revision: yes
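The paired t-test mentioned in the second response can be sketched as follows for per-sample errors from two models evaluated on the same test items. This is a minimal illustration with hypothetical numbers; the function name is an assumption, and it returns only the t statistic and degrees of freedom. Turning these into a p-value requires the t distribution, for example from a lookup table or a statistics library such as SciPy's `scipy.stats.ttest_rel`.

```python
import math
from statistics import fmean, stdev

def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences are constant; t is undefined")
    t = fmean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Hypothetical per-image absolute errors, not results from the paper.
baseline = [4.0, 5.0, 3.0, 6.0]
proposed = [3.0, 3.0, 2.0, 3.0]
t, dof = paired_t_statistic(baseline, proposed)
```

Pairing matters here because both models are scored on the same items, so the test looks at per-item differences rather than treating the two error lists as independent samples.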

Circularity Check

0 steps flagged

No significant circularity; derivation rests on new dataset and held-out evaluation

Full rationale

The paper first constructs a fresh MT-JRD dataset of 27,264 machine-generated annotations for object detection, instance segmentation and keypoint detection. It then trains the AMT-JRD model (GFEM + SFEM + AFFM) on this data and reports MAE 3.781 plus BD-mAP gains on held-out test material and downstream VCM coding. No equation or claim reduces by construction to a fitted parameter, self-citation chain, or renamed input; the reported predictions are ordinary supervised outputs from the collected annotations rather than tautological restatements. The chain from data collection through multi-task learning to coding application is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a newly collected machine-annotated dataset and a neural network whose weights are fitted to those annotations; no additional physical constants or closed-form derivations are invoked.

free parameters (1)
  • neural network weights
    All parameters of the AMT-JRD model are learned from the 27,264 JRD annotations.
axioms (1)
  • domain assumption: Machine vision perceptual thresholds for detection, segmentation and keypoint tasks can be reliably captured by image features plus object size and location attributes.
    This assumption underpins both the dataset collection and the design of the Attribute Feature Fusion Module.

pith-pipeline@v0.9.0 · 5656 in / 1406 out tokens · 38314 ms · 2026-05-10T16:39:19.117468+00:00 · methodology


