Multi-task Just Recognizable Difference for Video Coding for Machines: Database, Model, and Coding Application
Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3
The pith
An attribute-assisted multi-task model predicts just recognizable differences across three machine vision tasks to support efficient video coding for machines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that the AMT-JRD model, trained on the new MT-JRD dataset and built from the GFEM, SFEM, and AFFM modules, achieves a mean absolute error of 3.781 and an error variance of 5.332 across the three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Applied to VCM, it delivers average BD-mAP improvements of 3.861% over VVC and 7.886% over JPEG.
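For readers who want to see how two error figures of this kind are typically computed, here is a minimal Python sketch. It assumes that "error variance" means the population variance of the per-object absolute errors; this page does not state the paper's exact definition. The function name and the sample values are hypothetical.

```python
from statistics import fmean, pvariance


def mae_and_error_variance(predicted, target):
    """Return (MAE, variance of the absolute errors) for paired JRD values.

    Assumption: "error variance" is the population variance of |pred - target|.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(p - t) for p, t in zip(predicted, target)]
    return fmean(abs_errors), pvariance(abs_errors)


# Hypothetical predicted and annotated JRD values (not taken from the paper).
pred = [30.0, 35.0, 41.0, 28.0]
true = [32.0, 35.0, 38.0, 27.0]
mae, var = mae_and_error_variance(pred, true)
print(f"MAE = {mae:.3f}, error variance = {var:.3f}")
```

Reporting both numbers separates average accuracy from consistency: two predictors with the same MAE can differ in how widely their errors spread.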
What carries the argument
The Attribute-assisted Multi-Task JRD (AMT-JRD) prediction model, which uses Generalized Feature Extraction Module (GFEM), Specialized Feature Extraction Module (SFEM), and Attribute Feature Fusion Module (AFFM) to jointly estimate object-wise JRDs by incorporating prior object size and location knowledge.
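The division of labour among the three modules can be pictured with a toy forward pass. This is only a sketch under stated assumptions: one shared projection stands in for GFEM, one projection per task stands in for SFEM, and concatenating an embedding of object size and location stands in for AFFM. The class name, layer sizes, random weights, and the three-element attribute vector are all hypothetical; the paper's actual layers are not described on this page.

```python
import numpy as np

TASKS = ("detection", "segmentation", "keypoint")


def relu(x):
    return np.maximum(x, 0.0)


class ToyMultiTaskJRD:
    """Hypothetical stand-in for a shared + specialized + attribute-fusion predictor."""

    def __init__(self, feat_dim=16, hidden=8, attr_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        # Shared projection (plays the role of a generalized extractor).
        self.w_shared = rng.normal(0.0, 0.1, (feat_dim, hidden))
        # One projection per task (plays the role of specialized extractors).
        self.w_task = {t: rng.normal(0.0, 0.1, (hidden, hidden)) for t in TASKS}
        # Embedding of object attributes (size and location priors).
        self.w_attr = rng.normal(0.0, 0.1, (attr_dim, hidden))
        # One regression head per task on the fused features.
        self.w_out = {t: rng.normal(0.0, 0.1, (2 * hidden, 1)) for t in TASKS}

    def predict(self, image_feat, attrs):
        shared = relu(image_feat @ self.w_shared)
        attr_emb = relu(attrs @ self.w_attr)
        out = {}
        for t in TASKS:
            special = relu(shared @ self.w_task[t])
            # Fusion by concatenation is an assumption made for this sketch.
            fused = np.concatenate([special, attr_emb], axis=-1)
            out[t] = (fused @ self.w_out[t]).squeeze(-1)
        return out


# Four hypothetical objects: random image features plus
# (relative area, centre x, centre y) attributes in [0, 1].
feats = np.random.default_rng(1).normal(size=(4, 16))
attrs = np.array([
    [0.10, 0.50, 0.50],
    [0.02, 0.10, 0.90],
    [0.40, 0.55, 0.45],
    [0.005, 0.80, 0.20],
])
preds = ToyMultiTaskJRD().predict(feats, attrs)
for task, values in preds.items():
    print(task, values.shape)
```

The point of the sketch is structural: every task reuses the same shared features, while the attribute embedding gives each head information that image features alone may not expose.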
If this is right
- The predicted JRDs can be used to reduce coding bit rates in VCM pipelines while keeping accuracy high across multiple tasks simultaneously.
- Object attribute fusion compensates for limitations of image features alone, leading to more robust threshold estimates.
- The same model architecture supports joint optimization for object detection, instance segmentation, and keypoint detection without separate predictors.
- Integration with existing codecs like VVC yields measurable BD-mAP gains over both VVC and JPEG baselines.
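The BD-mAP figures above are Bjontegaard-delta style numbers. A common way to compute such a delta is to fit a cubic polynomial of the metric against log bitrate for each codec and compare the average fitted value over the overlapping bitrate range. The sketch below follows that generic recipe; the function name and the rate/mAP points are hypothetical, and the paper's exact evaluation script may differ.

```python
import numpy as np


def bd_metric(rate_anchor, metric_anchor, rate_test, metric_test):
    """Average metric gain of the test codec over the anchor codec.

    Generic Bjontegaard-delta recipe: cubic fit of metric vs. log10(rate),
    integrated over the shared log-rate interval.
    """
    log_a = np.log10(np.asarray(rate_anchor, dtype=float))
    log_t = np.log10(np.asarray(rate_test, dtype=float))
    poly_a = np.polyfit(log_a, np.asarray(metric_anchor, dtype=float), 3)
    poly_t = np.polyfit(log_t, np.asarray(metric_test, dtype=float), 3)

    lo = max(log_a.min(), log_t.min())
    hi = min(log_a.max(), log_t.max())
    if hi <= lo:
        raise ValueError("rate ranges do not overlap")

    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return float(avg_t - avg_a)


# Hypothetical rate (kbps) and mAP (%) points, four per codec.
rates = [100, 200, 400, 800]
anchor_map = [30.0, 35.0, 38.0, 40.0]
test_map = [m + 2.0 for m in anchor_map]  # uniform +2.0 shift for illustration
delta = bd_metric(rates, anchor_map, rates, test_map)
print(f"BD-mAP = {delta:.3f}")  # close to 2.0 for a uniform +2.0 shift
```

A positive value means the test codec reaches higher mAP than the anchor at equal bitrate, averaged over the range where both curves are defined.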
Where Pith is reading between the lines
- The attribute-fusion idea could be tested on video sequences with motion to see if temporal object attributes further improve JRD accuracy.
- The dataset construction method might scale to collect labels for additional machine tasks such as action recognition or tracking.
- If the multi-task predictions generalize, they could serve as a starting point for standardized perceptual models in future machine-oriented compression standards.
Load-bearing premise
The 27,264 machine-generated JRD annotations collected for the three tasks represent the perceptual behavior of real-world machine vision systems on unseen content and additional tasks.
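To make concrete what a machine-generated JRD annotation could look like, the following sketch derives one label from a sweep over distortion levels. It assumes the label is the lowest distortion level at which the machine no longer recognizes the object. That rule, the function name, and the example levels are assumptions for illustration; this page does not document the paper's annotation protocol.

```python
from typing import Optional, Sequence


def first_failure_level(levels: Sequence[int],
                        recognized: Sequence[bool]) -> Optional[int]:
    """Return the lowest distortion level at which recognition fails.

    Assumed labelling rule (hypothetical): the JRD is the first level, in
    increasing order of distortion, where the task output is no longer correct.
    Returns None if the object is recognized at every tested level.
    """
    if len(levels) != len(recognized):
        raise ValueError("levels and recognized must have equal length")
    for level, ok in sorted(zip(levels, recognized)):
        if not ok:
            return level
    return None


# Hypothetical sweep: distortion levels and whether a detector still
# finds the object at each level.
levels = [22, 27, 32, 37, 42]
recognized = [True, True, True, False, False]
label = first_failure_level(levels, recognized)
print(label)
```

Under this rule the label depends on the specific machine vision model used in the sweep, which is why the choice of annotating models matters for the premise above.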
What would settle it
Applying the AMT-JRD predictions to new video sequences or to a fourth machine task and finding no BD-mAP gain, or a drop in task accuracy, would show that the claim does not hold.
Original abstract
Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single-task scenario. To address this issue, we propose a Multi-Task JRD (MT-JRD) dataset and an Attribute-assisted MT-JRD (AMT-JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks including object detection, instance segmentation, and keypoint detection. Secondly, we propose the AMT-JRD prediction model, which integrates Generalized Feature Extraction Module (GFEM) and Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Thirdly, we innovatively incorporate object attribute information into object-wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design effectively compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT-JRD model to VCM, where the accurately predicted JRDs are applied to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT-JRD achieves precise and robust multi-task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state-of-the-art single-task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that compared to the baseline VVC and JPEG, the AMT-JRD-based VCM improves an average of 3.861% and 7.886% Bjontegaard Delta-mean Average Precision (BD-mAP), respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Multi-Task Just Recognizable Difference (MT-JRD) dataset containing 27,264 machine-generated annotations across object detection, instance segmentation, and keypoint detection tasks. It proposes the Attribute-assisted MT-JRD (AMT-JRD) model that combines Generalized Feature Extraction Module (GFEM), Specialized Feature Extraction Module (SFEM), and Attribute Feature Fusion Module (AFFM) to jointly predict JRD thresholds for multiple tasks by incorporating object size and location priors. The model is then integrated into a Video Coding for Machines (VCM) pipeline to reduce bitrate while preserving task accuracy, with reported results of MAE 3.781 and error variance 5.332, 6.7%/6.3% gains over single-task baselines, and average BD-mAP improvements of 3.861% over VVC and 7.886% over JPEG.
Significance. If the empirical results hold under proper validation, the work provides a concrete step toward multi-task perceptual modeling for VCM, extending single-task JRD approaches with a new dataset and an architecture that fuses task-specific and attribute-based features. The downstream coding gains demonstrate a practical application, and the explicit use of machine-generated annotations as training targets is a clear methodological choice that could be reproduced if the annotation protocol is fully documented.
major comments (3)
- [Experimental Results / Dataset Construction] The central performance claims (MAE 3.781, 6.7%/6.3% gains, BD-mAP improvements) rest on the 27,264 annotations being faithful proxies for machine vision thresholds, yet the manuscript provides no information on the specific detectors/segmentors/keypoint models used to generate the labels, the video corpus selection criteria, or any cross-validation across different machine vision backbones. This directly affects whether the GFEM+SFEM+AFFM architecture learns intrinsic perceptual limits or dataset-specific correlations (see results tables and experimental setup).
- [Experimental Results] No details are given on train/validation/test splits, whether the held-out evaluation uses the same machine vision models as annotation generation, or any statistical significance testing for the reported error metrics and BD-mAP deltas. Without these, it is impossible to assess whether the multi-task gains are robust or influenced by post-hoc choices (see all quantitative tables and the VCM coding experiments).
- [Model Architecture / Ablation Studies] The AFFM module claims to compensate for limitations of image features by injecting object size and location priors, but the paper does not quantify the contribution of this module via ablation (e.g., AMT-JRD without AFFM) or show that the priors are not already implicitly captured by the feature extractors on the collected data.
minor comments (2)
- [Abstract / Dataset] The abstract and results sections should explicitly state the number of videos/frames per task and the range of JRD values to allow readers to judge the scale of the reported MAE 3.781.
- [Introduction / Method] Notation for the three tasks and the JRD definition should be introduced consistently in the introduction or method section rather than only in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of the work. We agree that the manuscript requires additional details on dataset construction, experimental protocols, and ablation studies to strengthen the validation of the reported results. We will revise the paper accordingly.
- Referee: [Experimental Results / Dataset Construction] The central performance claims (MAE 3.781, 6.7%/6.3% gains, BD-mAP improvements) rest on the 27,264 annotations being faithful proxies for machine vision thresholds, yet the manuscript provides no information on the specific detectors/segmentors/keypoint models used to generate the labels, the video corpus selection criteria, or any cross-validation across different machine vision backbones. This directly affects whether the GFEM+SFEM+AFFM architecture learns intrinsic perceptual limits or dataset-specific correlations (see results tables and experimental setup).
  Authors: We acknowledge that the current manuscript does not provide sufficient detail on these aspects of dataset construction. In the revised version, we will expand the dataset section to explicitly document the machine vision models used to generate the annotations, the criteria and sources for selecting the video corpus, and any cross-validation experiments performed across alternative backbones. This will allow readers to better evaluate whether the model captures general perceptual thresholds.
  revision: yes
- Referee: [Experimental Results] No details are given on train/validation/test splits, whether the held-out evaluation uses the same machine vision models as annotation generation, or any statistical significance testing for the reported error metrics and BD-mAP deltas. Without these, it is impossible to assess whether the multi-task gains are robust or influenced by post-hoc choices (see all quantitative tables and the VCM coding experiments).
  Authors: We agree that these experimental details are necessary for assessing robustness. The revised manuscript will include the train/validation/test split information, confirmation that held-out evaluation employs the same models as annotation generation, and results from statistical significance testing (such as paired t-tests with reported p-values) on the MAE, variance, and BD-mAP metrics. Updated tables and text will incorporate these elements.
  revision: yes
- Referee: [Model Architecture / Ablation Studies] The AFFM module claims to compensate for limitations of image features by injecting object size and location priors, but the paper does not quantify the contribution of this module via ablation (e.g., AMT-JRD without AFFM) or show that the priors are not already implicitly captured by the feature extractors on the collected data.
  Authors: We recognize the importance of ablation studies to isolate the AFFM's contribution. In the revision, we will add a dedicated ablation analysis comparing the full AMT-JRD model to a variant without the AFFM (and without attribute priors). This will quantify the impact on prediction error and demonstrate whether the size and location priors provide benefits beyond what is implicitly learned by the GFEM and SFEM on the dataset.
  revision: yes
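The paired t-tests mentioned in the second response compare two predictors on the same test objects. As a rough illustration, the sketch below computes the paired t statistic and its degrees of freedom with the standard library only. The error values are hypothetical, and turning the statistic into a p-value needs a Student-t CDF (for example `scipy.stats.ttest_rel`, if SciPy is available).

```python
import math
from statistics import fmean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired samples errors_a vs. errors_b.

    A positive t means errors_a is larger on average than errors_b.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two paired samples with at least 2 items each")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = fmean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-object absolute errors for a baseline and a proposed model.
baseline_err = [4.0, 5.0, 3.0, 6.0]
proposed_err = [3.0, 3.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(baseline_err, proposed_err)
print(f"t = {t_stat:.4f}, dof = {dof}")
```

Pairing matters here because both models are scored on identical objects, so per-object difficulty cancels out in the differences.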
Circularity Check
No significant circularity; derivation rests on new dataset and held-out evaluation
The paper first constructs a fresh MT-JRD dataset of 27,264 machine-generated annotations for object detection, instance segmentation and keypoint detection. It then trains the AMT-JRD model (GFEM + SFEM + AFFM) on this data and reports MAE 3.781 plus BD-mAP gains on held-out test material and downstream VCM coding. No equation or claim reduces by construction to a fitted parameter, self-citation chain, or renamed input; the reported predictions are ordinary supervised outputs from the collected annotations rather than tautological restatements. The chain from data collection through multi-task learning to coding application is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- domain assumption: Machine vision perceptual thresholds for detection, segmentation, and keypoint tasks can be reliably captured by image features plus object size and location attributes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Just Recognizable Difference (JRD) ... minimal perceptual threshold that significantly influences machine vision performance
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  AMT-JRD prediction model ... GFEM, SFEM, AFFM
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.