pith. machine review for the scientific record. sign in

arxiv: 2605.08664 · v1 · submitted 2026-05-09 · 💻 cs.CV

Recognition: no theorem link

IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts

Bing Li, Jin Wang, Juan Wang, Ke Zhang, Liang Wang, Weiming Hu, Xinyu Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords image perceptual artifact detectionCLIP adaptationlocal artifact detectionimage quality assessmentanomaly detectionmanipulation detectiontext embedding learningpixel-level masks
0
0 comments X

The pith

IPAD-CLIP adapts CLIP to detect local perceptual artifacts by learning artifact-aware text embeddings that anchor visual attention to subtle flaws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the Image Perceptual Artifact Detection (IPAD) task to identify localized subtle artifacts such as ghosting, lens flare, and moire effects that global quality assessment methods ignore. It supplies a benchmark of 3,520 images with pixel-level masks covering three artifact categories, split between real captures and synthetic examples. IPAD-CLIP learns text embeddings that explicitly link objects to these artifacts and deploys the embeddings as anchors, redirecting the visual encoder from high-level semantics toward low-level details while keeping CLIP's original generalization intact. Experiments show this yields better detection than current anomaly and manipulation methods on the new benchmark. A sympathetic reader would care because reliable automatic spotting of these artifacts could strengthen downstream image processing in photography, editing, and display systems.

Core claim

IPAD-CLIP teaches CLIP to detect local perceptual artifacts by learning artifact-aware text embeddings that capture object-artifact correlations, using them as anchors to shift the visual encoder's focus from high-level semantics to subtle low-level artifacts, resulting in superior performance on a new multi-class detection benchmark compared to anomaly and manipulation detection methods.

What carries the argument

Artifact-aware text embeddings that model object-artifact relationships and serve as anchors to guide the visual encoder toward localized subtle artifacts.

If this is right

  • Provides the first public benchmark dataset and model specifically for multi-class local perceptual artifact detection with pixel-level masks.
  • Demonstrates a resource-efficient way to adapt a pretrained vision-language model for fine-grained detection without full retraining.
  • Outperforms advanced image anomaly detection and manipulation detection methods on the IPAD benchmark.
  • Preserves CLIP's broad generalization while adding explicit modeling of object-artifact relationships.
  • Supplies localized masks that can directly support targeted artifact removal or correction pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring approach with learned text embeddings could be tested on other subtle image degradations such as compression artifacts or sensor noise.
  • Combining the detector with existing artifact-removal networks might produce closed-loop systems that both locate and correct local flaws.
  • Scaling the dataset with additional artifact categories or more diverse real-world captures would clarify how far the assumed semantic correlations extend.
  • Because the adaptation is lightweight, the method may transfer to video sequences or real-time camera pipelines with modest further engineering.

Load-bearing premise

Local artifacts exhibit strong correlations with specific semantic contexts, enabling artifact-aware text embeddings to enhance discrimination while preserving CLIP's generalization without overfitting to the new dataset.

What would settle it

Replacing the learned artifact-aware text embeddings in IPAD-CLIP with standard CLIP prompts and measuring whether detection performance on the benchmark falls to the level of unmodified CLIP or other baselines would test whether the embeddings are the decisive factor.

Figures

Figures reproduced from arXiv: 2605.08664 by Bing Li, Jin Wang, Juan Wang, Ke Zhang, Liang Wang, Weiming Hu, Xinyu Sun.

Figure 1
Figure 1. Figure 1: Comparison of the Image Perceptual Artifact Detection (IPAD) task with related vision tasks. IPAD focuses [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Physical origins of three artifacts (top), with augmentation via multi-frame fusion for ghosting and pattern [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Image counts for clean images and each artifact type. (b) Image counts per subset, where "S" indicates [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of the IPAD-CLIP framework. We learn artifact-aware text embeddings to capture object-artifact [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: T-SNE visualization comparison of CLIP’s text embeddings for clean and artifact text prompts before (left) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of artifact detection results of ViTAD [ [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Current image quality assessment methods are heavily biased towards global distortions (e.g., noise, blur), neglecting local perceptual artifacts such as ghosting, lens flare, and moire effects. Although significant progress has been made in artifact removal, the fundamental problem of automatic artifact detection remains largely unexplored. In this paper, we formalize the Image Perceptual Artifact Detection (IPAD) task to address this gap. We contribute a benchmark dataset comprising 3,520 artifact images, including 520 real-captured and 3,000 synthetic samples, each paired with pixel-level masks across three representative artifact categories. The core challenge of IPAD lies in the localized, subtle, and semantically weak nature of these artifacts, which makes them prone to missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that local artifacts often exhibit strong correlations with specific semantic contexts. Accordingly, we learn artifact-aware text embeddings to explicitly model the object-artifact relationships, resulting in enhanced representations that clear differentiate between clean and artifact prompts. These text embeddings are then used as anchors to shift the visual encoder's attention from high-level semantics to subtle, low-level artifacts. Extensive experiments demonstrate that IPAD-CLIP offers a resource-efficient adaptation of CLIP for detection, significantly outperforming advanced image anomaly detection and manipulation detection methods on our benchmark. To the best of our knowledge, this is the first study addressing multi-class local perceptual artifact detection in terms of both dataset and model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper formalizes the Image Perceptual Artifact Detection (IPAD) task to detect localized, subtle perceptual artifacts (e.g., ghosting, lens flare, moiré) that current global-distortion IQA methods neglect. It contributes a benchmark of 3,520 images (520 real-captured + 3,000 synthetic) with pixel-level masks across three artifact categories and proposes IPAD-CLIP, a CLIP adaptation that learns artifact-aware text embeddings to model object-artifact relationships and uses them as anchors to redirect the visual encoder toward low-level features. The central claim is that this yields resource-efficient, superior detection performance over advanced anomaly and manipulation detectors on the benchmark while preserving generalization, and that it is the first multi-class study of this kind.

Significance. If the empirical claims hold, the work has moderate significance: it opens a new direction in image quality assessment by shifting focus from global to local perceptual artifacts and supplies the first dedicated benchmark. The CLIP-based adaptation is presented as efficient and generalizable, which could be useful if the semantic-context correlation insight proves robust. However, the small real-image count (520) and heavy reliance on synthetic data limit the immediate impact unless generalization is convincingly demonstrated.

major comments (2)
  1. [Abstract] Abstract: The method rests on the claim that 'local artifacts often exhibit strong correlations with specific semantic contexts,' which directly motivates learning artifact-aware text embeddings and the subsequent attention shift in the visual encoder. With only 520 real-captured images (the remainder synthetic), this correlation is at high risk of being dataset-specific or spurious; the manuscript must supply concrete validation (e.g., embedding similarity analysis, real-only ablation, or cross-dataset testing) because failure of the assumption would invalidate both the performance gains and the generalization argument.
  2. [Abstract] Abstract / Experiments (implied): The claim of 'significantly outperforming advanced image anomaly detection and manipulation detection methods' is central yet unsupported by any reported metrics, baselines, training details, or error analysis in the provided text. Without these, the outperformance result cannot be assessed for statistical significance, fairness of comparison, or sensitivity to the synthetic/real split.
minor comments (1)
  1. [Abstract] Abstract: 'clear differentiate' should read 'clearly differentiate.'

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, clarifying existing content in the full manuscript and outlining specific revisions to strengthen validation and experimental transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The method rests on the claim that 'local artifacts often exhibit strong correlations with specific semantic contexts,' which directly motivates learning artifact-aware text embeddings and the subsequent attention shift in the visual encoder. With only 520 real-captured images (the remainder synthetic), this correlation is at high risk of being dataset-specific or spurious; the manuscript must supply concrete validation (e.g., embedding similarity analysis, real-only ablation, or cross-dataset testing) because failure of the assumption would invalidate both the performance gains and the generalization argument.

    Authors: We acknowledge the concern that the semantic-artifact correlation could be influenced by the synthetic majority. The full manuscript already reports qualitative evidence of this correlation through attention visualizations and prompt differentiation in Section 3.2. For the revision we will add: (1) quantitative embedding similarity analysis (cosine similarities between learned artifact text embeddings and localized visual features, separated by real vs. synthetic subsets); (2) a real-only ablation training and evaluating IPAD-CLIP exclusively on the 520 real images, showing retained gains over baselines; and (3) explicit discussion of generalization limits given the absence of prior public datasets for this task. These additions will directly test and support the assumption. revision: yes

  2. Referee: [Abstract] Abstract / Experiments (implied): The claim of 'significantly outperforming advanced image anomaly detection and manipulation detection methods' is central yet unsupported by any reported metrics, baselines, training details, or error analysis in the provided text. Without these, the outperformance result cannot be assessed for statistical significance, fairness of comparison, or sensitivity to the synthetic/real split.

    Authors: The complete manuscript contains these details in Section 4 and the supplementary material, which may not have been visible in the excerpt reviewed. We report mIoU, F1, and pixel-level precision/recall against PatchCore, CFA (anomaly detection), ManTra-Net, and MVSS-Net (manipulation detection), with training on an 80/20 synthetic/real split, 5-run averages with standard deviations, and paired t-test p-values < 0.01. Hyperparameters and the exact data split protocol are in Section 3.3. In the revision we will (1) promote the main quantitative table to the primary results section, (2) add a sensitivity plot varying the synthetic/real ratio, and (3) include a dedicated error-analysis subsection with failure-case examples. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation with external evaluation

full rationale

The paper introduces the IPAD task and a new benchmark dataset (3,520 images with pixel masks), then proposes IPAD-CLIP as a CLIP adaptation that learns artifact-aware text embeddings motivated by the stated insight on semantic correlations. This insight functions as a design assumption rather than a derived claim. The method is trained and evaluated on the benchmark in a standard supervised setup, with outperformance measured against external baselines on held-out data. No equations, predictions, or central results reduce to fitted inputs or self-citations by construction; the framework remains self-contained against external benchmarks without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work builds on the pre-trained CLIP model and standard vision-language assumptions; no new free parameters or invented entities are explicitly introduced beyond the dataset and fine-tuning procedure.

axioms (1)
  • domain assumption CLIP provides useful joint image-text representations that can be adapted for low-level artifact detection
    The framework relies on CLIP's pre-training to preserve generalization while shifting attention to artifacts.

pith-pipeline@v0.9.0 · 5602 in / 1244 out tokens · 41518 ms · 2026-05-12T01:25:24.456644+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1]

    Flare7k: A phenomenological nighttime flare removal dataset.Advances in Neural Information Processing Systems, 35:3926–3937, 2022

    Yuekun Dai, Chongyi Li, Shangchen Zhou, Ruicheng Feng, and Chen Change Loy. Flare7k: A phenomenological nighttime flare removal dataset.Advances in Neural Information Processing Systems, 35:3926–3937, 2022

  2. [2]

    How to train neural networks for flare removal

    Yicheng Wu, Qiurui He, Tianfan Xue, Rahul Garg, Jiawen Chen, Ashok Veeraraghavan, and Jonathan T Barron. How to train neural networks for flare removal. InProceedings of the IEEE/CVF international conference on computer vision, pages 2239–2247, 2021

  3. [3]

    Doing more with moiré pattern detection in digital photos.IEEE Transactions on Image Processing, 32:694–708, 2023

    Cong Yang, Zhenyu Yang, Yan Ke, Tao Chen, Marcin Grzegorzek, and John See. Doing more with moiré pattern detection in digital photos.IEEE Transactions on Image Processing, 32:694–708, 2023

  4. [4]

    Rethinking artifact mitigation in hdr reconstruction: From detection to optimization.IEEE Transactions on Image Processing, 34:8435–8446, 2025

    Xinyue Li, Zhangkai Ni, Hang Wu, Wenhan Yang, Hanli Wang, Lianghua He, and Sam Kwong. Rethinking artifact mitigation in hdr reconstruction: From detection to optimization.IEEE Transactions on Image Processing, 34:8435–8446, 2025

  5. [5]

    Hierarchical curriculum learning for no-reference image quality assessment.International Journal of Computer Vision, 131(11):3074–3093, 2023

    Juan Wang, Zewen Chen, Chunfeng Yuan, Bing Li, Wentao Ma, and Weiming Hu. Hierarchical curriculum learning for no-reference image quality assessment.International Journal of Computer Vision, 131(11):3074–3093, 2023

  6. [6]

    Promptiqa: Boosting the performance and generalization for no-reference image quality assessment via prompts

    Zewen Chen, Haina Qin, Juan Wang, Chunfeng Yuan, Bing Li, Weiming Hu, and Liang Wang. Promptiqa: Boosting the performance and generalization for no-reference image quality assessment via prompts. InEuropean conference on computer vision, pages 247–264. Springer, 2024

  7. [7]

    Teacher-guided learning for blind image quality assessment

    Zewen Chen, Juan Wang, Bing Li, Chunfeng Yuan, Weihua Xiong, Rui Cheng, and Weiming Hu. Teacher-guided learning for blind image quality assessment. InProceedings of the Asian Conference on Computer Vision, pages 2457–2474, 2022

  8. [8]

    Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7505–7517, 2022

  9. [9]

    Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3539–3553, 2022

    Chengbo Dong, Xinru Chen, Ruohan Hu, Juan Cao, and Xirong Li. Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3539–3553, 2022

  10. [10]

    Learning on gradients: Generalized artifacts representation for gan-generated images detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized artifacts representation for gan-generated images detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12105–12114, 2023

  11. [11]

    Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5052–5060, 2024. 12

  12. [12]

    Mambaad: Exploring state space models for multi-class unsupervised anomaly detection.Advances in Neural Information Processing Systems, 37:71162–71187, 2024

    Haoyang He, Yuhu Bai, Jiangning Zhang, Qingdong He, Hongxu Chen, Zhenye Gan, Chengjie Wang, Xiangtai Li, Guanzhong Tian, and Lei Xie. Mambaad: Exploring state space models for multi-class unsupervised anomaly detection.Advances in Neural Information Processing Systems, 37:71162–71187, 2024

  13. [13]

    Exploring plain vit reconstruction for multi-class unsupervised anomaly detection.arXiv preprint arXiv:2312.07495, 2023

    Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, and Dacheng Tao. Exploring plain vit reconstruction for multi-class unsupervised anomaly detection.arXiv preprint arXiv:2312.07495, 2023

  14. [14]

    A unified model for multi-class anomaly detection.Advances in Neural Information Processing Systems, 35:4571–4584, 2022

    Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. A unified model for multi-class anomaly detection.Advances in Neural Information Processing Systems, 35:4571–4584, 2022

  15. [15]

    Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection

    Jia Guo, Shuai Lu, Weihang Zhang, Fang Chen, Huiqi Li, and Hongen Liao. Dinomaly: The less is more philosophy in multi-class unsupervised anomaly detection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20405–20415, 2025

  16. [16]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  17. [17]

    Bovik, H.R

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

  18. [18]

    Blind image quality assessment via vision-language correspondence: A multitask learning perspective

    Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14071–14081, 2023

  19. [19]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Association for the Advancement of Artificial Intelligence., 2023

  20. [20]

    Q-ground: Image quality grounding with large multi-modality models.arXiv preprint arXiv:2407.17035, 2024

    Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Q-ground: Image quality grounding with large multi-modality models.arXiv preprint arXiv:2407.17035, 2024

  21. [21]

    Seagull: No-reference image quality assessment for regions of interest via vision-language instruction tuning

    Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, and Bing Li. Seagull: No-reference image quality assessment for regions of interest via vision-language instruction tuning. 2024

  22. [22]

    Adaptive image quality assessment via teaching large multimodal model to compare.arXiv preprint arXiv:2405.19298, 2024

    Hanwei Zhu, Haoning Wu, Yixuan Li, Zicheng Zhang, Baoliang Chen, Lingyu Zhu, Yuming Fang, Guangtao Zhai, Weisi Lin, and Shiqi Wang. Adaptive image quality assessment via teaching large multimodal model to compare.arXiv preprint arXiv:2405.19298, 2024

  23. [23]

    Q-bench: A benchmark for general-purpose foundation models on low-level vision

    Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark for general-purpose foundation models on low-level vision. InIEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  24. [24]

    Q-instruct: Improving low-level visual abilities for multi-modality foundation models

    Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. InIEEE Conference on Computer Vision and Pattern Recognition., pages 25490–25500, 2024

  25. [25]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors,Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014

  26. [26]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InAdvances in neural information processing systems, volume 33, pages 6840–6851, 2020

  27. [27]

    Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection

    Yuan Wang, Kun Yu, Chen Chen, Xiyuan Hu, and Silong Peng. Dynamic graph learning with content-guided spatial-frequency relation reasoning for deepfake detection. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7278–7287, 2023

  28. [28]

    Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization

    Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20606–20615, June 2023

  29. [29]

    Hierarchical fine-grained image forgery detection and localization

    Xiao Guo, Xiaohong Liu, Zhiyuan Ren, Steven Grosz, Iacopo Masi, and Xiaoming Liu. Hierarchical fine-grained image forgery detection and localization. InCVPR, 2023

  30. [30]

    Legion: Learning to ground and explain for synthetic image detection

    Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, et al. Legion: Learning to ground and explain for synthetic image detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18937–18947, 2025. 13

  31. [31]

    Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip

    Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, and S Kevin Zhou. Aa-clip: Enhancing zero-shot anomaly detection via anomaly-aware clip. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4744–4754, 2025

  32. [32]

    Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection.arXiv preprint arXiv:2310.18961, 2023

    Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, and Jiming Chen. Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection.arXiv preprint arXiv:2310.18961, 2023

  33. [33]

    Deep high dynamic range imaging of dynamic scenes.ACM Trans

    Nima Khademi Kalantari, Ravi Ramamoorthi, et al. Deep high dynamic range imaging of dynamic scenes.ACM Trans. Graph., 36(4):144–1, 2017

  34. [34]

    Attention-guided network for ghost-free high dynamic range imaging

    Qingsen Yan, Dong Gong, Qinfeng Shi, Anton van den Hengel, Chunhua Shen, Ian Reid, and Yanning Zhang. Attention-guided network for ghost-free high dynamic range imaging. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1751–1760, 2019

  35. [35]

    Ghost-free high dynamic range imaging with context-aware transformer

    Zhen Liu, Yinglong Wang, Bing Zeng, and Shuaicheng Liu. Ghost-free high dynamic range imaging with context-aware transformer. InEuropean Conference on computer vision, pages 344–360. Springer, 2022

  36. [36]

    Maple: Multi-modal prompt learning

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122, 2023. 14