arxiv: 2607.01535 · v1 · pith:YLITVJ2Xnew · submitted 2026-07-01 · 💻 cs.CV

Hidden-Shot: Towards One-Shot Task Generalization for Low-Level Vision Generalist Models

Pith reviewed 2026-07-03 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords one-shot task generalization · low-level vision · generalist models · implicit prompts · task adaptation · image processing · computer vision · few-shot learning

The pith

Hidden-Shot lets low-level vision generalist models adapt to new tasks from one example image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a vision generalist model can handle entirely new low-level tasks after seeing only one example image, without needing task labels or big changes to its structure. It does this by pulling out hidden task details from that image, wrapping them in a global textural prompt, and blending them selectively into the model's normal processing. A sympathetic reader would care because most current generalist models stay locked to the tasks they saw during training, so any real-world shift to a fresh problem like a different kind of noise or blur forces full retraining. The authors also supply two new evaluation setups, 3C4U and 3C7U, that mix familiar and unfamiliar tasks to measure whether generalization actually occurs. If the approach holds, generalist models could move from narrow specialists to tools that pick up fresh image-processing jobs on the fly.

Core claim

The central claim is that the Hidden-Shot implicit prompt mechanism extracts visual task-based information from a single example image, applies a global task-aware textural prompt, and selectively merges the implicit information with in-task processing information, allowing the original generalist model to achieve one-shot generalization on unseen low-level tasks through direct, low-cost injection with minimal architectural change. Experiments on seven and ten datasets show the method beats the prior state-of-the-art generalist under the 3C4U framework for retraining and the 3C7U framework for training from scratch, while keeping performance steady on the original tasks.
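
To make the "selectively merges" step concrete, here is a minimal sketch in plain Python. The summary above does not specify the merging operator, so the per-channel sigmoid gate, the name `selective_merge`, and the toy vectors are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def selective_merge(in_task, implicit, gate_logits):
    """Blend two equal-length feature vectors channel by channel.

    ASSUMPTION: gate_logits stand in for hypothetical learned parameters.
    A large positive logit lets the implicit (example-derived) signal
    dominate a channel; a large negative logit keeps the in-task feature.
    """
    if not (len(in_task) == len(implicit) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    merged = []
    for feature, prompt, logit in zip(in_task, implicit, gate_logits):
        weight = sigmoid(logit)
        merged.append((1.0 - weight) * feature + weight * prompt)
    return merged


# Toy values only: three channels with a closed, half-open, and open gate.
in_task_features = [0.2, -1.0, 0.5]
implicit_prompt = [1.0, 1.0, 1.0]
gates = [-20.0, 0.0, 20.0]
merged = selective_merge(in_task_features, implicit_prompt, gates)
```

With these toy gates the first channel stays close to the in-task feature, the second is an even average, and the third is taken almost entirely from the implicit signal. A gate of this kind is one common way to inject side information without changing the base architecture, which is the property the paper claims.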

What carries the argument

Hidden-Shot, the implicit prompt mechanism that extracts task information from one image and merges it into the generalist model's processing via a task-aware textural prompt.

If this is right

  • The model outperforms prior generalist models on one-shot adaptation to four unconventional tasks after retraining on three conventional ones.
  • It also outperforms on seven unconventional tasks when trained from scratch alongside three conventional ones.
  • Performance on the original trained tasks stays consistent rather than degrading.
  • The approach works through cost-effective direct injection that leaves the base architecture almost untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same single-example extraction idea could let deployed models accept user-provided reference images for quick custom adaptation in editing tools.
  • The C/U assessment framework supplies a concrete template other researchers could reuse to test generalization claims in future generalist work.
  • If implicit task signals prove reliable, similar merging steps might reduce reliance on explicit task identifiers across a wider range of vision architectures.
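
As a sketch of what reusing the C/U template might look like, the snippet below encodes a suite as a small data structure. The class name `CUSuite`, the helper `make_suite`, and the placeholder task names are hypothetical; the summary above does not list the actual tasks. It also assumes one dataset per task, which is consistent with the reported seven and ten datasets but is not stated outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CUSuite:
    """A C/U evaluation suite: conventional (seen) vs. unconventional (unseen) tasks."""
    name: str
    conventional: tuple
    unconventional: tuple
    regime: str

    def __post_init__(self):
        # A task must not be counted as both seen and unseen.
        overlap = set(self.conventional) & set(self.unconventional)
        if overlap:
            raise ValueError(f"tasks in both groups: {sorted(overlap)}")

    @property
    def num_datasets(self):
        # ASSUMPTION: one dataset per task.
        return len(self.conventional) + len(self.unconventional)


def make_suite(n_conv, n_unconv, regime):
    """Build a suite with placeholder task names (the real tasks are not listed here)."""
    conv = tuple(f"conventional_{i + 1}" for i in range(n_conv))
    unconv = tuple(f"unconventional_{i + 1}" for i in range(n_unconv))
    return CUSuite(f"{n_conv}C{n_unconv}U", conv, unconv, regime)


suite_3c4u = make_suite(3, 4, "retrain an existing model")
suite_3c7u = make_suite(3, 7, "train from scratch")
```

Keeping the two groups disjoint by construction is the point of the template: any score reported on the unconventional group is then, by definition, a score on tasks outside the training set.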

Load-bearing premise

The implicit visual task-based information extracted from a single example image is sufficient to drive meaningful adaptation in the generalist model without task-specific labels or architecture changes.

What would settle it

A head-to-head test on the 3C7U framework in which the Hidden-Shot model shows no gain over the unmodified baseline on the seven unconventional tasks would falsify the central claim.
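
The decision rule in that test can be sketched as follows. PSNR is assumed as the metric here because it is common in low-level vision; the summary does not say which metric the paper reports. The function names, task names, and pixel values are toy placeholders, not results from the paper, and only two tasks are shown instead of seven.

```python
import math


def psnr(reference, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and equal in length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def mean_gain(samples):
    """Average PSNR gain of the method over the baseline across tasks.

    samples maps a task name to (reference, baseline_output, method_output).
    """
    gains = [
        psnr(ref, method) - psnr(ref, base)
        for ref, base, method in samples.values()
    ]
    return sum(gains) / len(gains)


# Toy pixel values, hypothetical task names.
reference = [0.0, 0.5, 1.0, 0.5]
toy_samples = {
    "unseen_task_a": (reference, [0.1, 0.6, 0.9, 0.4], [0.05, 0.55, 0.95, 0.45]),
    "unseen_task_b": (reference, [0.2, 0.7, 0.8, 0.3], [0.1, 0.6, 0.9, 0.4]),
}
gain_db = mean_gain(toy_samples)
claim_survives = gain_db > 0.0  # no gain on the unseen tasks would count against the claim
```

A real check would also need per-task breakdowns and some measure of variance, since a positive mean can hide regressions on individual tasks.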

Figures

Figures reproduced from arXiv: 2607.01535 by Shao-Jun Xia, Xianzheng Ma, Zichong Meng.

Figure 1. Overview of Hidden-Shot Framework. The right half illustrates the common …
Figure 2. 3C4U Generalization Evaluation Framework. The framework consists of three …
Figure 3. Visual comparisons on three conventional tasks (3C). Hidden-Shot produces …
Figure 4. Visual comparisons on four unconventional tasks (4U). Hidden-Shot consistently …

Original abstract

Despite the intense engagement surrounding low-level vision generalist models, their effectiveness in zero/few-shot scenarios beyond learned tasks remains unverified. The primary challenge of developing an ideal generalist lies in achieving the ability to generalize from new unseen tasks, which also can be assessed by matched quantitative criteria. Existing methods have made some progress in prompt engineering but have not systematically explored this gap across a wide range of low-level visual tasks. Stimulated by the problem, we propose Hidden-Shot, an implicit prompt mechanism aimed at exploring low-level task adaptation in a vision generalist model. Specifically, the method extracts implicit visual task-based information, utilizes a global task-aware textural prompt, and selectively merges implicit information with in-task processing information to enhance one-shot capabilities in new tasks. The overall design performs direct injection in a cost-effective manner, while minimally altering the architecture of the original generalist model. Additionally, we introduce a data-driven evaluation framework termed C/U assessment to cover two basic scenarios, 3C4U (3 conventional and 4 unconventional tasks) for retraining existing models and 3C7U (3 conventional and 7 unconventional tasks) for training from scratch, as a comprehensive assessment to systematically test the generalization ability of low-level generalist models. Experiments on seven and ten datasets outperform the state-of-the-art vision generalist model, respectively verified by 3C4U and 3C7U framework. Our presented Hidden-Shot approach demonstrates superior performance on one-shot new tasks while maintaining consistent performance on existing tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Hidden-Shot, an implicit prompt mechanism for enabling one-shot task generalization in low-level vision generalist models. The method extracts implicit visual task-based information from a single example image, employs a global task-aware textural prompt, and selectively merges this information with the in-task processing to enhance adaptation to new tasks without significant architectural changes. The authors introduce the C/U assessment framework, including the 3C4U (3 conventional and 4 unconventional tasks) and 3C7U (3 conventional and 7 unconventional tasks) suites, to systematically evaluate generalization capabilities. Experiments on seven and ten datasets respectively demonstrate outperformance over the state-of-the-art vision generalist model while maintaining consistent performance on existing tasks.

Significance. If the empirical results hold, this work could be significant for the field of low-level vision by providing a practical approach to task adaptation in generalist models using only one example image. The introduction of a data-driven evaluation framework for assessing one-shot generalization on both conventional and unconventional tasks fills a gap in the current literature and could serve as a standard for future evaluations. The cost-effective design that minimally alters the original model architecture is a practical strength.

minor comments (3)
  1. [Abstract] The abstract claims outperformance on the 3C4U and 3C7U frameworks but does not provide any quantitative numbers, error bars, dataset details, or ablation results. Including key performance metrics in the abstract would strengthen the presentation of the central claims.
  2. [Abstract] The term 'textural prompt' appears to be a possible typo for 'textual prompt'; please verify the intended terminology.
  3. [Method] The description of the selective merging mechanism and the global task-aware prompt is high-level in the provided summary; expanding on implementation details in the method section would improve reproducibility and clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive summary, positive assessment of significance, and recommendation of minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical proposal of the Hidden-Shot method for one-shot task generalization in low-level vision models, with no mathematical derivation chain, first-principles predictions, or equations that could reduce to inputs by construction. Claims rest on experimental outperformance under the newly introduced C/U assessment framework (3C4U/3C7U), which is presented as external to the method itself rather than derived from it. No self-citations, fitted parameters renamed as predictions, or ansatzes are load-bearing in the provided description, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the method is described as a lightweight addition to an existing generalist architecture.



Reference graph

Works this paper leans on

71 extracted references · 12 canonical work pages · 3 internal anchors

  1. [1]

    J. Lu, C. Clark, R. Zellers, R. Mottaghi, A. Kembhavi, Unified-io: A unified model for vision, language, and multi-modal tasks, in: The Eleventh International Conference on Learning Representations, 2022

  2. [2]

    P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, H. Yang, Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, in: International Conference on Machine Learning, PMLR, 2022, pp. 23318–23340

  3. [3]

    T. Chen, S. Saxena, L. Li, T.-Y. Lin, D. J. Fleet, G. E. Hinton, A unified sequence interface for vision tasks, Advances in Neural Information Processing Systems 35 (2022) 31333–31346

  4. [4]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009

  5. [5]

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901

  6. [6]

    A. Bar, Y. Gandelsman, T. Darrell, A. Globerson, A. Efros, Visual prompting via image inpainting, Advances in Neural Information Processing Systems 35 (2022) 25005–25017

  7. [7]

    X. Wang, W. Wang, Y. Cao, C. Shen, T. Huang, Images speak in images: A generalist painter for in-context visual learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6830–6839

  8. [8]

    J. Ma, T. Cheng, G. Wang, Q. Zhang, X. Wang, L. Zhang, Prores: Exploring degradation-aware visual prompt for universal image restoration, arXiv preprint arXiv:2306.13653 (2023)

  9. [9]

    Y. Liu, X. Chen, X. Ma, X. Wang, J. Zhou, Y. Qiao, C. Dong, Unifying image processing as visual prompting question answering, in: Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 30873–30891

  10. [10]

    V. Potlapalli, S. W. Zamir, S. H. Khan, F. Shahbaz Khan, Promptir: Prompting for all-in-one image restoration, Advances in Neural Information Processing Systems 36 (2023) 71275–71293

  11. [11]

    M. V. Conde, G. Geigle, R. Timofte, Instructir: High-quality image restoration following human instructions, in: European Conference on Computer Vision, Springer, 2024, pp. 1–21

  12. [12]

    H. Yu, I. Mineyev, L. R. Varshney, J. A. Evans, Learning from one and only one shot, npj Artificial Intelligence 1 (1) (2025) 13

  13. [13]

    D. Pereg, One-shot image restoration, in: European Conference on Computer Vision, Springer, 2024, pp. 34–50

  14. [14]

    X. Chen, Y. Liu, Y. Pu, W. Zhang, J. Zhou, Y. Qiao, C. Dong, Learning a low-level vision generalist via visual task prompt, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 2671–2680

  15. [15]

    S. Wang, J. Zhang, J. Huang, F. Zhao, Image-free pre-training for low-level vision, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 8825–8834

  16. [16]

    J. Hu, Z. You, J. Gu, K. Zhu, T. Xue, C. Dong, Revisiting the generalization problem of low-level vision models through the lens of image deraining, arXiv preprint arXiv:2502.12600 (2025)

  17. [17]

    D. Shi, S. Huang, Image dehazing algorithm based on deep transfer learning and local mean adaptation, Scientific Reports 15 (1) (2025) 27956

  18. [18]

    Y. Liu, J. He, J. Gu, X. Kong, Y. Qiao, C. Dong, Degae: A new pretraining paradigm for low-level vision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23292–23303

  19. [19]

    H. Duan, X. Min, S. Wu, W. Shen, G. Zhai, Uniprocessor: a text-induced unified low-level image processor, in: European Conference on Computer Vision, Springer, 2024, pp. 180–199

  20. [20]

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of machine learning research 21 (140) (2020) 1–67

  21. [21]

    A. Kolesnikov, A. Susano Pinto, L. Beyer, X. Zhai, J. Harmsen, N. Houlsby, Uvim: A unified modeling approach for vision with learned guiding codes, Advances in Neural Information Processing Systems 35 (2022) 26295–26308

  22. [22]

    X. Zhu, J. Zhu, H. Li, X. Wu, H. Li, X. Wang, J. Dai, Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16804–16815

  23. [23]

    W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y. Qiao, et al., Visionllm: Large language model is also an open-ended decoder for vision-centric tasks, Advances in Neural Information Processing Systems 36 (2024)

  24. [24]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

  25. [25]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695

  26. [26]

    Y. Gan, S. Park, A. Schubert, A. Philippakis, A. M. Alaa, Instructcv: Instruction-tuned text-to-image diffusion models as vision generalists, arXiv preprint arXiv:2310.00390 (2023)

  27. [27]

    Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Hu, D. Chen, et al., Instructdiffusion: A generalist modeling interface for vision tasks, arXiv preprint arXiv:2309.03895 (2023)

  28. [28]

    T. Oorloff, V. Sindagi, W. G. C. Bandara, A. Shafahi, A. Ghiasi, C. Prakash, R. Ardekani, Stable diffusion models are secretly good at visual in-context learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 23604–23613

  29. [29]

    Z. Wang, Y. Jiang, Y. Lu, P. He, W. Chen, Z. Wang, M. Zhou, et al., In-context learning unlocked for diffusion models, Advances in Neural Information Processing Systems 36 (2023) 8542–8562

  30. [30]

    Z. Geng, B. Yang, T. Hang, C. Li, S. Gu, T. Zhang, J. Bao, Z. Zhang, H. Li, H. Hu, et al., Instructdiffusion: A generalist modeling interface for vision tasks, in: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2024, pp. 12709–12720

  31. [31]

    Z. Gu, S. Yang, J. Liao, J. Huo, Y. Gao, Analogist: Out-of-the-box visual in-context learning with image diffusion model, ACM Transactions on Graphics (TOG) 43 (4) (2024) 1–15

  32. [32]

    J. Gao, Y. Sun, F. Shen, X. Jiang, Z. Xing, K. Chen, C. Zhao, Faceshot: Bring any character into life, arXiv preprint arXiv:2503.00740 (2025)

  33. [33]

    J. Gao, Y. Sun, Y. Liu, Y. Tang, Y. Zeng, D. Qi, K. Chen, C. Zhao, Styleshot: A snapshot on any style, IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  34. [34]

    E. Tzeng, J. Hoffman, T. Darrell, K. Saenko, Simultaneous deep transfer across domains and tasks, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 4068–4076

  35. [35]

    A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, S. Savarese, Taskonomy: Disentangling task transfer learning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3712–3722

  36. [36]

    A. Pal, V. N. Balasubramanian, Zero-shot task transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2189–2198

  37. [37]

    A. Wang, M. Tarr, L. Wehbe, Neural taskonomy: Inferring the similarity of task-derived representations from brain activity, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019

  38. [38]

    K. Dwivedi, G. Roig, Representation similarity analysis for efficient task taxonomy & transfer learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12387–12396

  39. [39]

    D. Bhattacharjee, S. Süsstrunk, M. Salzmann, Vision transformer adapters for generalizable multitask learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19015–19026

  40. [40]

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, R. Girdhar, Masked-attention mask transformer for universal image segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299

  41. [41]

    G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708

  42. [42]

    L. Zhang, A. Rao, M. Agrawala, Adding conditional control to text-to-image diffusion models, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847

  43. [43]

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018)

  44. [44]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763

  45. [45]

    S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, Restormer: Efficient transformer for high-resolution image restoration, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739

  46. [46]

    J. Jiang, Z. Zuo, G. Wu, K. Jiang, X. Liu, A survey on all-in-one image restoration: Taxonomy, evaluation and future trends, IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  47. [47]

    Y. Cui, S. W. Zamir, S. Khan, A. Knoll, M. Shah, F. S. Khan, Adair: Adaptive all-in-one image restoration via frequency mining and modulation, in: 13th International Conference on Learning Representations, ICLR 2025, International Conference on Learning Representations, ICLR, 2025, pp. 57335–57356

  48. [48]

    J. Jain, Y. Zhou, N. Yu, H. Shi, Keys to better image inpainting: Structure and texture go hand in hand, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp. 208–217

  49. [49]

    H. Jin, Y. Li, F. Luan, Y. Xiangli, S. Bi, K. Zhang, Z. Xu, J. Sun, N. Snavely, Neural gaffer: Relighting any object via diffusion, Advances in Neural Information Processing Systems 37 (2024) 141129–141152

  50. [50]

    X. Cheng, Z. Fu, J. Yang, Multi-scale dynamic feature encoding network for image demoiréing, in: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE, 2019, pp. 3486–3493

  51. [51]

    S. Kim, Y. Huo, S.-E. Yoon, Single image reflection removal with physically-based training images, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5164–5173

  52. [52]

    A. Abdelhamed, S. Lin, M. S. Brown, A high-quality denoising dataset for smartphone cameras, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  53. [53]

    C. Wei, W. Wang, W. Yang, J. Liu, Deep retinex decomposition for low-light enhancement, arXiv preprint arXiv:1808.04560 (2018)

  54. [54]

    S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, L. Shao, Learning enriched features for fast image restoration and enhancement, IEEE transactions on pattern analysis and machine intelligence 45 (2) (2022) 1934–1948

  55. [55]

    C. Ancuti, C. O. Ancuti, C. De Vleeschouwer, D-hazy: A dataset to evaluate quantitatively dehazing algorithms, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 2226–2230. doi:10.1109/ICIP.2016.7532754

  56. [56]

    S. Nah, T. Hyun Kim, K. Mu Lee, Deep multi-scale convolutional neural network for dynamic scene deblurring, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3883–3891

  57. [57]

    J. Wang, X. Li, J. Yang, Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1788–1797

  58. [58]

    P. Roy, S. Ghosh, S. Bhattacharya, U. Pal, Effects of degradations on deep neural network architectures, arXiv preprint arXiv:1807.10108 (2018)

  59. [59]

    M. E. Helou, R. Zhou, J. Barthas, S. Süsstrunk, Vidit: Virtual image dataset for illumination transfer, arXiv preprint arXiv:2005.05460 (2020)

  60. [60]

    X. Yu, P. Dai, W. Li, L. Ma, J. Shen, J. Li, X. Qi, Towards efficient and scale-robust ultra-high-definition image demoiréing, in: European Conference on Computer Vision, Springer, 2022, pp. 646–662

  61. [61]

    R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, A. C. Kot, Benchmarking single-image reflection removal algorithms, in: International Conference on Computer Vision (ICCV), 2017

  62. [62]

    X. Cheng, T. Cao, G. Cheng, B. Huang, X. Tian, Y. Wang, X. He, W. Li, T. Xue, X. Dong, Consistent diffusion: Denoising diffusion model with data-consistent training for image restoration, arXiv preprint arXiv:2412.12550 (2024)

  63. [63]

    N. Kamali, K. Nakamura, A. Kumar, A. Chatzimparmpas, J. Hullman, M. Groh, Characterizing photorealism and artifacts in diffusion model-generated images, in: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025, pp. 1–26

  64. [64]

    Q. Wu, Y. Liu, H. Zhao, T. Bui, Z. Lin, Y. Zhang, S. Chang, Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7766–7776

  65. [65]

    Z. Meng, C. Yang, J. Liu, H. Tang, P. Zhao, Y. Wang, Instructgie: Towards generalizable image editing, in: European Conference on Computer Vision, Springer, 2024, pp. 18–34

  66. [66]

    X. Soria, Y. Li, M. Rouhani, A. D. Sappa, Tiny and efficient model for the edge detection generalization, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1364–1373

  67. [67]

    D. Cheng, Y. Li, D. Zhang, N. Wang, J. Sun, X. Gao, Progressive negative enhancing contrastive learning for image dehazing and beyond, IEEE Transactions on Multimedia 26 (2024) 8783–8798

  68. [68]

    D. Cheng, Y. Ji, D. Gong, Y. Li, N. Wang, J. Han, D. Zhang, Continual all-in-one adverse weather removal with knowledge replay on a unified network structure, IEEE Transactions on Multimedia 26 (2024) 8184–8196

  69. [69]

    M. El Helou, R. Zhou, S. Susstrunk, R. Timofte, Ntire 2021 depth guided image relighting challenge, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 566–577

  70. [70]

    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Syste...

  71. [71]

    Original Painter, Instruct-Diffusion, and Prompt Diffusion all use the officially released weights for inference

    tokenizer (with sequences padded to a maximum length of 256) for implicit global textural prompt, and a pretrained frozen CLIP [44] model (ViT-B/32) [24] for implicit information combination (implicit learning matrix: q = 256) and embedding space unification. Original Painter, Instruct-Diffusion, and Prompt Diffusion all use the officially released wei...