pith. machine review for the scientific record.

arXiv: 2604.16086 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI · cs.LG · stat.ML

Recognition: unknown

Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · stat.ML

keywords self-supervised learning · disentanglement · appearance modeling · style-content separation · weather analysis · melanoma detection · JEPA · adversarial training

The pith

ST-STORM disentangles image appearance from object content in self-supervised learning so that weather effects and textures become explicit semantic signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard SSL methods like MoCo and DINO train for invariance to illumination and geometry because those changes are usually noise when the goal is object recognition. The paper argues this invariance becomes harmful in domains where appearance itself carries the key information, such as rain streaks for road grip assessment or texture variations for melanoma detection. ST-STORM therefore splits the representation into two gated latent streams: a Content branch that uses a JEPA predictor plus contrastive loss to stay stable across appearance changes, and a Style branch that must predict and reconstruct appearance signatures under an adversarial constraint. Experiments on ImageNet-1K, a multi-weather dataset, and the ISIC 2024 melanoma challenge show the Style branch isolates these cues at high accuracy while the Content branch retains its semantic performance. The separation is presented as a way to make both kinds of information available without forcing one to suppress the other.

Core claim

By maintaining two explicitly gated latent streams, one optimized for appearance-invariant semantics via JEPA and contrastive objectives and the other optimized for appearance signatures via prediction, reconstruction, and adversarial losses, the model isolates complex appearance phenomena such as atmospheric scattering and texture patterns while preserving object-level semantic representations.

What carries the argument

Dual gated latent streams: Content branch (JEPA predictor plus contrastive loss for semantic invariance) and Style branch (feature prediction, reconstruction, and adversarial constraint for appearance signatures).
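Since the gating carries the argument, a concrete shape helps. Below is a minimal PyTorch sketch of one plausible reading: a shared backbone whose token features are split by a learned per-token sigmoid gate into complementary Content and Style streams. The module names and the complementary (g, 1 − g) split are our assumptions; the paper does not publish the gating equations.

```python
# Minimal sketch of a dual-stream encoder with learned gating, assuming a
# shared backbone whose token features are routed to Content and Style heads.
# The sigmoid-gate form and the (g, 1 - g) split are illustrative guesses.
import torch
import torch.nn as nn

class GatedDualStream(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One scalar gate per token decides how much of each feature
        # flows to the Content branch versus the Style branch.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.content_head = nn.Linear(dim, dim)  # fed to JEPA predictor + contrastive loss
        self.style_head = nn.Linear(dim, dim)    # fed to prediction/reconstruction + adversary

    def forward(self, tokens: torch.Tensor):
        h = self.backbone(tokens)                 # (B, N, dim)
        g = self.gate(h)                          # (B, N, 1) in [0, 1]
        z_content = self.content_head(g * h)          # appearance-invariant stream
        z_style = self.style_head((1.0 - g) * h)      # appearance-signature stream
        return z_content, z_style

# Usage: route each stream to its own objectives.
x = torch.randn(2, 196, 768)                      # e.g. 14x14 patch tokens
zc, zs = GatedDualStream()(x)
```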

If this is right

  • The Style branch reaches F1 of 97 percent on multi-weather characterization while the Content branch holds F1 of 80 percent on ImageNet-1K.
  • The Style branch reaches F1 of 94 percent on ISIC 2024 melanoma detection using only 10 percent labeled data.
  • Critical appearance cues such as ground conditions and atmospheric effects remain available for downstream tasks like autonomous driving without trading off object recognition accuracy.
  • The hybrid architecture improves preservation of appearance information compared with purely invariant SSL baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gated separation could be applied to other appearance-heavy tasks such as material classification or art-style analysis without retraining the entire network.
  • If the streams remain cleanly separated, the Style features might serve as a lightweight prior for few-shot adaptation on new weather or lighting conditions.
  • A direct test would measure whether Style-branch representations transfer to material or reflectance estimation benchmarks that are not mentioned in the current evaluation.

Load-bearing premise

Explicit gating mechanisms and separate objectives can enforce clean disentanglement between content and style latent streams without information leakage or objective conflict.
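The premise can be made concrete. The abstract names an adversarial constraint on the Style branch but not its form; a common device for enforcing "no content information in style" is a gradient-reversal layer feeding a content classifier. The sketch below assumes that choice, which is ours rather than the paper's.

```python
# Sketch of one standard adversarial constraint for disentanglement:
# a gradient-reversal layer plus a content classifier on Style features.
# This is an assumed form; the paper does not specify its adversary.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient: the encoder learns to *defeat* the classifier.
        return -ctx.lamb * grad_output, None

class StyleAdversary(nn.Module):
    def __init__(self, dim: int, num_classes: int, lamb: float = 1.0):
        super().__init__()
        self.lamb = lamb
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, z_style: torch.Tensor, content_labels: torch.Tensor):
        z = GradReverse.apply(z_style, self.lamb)
        # Minimizing this trains the classifier to read content from style
        # features, while the reversed gradient pushes the Style branch to
        # erase that information.
        return nn.functional.cross_entropy(self.classifier(z), content_labels)
```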

What would settle it

A controlled ablation in which removing the gating layers or the adversarial term causes the Style branch features to become predictive of object category labels on ImageNet or causes the Content branch F1 to drop below the reported 80 percent.
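That test reduces to a frozen-feature linear probe: train a linear classifier from Style-branch embeddings to object labels and compare its accuracy against the Content branch. A minimal sketch; `encode_style`, the loader, and the hyperparameters are placeholders, not the paper's protocol.

```python
# Leakage probe: a linear classifier from frozen Style features to object
# labels. If it approaches Content-branch accuracy, separation is leaking.
import torch
import torch.nn as nn

def probe_leakage(encode_style, loader, num_classes: int, dim: int,
                  epochs: int = 5, device: str = "cpu") -> float:
    probe = nn.Linear(dim, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():                 # the branch stays frozen
                feats = encode_style(images.to(device))
            loss = loss_fn(probe(feats), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Accuracy on the same loader for brevity; use a held-out split in practice.
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            pred = probe(encode_style(images.to(device))).argmax(dim=1)
            correct += (pred == labels.to(device)).sum().item()
            total += labels.numel()
    return correct / total
```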

Figures

Figures reproduced from arXiv: 2604.16086 by Frédéric Bernardin, Hamed Ouattara, Omar Ait Aider, Pascal Houssam Salmane, Pierre Duthon.

Figure 1. Toward a semantic representation of appearance conditions. The goal is to transform the perception of style: no longer to see it as a random disturbance, but as a structured category. This approach mimics the human ability to dissociate the object (the car) from the context (the rain), while understanding that rain is a physical entity with predictable consequences (slippery road, reduced visibility), whic…

Figure 2. Weather simulation by CycleGAN style transfer [11].

Figure 3. Illustration of the invariance paradigm in self-supervised learning. Consider the same cat photographed in an apartment: it moves from the living room to the bedroom, the lighting changes (shadow, lamp, daylight), it adopts different poses, and it may be partially occluded (for example under a bed). Despite these appearance variations, the identity of the object does not change: it is still the same cat. T…

Figure 5. Principle of predictability at the embedding level (I-JEPA). From a partial context, some properties of a cat are predictable (presence of ears, whiskers, tail), whereas contingent details are much less so (exact fur color, eye shade). I-JEPA therefore imposes a prediction constraint in the representation space rather than at the pixel level, in order to favor abstract and transferable information. In our …

Figure 7. Neural style transfer [33]: a synthetic image is optimized to preserve …

Figure 8. The receptive field of a neuron is the part of the image to which it is …

Figure 9. Semantic learning of style. (A) An image …

Figure 10. As in the human brain, where certain regions are specialized for …

Figure 11. Partitioning into pseudo-domains and cyclic learning. The dataset is …

Figure 12. Architectural disentanglement of ST-STORM. The content path (U…

Figure 13. The predictability constraint acts as a filter that removes contingent …

Figure 14. Controlled and learned perturbation of style on an image from the ISIC 2024 Challenge dataset. The model mainly modifies appearance signatures, then learns to reconstruct the original appearance under a predictability constraint, in order to obtain semantic and reusable style tokens for downstream tasks. Lmse measures the prediction error of masked tokens; Lvar penalizes variance collapse; Lcov pena…

Figure 16. Semantic learning of content with MoCo. The main difference from …
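Figure 14's caption names three Style-branch regularizers (Lmse, Lvar, Lcov) whose descriptions match the VICReg family. A minimal sketch, assuming the standard VICReg-style definitions; the caption is truncated, so the paper's exact forms may differ.

```python
# Sketch of the three regularizers named in Figure 14's caption, assuming
# standard VICReg-style definitions (an assumption, not the paper's code).
import torch
import torch.nn.functional as F

def l_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Prediction error on masked tokens.
    return F.mse_loss(pred, target)

def l_var(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    # Hinge on per-dimension standard deviation; penalizes variance collapse.
    std = torch.sqrt(z.var(dim=0) + eps)
    return torch.relu(gamma - std).mean()

def l_cov(z: torch.Tensor) -> torch.Tensor:
    # Penalizes off-diagonal covariance, decorrelating embedding dimensions.
    n, d = z.shape
    z = z - z.mean(dim=0)
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return (off_diag ** 2).sum() / d
```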
read the original abstract

One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data), without degrading the semantic performance (F1=80% on ImageNet-1K) of the Content branch, and improves the preservation of critical appearance…

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ST-STORM, a hybrid self-supervised learning framework that explicitly disentangles content and style via two gated latent streams. The Content branch combines a JEPA scheme with contrastive learning to produce appearance-invariant semantic representations, while the Style branch employs feature prediction, reconstruction, and adversarial objectives to isolate appearance signatures such as textures, contrasts, and scattering effects. Evaluations claim that the Style branch achieves F1=97% on Multi-Weather and F1=94% on ISIC 2024 (with only 10% labeled data) without degrading the Content branch's F1=80% on ImageNet-1K.

Significance. If the reported disentanglement is verified, the work would be significant for computer vision applications in which appearance is itself the discriminative signal (e.g., weather-aware autonomous driving or dermatological imaging). It directly addresses a known limitation of invariance-focused SSL methods such as MoCo and DINO by treating style as a semantic modality rather than noise, and the dual-branch design with explicit gating offers a concrete architectural alternative.

major comments (2)
  1. [Evaluation] Evaluation section: The headline claim that the Style branch 'effectively isolates complex appearance phenomena' while the Content branch remains unaffected rests on the F1 scores (97% Multi-Weather, 94% ISIC, 80% ImageNet-1K), yet no cross-branch diagnostics (mutual information, linear probes from one branch into the other, or gating ablations) are reported. Without these, the numbers are compatible with both successful separation and with simple specialization under separate losses.
  2. [Methods] Methods section: The architecture is described as using 'gating mechanisms' and separate objectives (JEPA+contrastive on Content; prediction+reconstruction+adversarial on Style), but no equations for the combined loss, no diagram of the gating, and no training hyper-parameters are supplied. These details are load-bearing for reproducing the claimed separation and for assessing whether objective conflict or leakage occurs.
minor comments (1)
  1. [Abstract] The final sentence of the abstract is truncated ('and improves the preservation of critical appearance').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential significance of explicitly treating style as a semantic modality in SSL. We address each major comment point by point below, with clarifications and commitments to revisions that strengthen the presentation of our results and methods without altering the core claims.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline claim that the Style branch 'effectively isolates complex appearance phenomena' while the Content branch remains unaffected rests on the F1 scores (97% Multi-Weather, 94% ISIC, 80% ImageNet-1K), yet no cross-branch diagnostics (mutual information, linear probes from one branch into the other, or gating ablations) are reported. Without these, the numbers are compatible with both successful separation and with simple specialization under separate losses.

    Authors: We agree that additional cross-branch diagnostics would provide more direct evidence of disentanglement beyond the task-specific F1 scores. The current results show the Style branch achieving high accuracy on appearance-driven tasks (F1=97% Multi-Weather, 94% ISIC with limited labels) while the Content branch retains semantic performance (F1=80% ImageNet-1K) under its invariance-focused objectives, which is enabled by the gated dual-stream design. To rule out the possibility of simple specialization without separation, we will add in revision: mutual information estimates between the two branches' representations, linear probes trained on one branch to predict the other, and gating ablations. These will appear in a new subsection of the Experiments section. revision: yes

  2. Referee: [Methods] Methods section: The architecture is described as using 'gating mechanisms' and separate objectives (JEPA+contrastive on Content; prediction+reconstruction+adversarial on Style), but no equations for the combined loss, no diagram of the gating, and no training hyper-parameters are supplied. These details are load-bearing for reproducing the claimed separation and for assessing whether objective conflict or leakage occurs.

    Authors: We acknowledge that the manuscript currently omits these implementation details, which are essential for reproducibility and for evaluating potential conflicts between the branch objectives. In the revised manuscript we will add: (1) a figure explicitly diagramming the dual-branch architecture and gating mechanisms, (2) the full mathematical formulation of the combined loss (including weighting coefficients for the JEPA, contrastive, prediction, reconstruction, and adversarial terms), and (3) a table listing all training hyperparameters such as learning rates, batch sizes, and regularization strengths. These additions will allow readers to assess objective interactions and leakage. revision: yes
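Until the promised formulation appears, one plausible shape for the combined objective is a weighted sum of the five named terms; the grouping and the λ coefficients below are our guess, not the paper's equation.

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\mathcal{L}_{\text{JEPA}} + \lambda_{c}\,\mathcal{L}_{\text{contrast}}}_{\text{Content branch}}
  + \underbrace{\lambda_{p}\,\mathcal{L}_{\text{pred}} + \lambda_{r}\,\mathcal{L}_{\text{recon}} + \lambda_{a}\,\mathcal{L}_{\text{adv}}}_{\text{Style branch}}
```

Under this reading, the λ weights would join the gating weights in the free-parameter ledger below.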

Circularity Check

0 steps flagged

No circularity; empirical claims rest on training and evaluation, not self-referential derivations

full rationale

The paper introduces an architectural framework (gated content/style branches with JEPA+contrastive vs. prediction+reconstruction+adversarial objectives) and reports F1 scores on ImageNet-1K, Multi-Weather, and ISIC 2024. No equations, first-principles derivations, or predictions appear that reduce to fitted inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central claims. Results are obtained via standard empirical SSL training and downstream evaluation, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework rests on the unproven premise that appearance and semantics occupy cleanly separable latent subspaces and that the chosen objectives plus gating will enforce that separation; no independent evidence for these premises is supplied in the abstract.

free parameters (1)
  • Gating mechanism weights
    Parameters that control information flow between content and style streams; their values are not specified and must be learned or tuned.
axioms (1)
  • domain assumption: Appearance signatures can be isolated from semantic content via separate optimization objectives and gating
    Invoked to justify the dual-branch design and the claim that style does not degrade content performance.
invented entities (2)
  • Content branch (no independent evidence)
    purpose: Produce appearance-invariant semantic representations via JEPA and contrastive loss
    New architectural component introduced to achieve invariance.
  • Style branch (no independent evidence)
    purpose: Capture textures, contrasts and scattering via feature prediction, reconstruction and adversarial constraint
    New architectural component introduced to treat appearance as a semantic modality.

pith-pipeline@v0.9.0 · 5641 in / 1340 out tokens · 54305 ms · 2026-05-10T08:56:39.980365+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1] C. Harris, M. Stephens et al., "A combined corner and edge detector," in Alvey Vision Conference, vol. 15, no. 50. Manchester, UK, 1988, pp. 10–5244.

  2. [2] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2. IEEE, 1999, pp. 1150–1157.

  3. [3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

  4. [4] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in European Conference on Computer Vision. Springer, 2006, pp. 404–417.

  5. [5] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas, "Self-supervised learning from images with a joint-embedding predictive architecture," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15619–15629.

  6. [6] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski, "DINOv3," 2025. [Online]. Available: h…

  7. [7] X. Chen, S. Xie, and K. He, "An empirical study of training self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9640–9649.

  8. [8] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, "What should not be contrastive in contrastive learning," arXiv preprint arXiv:2008.05659, 2020.

  9. [9] R. Chavhan, H. Gouk, J. Stuehmer, C. Heggan, M. Yaghoobi, and T. Hospedales, "Amortised invariance learning for contrastive self-supervision," arXiv preprint arXiv:2302.12712, 2023.

  10. [10] Y. Wang, K. Hu, S. Gupta, Z. Ye, Y. Wang, and S. Jegelka, "Understanding the role of equivariance in self-supervised learning." [Online]. Available: https://arxiv.org/abs/2411.06508

  11. [11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," arXiv preprint arXiv:1703.10593, 2020. [Online]. Available: https://arxiv.org/abs/1703.10593

  12. [12] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, "Contrastive learning for unpaired image-to-image translation," arXiv preprint arXiv:2007.15651, 2020. [Online]. Available: https://arxiv.org/abs/2007.15651

  13. [13] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.

  14. [14] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister, "Sliced and Radon Wasserstein barycenters of measures," Journal of Mathematical Imaging and Vision, vol. 51, no. 1, pp. 22–45, 2015.

  15. [15] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.

  16. [16] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.

  17. [17] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "SimCLR: A simple framework for contrastive learning of visual representations," in International Conference on Learning Representations, vol. 2, no. 4. PMLR, New York, NY, USA, 2020.

  18. [18] H. Bao, L. Dong, S. Piao, and F. Wei, "BEiT: BERT pre-training of image transformers," arXiv preprint arXiv:2106.08254, 2021.

  19. [19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

  20. [20] L. Yang, S.-W. Li, Y. Li, X. Lei, D. Wang, A. Mohamed, H. Zhao, and H. Xu, "In pursuit of pixel supervision for visual pre-training," arXiv preprint arXiv:2512.15715, 2025.

  21. [21] H. V. Assel, M. Ibrahim, T. Biancalani, A. Regev, and R. Balestriero, "Joint embedding vs reconstruction: Provable benefits of latent space prediction for self supervised learning," 2025. [Online]. Available: https://arxiv.org/abs/2505.12477

  22. [22] Y. Liu, S. Zhang, J. Chen, K. Chen, and D. Lin, "PixMIM: Rethinking pixel reconstruction in masked image modeling," arXiv preprint arXiv:2303.02416, 2023.

  23. [23] H. Xue, P. Gao, H. Li, Y. Qiao, H. Sun, H. Li, and J. Luo, "Stare at what you see: Masked image modeling without reconstruction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22732–22741.

  24. [24] Z. Fei, M. Fan, and J. Huang, "A-JEPA: Joint-embedding predictive architecture can listen," arXiv preprint arXiv:2311.15830, 2023.

  25. [25] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus et al., "V-JEPA 2: Self-supervised video models enable understanding, prediction and planning," arXiv preprint arXiv:2506.09985, 2025.

  26. [26] E. Ship, E. Spivak, S. Agarwal, R. Birman, and O. Hadar, "Real-time weather image classification with SVM," arXiv preprint arXiv:2409.00821, 2024.

  27. [27] X. Zhao, P. Liu, J. Liu, and X. Tang, "Feature extraction for classification of different weather conditions," Frontiers of Electrical and Electronic Engineering in China, vol. 6, no. 2, pp. 339–346, 2011.

  28. [28] J. Cui, J. Shen, J. Wei, S. Liu, Z. Ye, S. Luo, and Z. Qin, "Community transferrable representation learning for image style classification," ACM Transactions on Multimedia Computing, Communications and Applications, 2025.

  29. [29] D. Ruta, G. C. Tarres, A. Black, A. Gilbert, and J. Collomosse, "ALADIN-NST: Self-supervised disentangled representation learning of artistic style through neural style transfer," arXiv preprint arXiv:2304.05755, 2023.

  30. [30] S. Gairola, R. Shah, and P. Narayanan, "Unsupervised image style embeddings for retrieval and recognition tasks," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3281–3289.

  31. [31] H. Zhang, J. Xue, and K. Dana, "Deep TEN: Texture encoding network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 708–717.

  32. [32] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNNs for fine-grained visual recognition," 2017. [Online]. Available: https://arxiv.org/abs/1504.07889

  33. [33] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.

  34. [34] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2015. [Online]. Available: https://arxiv.org/abs/1409.1556

  35. [35] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint arXiv:1611.07004, 2018. [Online]. Available: https://arxiv.org/abs/1611.07004

  36. [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.

  37. [37] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "Semantic image synthesis with spatially-adaptive normalization," 2019. [Online]. Available: https://arxiv.org/abs/1903.07291

  38. [38] J. Grob and J. Bonerandi, "The 'ugly duckling' sign: identification of the common characteristics of nevi in an individual as a basis for melanoma screening," Archives of Dermatology, vol. 134, no. 1, pp. 103–104, 1998.

  39. [39] R. J. Friedman, D. S. Rigel, and A. W. Kopf, "Early detection of malignant melanoma: the role of physician examination and self-examination of the skin," CA: A Cancer Journal for Clinicians, vol. 35, no. 3, pp. 130–151, 1985.

  40. [40] Z. Zhang and H. Ma, "Multi-class weather classification on single images," in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 4396–4400.

  41. [41] H. Xiao, "Weather phenomenon database (WEAPD)," Harvard Dataverse dataset, p. 627, 2021.

  42. [42] V. Afxentiou and T. Vladimirova, "Evaluation of CNN-based approaches to adverse weather image classification for autonomous driving systems," IEEE Open Journal of Intelligent Transportation Systems, 2025.

  43. [43] J. Li and X. Luo, "A study of weather-image classification combining ViT and a dual enhanced-attention module," Electronics, vol. 12, no. 5, p. 1213, 2023.

  44. [44] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., "Bootstrap your own latent: A new approach to self-supervised learning," Advances in Neural Information Processing Systems, vol. 33, pp. 21271–21284, 2020.

  45. [45] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, "Emerging properties in self-supervised vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.

  46. [46] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, "Big self-supervised models are strong semi-supervised learners," Advances in Neural Information Processing Systems, vol. 33, pp. 22243–22255, 2020.