Harnessing Self-Supervised Features for Art Classification

Davide Bilardello; Emanuele Prato; Evelyn Turri; Federico Melis; Lorenzo Baraldi

arxiv: 2605.18974 · v1 · pith:U7O3B2PDnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI· cs.MM

Harnessing Self-Supervised Features for Art Classification

Federico Melis , Davide Bilardello , Emanuele Prato , Evelyn Turri , Lorenzo Baraldi This is my paper

Pith reviewed 2026-05-20 10:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.MM

keywords self-supervised learningartwork classificationfeature extractionDINOCLIPpaintingsimage retrievalcomputer vision

0 comments

The pith

Self-supervised backbones lead to consistent improvements in artwork classification performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic study of supervised versus self-supervised backbones for extracting features to classify and retrieve artworks, especially paintings. It tests models from the DINO family and CLIP across several classification strategies and ways of representing the features. The key result is that self-supervised backbones yield better classification results. A sympathetic reader would care because this could lead to more effective tools for organizing and searching large art collections without relying as heavily on labeled training data.

Core claim

Through extensive experiments with DINO family and CLIP models, the paper shows that self-supervised backbones outperform supervised ones as feature extractors for both artwork classification and retrieval tasks, with particular gains for paintings.

What carries the argument

Self-supervised backbones from the DINO family and CLIP models used as feature extractors with multiple classification strategies and feature representations.

If this is right

Employing a self-supervised backbone leads to consistent improvements in artwork classification performance.
The same self-supervised features support both classification and retrieval of artworks.
The approach supplies insights for practical uses such as classification and retrieval modules in VR applications that support museum navigation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Self-supervised features may reduce reliance on large labeled art datasets for training effective classifiers.
The gains could be tested on other fine-grained visual domains outside paintings.
Higher classification accuracy might improve content-based search and navigation inside virtual museum environments.

Load-bearing premise

The experimental setup using DINO family and CLIP models with multiple classification strategies and feature representations provides a fair and generalizable comparison between supervised and self-supervised approaches for paintings.

What would settle it

A follow-up experiment on a new set of artworks or different self-supervised models that finds no improvement or worse results from the self-supervised backbones would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18974 by Davide Bilardello, Emanuele Prato, Evelyn Turri, Federico Melis, Lorenzo Baraldi.

**Figure 2.** Figure 2: Qualitative results for style and genre classification. For each image, we present predictions for EfficientNetV2, DINO-Linear and CLIP-Linear (Ours) compared to WikiArt [8] Ground Truth. 4.3. Performance on WikiArt Classification. We report in [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: This figure shows qualitative retrieval results obtained with CLIP-ViT-L/14. The query images are depicted in the first column for each row, and they span across different classes of style and genre. For each query sample, we show top-5 best retrieved images. these features, we include an additional retrieval example using a query image outside the WikiArt dataset. The query depicts "La creazione di Adamo"… view at source ↗

read the original abstract

Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-supervised features improve art classification in systematic tests, but the work is mostly an application study.

read the letter

The main thing to know is that this paper finds self-supervised models give better results than supervised ones when used as backbones for art classification and retrieval tasks. They test the DINO family and CLIP on painting datasets, using various ways to classify and retrieve based on the extracted features. The paper does a good job laying out the experimental comparisons across strategies and representations. It also highlights potential uses in virtual reality for museums, which adds a practical dimension to the work. Where it could be stronger is in quantifying how much better the self-supervised approach is and whether those gains hold up under different conditions or art genres. The abstract emphasizes consistent improvements, but the actual effect sizes and any ablation studies would help assess if this is a meaningful advance or just expected transfer learning behavior. Since no new models are introduced, the contribution rests on the thoroughness of the domain-specific tests. The citation pattern looks standard and appropriate. Readers working on computer vision for cultural heritage or fine-grained image tasks would find this relevant. It could serve as a reference for choosing backbones in similar specialized domains. Overall, this deserves peer review. The experiments appear well-structured for an application paper, and feedback from referees could improve the discussion of limitations and broader implications.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic empirical investigation comparing supervised and self-supervised backbones (DINO family and CLIP models) as feature extractors for artwork classification and retrieval tasks, with emphasis on paintings. Multiple classification strategies and feature representations are evaluated, leading to the central claim that self-supervised backbones yield consistent performance improvements; the work also discusses applicability to real-world settings such as VR museum navigation.

Significance. If the reported gains are robust and obtained under matched experimental conditions, the results would offer useful evidence on the transferability of self-supervised representations to fine-grained, domain-specific visual tasks in cultural heritage. The practical orientation toward retrieval and VR applications strengthens the potential impact within computer vision for specialized datasets.

major comments (2)

[§4] §4 (Experimental Evaluation): the claim of 'consistent improvements' across backbones requires explicit reporting of per-model accuracy deltas, standard deviations, and statistical tests; without these, it is difficult to assess whether observed gains exceed experimental noise or dataset-specific effects.
[Table 2] Table 2 (Classification results): the comparison between supervised and self-supervised backbones must clarify whether the supervised baselines were trained from scratch on the same art datasets or used ImageNet-pretrained weights; mismatched pretraining regimes would undermine the fairness of the central claim.

minor comments (2)

[Abstract] The abstract mentions 'multiple classification strategies' but does not enumerate them; a brief list in the abstract or §3 would improve clarity.
[Figure 3] Figure 3 (feature visualization): axis labels and color scales are difficult to read at the provided resolution; consider increasing font size and adding a legend for the different backbone types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and clarity that we address below. We have prepared revisions to strengthen the presentation of our results.

read point-by-point responses

Referee: [§4] §4 (Experimental Evaluation): the claim of 'consistent improvements' across backbones requires explicit reporting of per-model accuracy deltas, standard deviations, and statistical tests; without these, it is difficult to assess whether observed gains exceed experimental noise or dataset-specific effects.

Authors: We agree that reporting per-model accuracy deltas, standard deviations from multiple runs, and statistical tests would allow readers to better evaluate the robustness of the improvements. Our experiments include results from repeated trials with different random seeds, enabling computation of these metrics. In the revised manuscript, we will add this information to Section 4, including deltas relative to supervised baselines and results from paired statistical tests. revision: yes
Referee: [Table 2] Table 2 (Classification results): the comparison between supervised and self-supervised backbones must clarify whether the supervised baselines were trained from scratch on the same art datasets or used ImageNet-pretrained weights; mismatched pretraining regimes would undermine the fairness of the central claim.

Authors: We thank the referee for this observation. In our evaluation protocol, the supervised baselines use standard ImageNet-pretrained weights as feature extractors, while the self-supervised models (DINO and CLIP) use their respective pretraining. No models were trained from scratch on the art datasets; all backbones remain frozen, with only a linear classifier trained on top for the target task. This ensures a matched comparison of representation quality. We will revise the text describing Table 2 and the experimental setup in Section 4 to state the pretraining details explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper conducts an experimental comparison of supervised versus self-supervised backbones (DINO family and CLIP) for artwork classification and retrieval. The central claim rests on reported performance improvements from empirical runs across multiple classification strategies and feature representations. No derivation chain, equations, or self-citation load-bearing steps are present that reduce results to fitted inputs by construction. The evaluation setup is described as providing a fair comparison, and results are presented as direct measurements rather than predictions forced by prior definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract; the central claim rests on standard domain assumptions in computer vision about transferability of self-supervised features to fine-grained visual tasks.

axioms (1)

domain assumption Self-supervised models produce general-purpose visual features transferable to artwork classification and retrieval.
Implied by the choice to evaluate DINO and CLIP as backbones for the target tasks.

pith-pipeline@v0.9.0 · 5649 in / 1184 out tokens · 40365 ms · 2026-05-20T10:27:34.636465+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ a pre-trained vision encoder from CLIP and DINO... linear classification layers are trained using cross-entropy loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1]

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: CVPR, 2016

work page 2016
[2]

Simonyan, A

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in: ICLR, 2015

work page 2015
[3]

Szegedy, W

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going Deeper With Convolutions, in: CVPR, 2015

work page 2015
[4]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging Properties in Self-Supervised Vision Transformers, in: ICCV, 2021

work page 2021
[5]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, P. Bojanowski, DINOv2: Learning Robust Visual Features without Supervisio...

work page 2024
[6]

DINOv3

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al., Dinov3, arXiv preprint arXiv:2508.10104 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning Transferable Visual Models From Natural Language Supervision, in: ICML, 2021

work page 2021
[8]

W. R. Tan, C. S. Chan, H. E. Aguirre, K. Tanaka, Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork, IEEE TIP (2018)

work page 2018
[9]

M. Tan, Q. Le, EfficientNetV2: Smaller Models and Faster Training, in: ICML, 2021

work page 2021
[10]

Recognizing image style,

S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, H. Winnemoeller, Recognizing Image Style, arXiv preprint arXiv:1311.3715 (2013)

work page arXiv 2013
[11]

R. S. Arora, A. Elgammal, Towards automated classification of fine-art painting style: A comparative study, in: ICPR, 2012

work page 2012
[12]

Zujovic, L

J. Zujovic, L. Gandy, S. Friedman, B. Pardo, T. N. Pappas, Classifying paintings by artistic genre: An analysis of features & classifiers, in: 2009 IEEE international workshop on multimedia signal processing, 2009

work page 2009
[13]

Cetinic, T

E. Cetinic, T. Lipic, S. Grgic, Fine-tuning Convolutional Neural Networks for fine art classification, Expert Systems with Applications (2018)

work page 2018
[14]

Y. Hong, J. Kim, Art painting detection and identification based on deep learning and image local features, Multimedia Tools and Applications (2019)

work page 2019
[15]

M. V. Conde, K. Turgutlu, CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification, in: CVPR Workshops, 2021

work page 2021
[16]

Zhang, C

C. Zhang, C. Kaeser-Chen, G. Vesom, J. Choi, M. Kessler, S. Belongie, The iMet Collection 2019 Challenge Dataset, 2019

work page 2019
[17]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001
[18]

Grill, F

J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning, Advances in neural information processing systems (2020)

work page 2020
[19]

Pérez-García, H

F. Pérez-García, H. Sharma, S. Bond-Taylor, K. Bouzid, V. Salvatelli, M. Ilse, S. Bannur, D. C. Castro, A. Schwaighofer, M. P. Lungren, M. T. Wetscherek, N. Codella, S. L. Hyland, J. Alvarez-Valle, O. Oktay, Exploring scalable medical image encoders beyond text supervision, Nature Machine Intelligence (2025)

work page 2025
[20]

G. M. Çökmez, Y. Zhang, C. Schroers, T. O. Aydin, CLIP-Fusion: A Spatio-Temporal Quality Metric for Frame Interpolation, in: WACV, 2025

work page 2025
[21]

G. Wu, J. Chen, W. Zhang, R. Wang, Feature Adaptation with CLIP for Few-shot Classification, in: Proceedings of the 5th ACM International Conference on Multimedia in Asia, 2023

work page 2023
[22]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009

work page 2009
[23]

Douze, A

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The Faiss Library, IEEE Transactions on Big Data (2025)

work page 2025
[24]

Srivastava, G

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, The journal of machine learning research (2014)

work page 2014
[25]

Loshchilov, F

I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, in: ICLR, 2019

work page 2019
[26]

D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, CoRR (2014)

work page 2014
[27]

C. Jose, T. Moutakanni, D. Kang, F. Baldassarre, T. Darcet, H. Xu, D. Li, M. Szafraniec, M. Rama- monjisoa, M. Oquab, O. Siméoni, H. V. Vo, P. Labatut, P. Bojanowski, DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment, in: CVPR, 2025

work page 2025

[1] [1]

K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: CVPR, 2016

work page 2016

[2] [2]

Simonyan, A

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in: ICLR, 2015

work page 2015

[3] [3]

Szegedy, W

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going Deeper With Convolutions, in: CVPR, 2015

work page 2015

[4] [4]

Caron, H

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, A. Joulin, Emerging Properties in Self-Supervised Vision Transformers, in: ICCV, 2021

work page 2021

[5] [5]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, P. Bojanowski, DINOv2: Learning Robust Visual Features without Supervisio...

work page 2024

[6] [6]

DINOv3

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al., Dinov3, arXiv preprint arXiv:2508.10104 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning Transferable Visual Models From Natural Language Supervision, in: ICML, 2021

work page 2021

[8] [8]

W. R. Tan, C. S. Chan, H. E. Aguirre, K. Tanaka, Improved ArtGAN for Conditional Synthesis of Natural Image and Artwork, IEEE TIP (2018)

work page 2018

[9] [9]

M. Tan, Q. Le, EfficientNetV2: Smaller Models and Faster Training, in: ICML, 2021

work page 2021

[10] [10]

Recognizing image style,

S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, H. Winnemoeller, Recognizing Image Style, arXiv preprint arXiv:1311.3715 (2013)

work page arXiv 2013

[11] [11]

R. S. Arora, A. Elgammal, Towards automated classification of fine-art painting style: A comparative study, in: ICPR, 2012

work page 2012

[12] [12]

Zujovic, L

J. Zujovic, L. Gandy, S. Friedman, B. Pardo, T. N. Pappas, Classifying paintings by artistic genre: An analysis of features & classifiers, in: 2009 IEEE international workshop on multimedia signal processing, 2009

work page 2009

[13] [13]

Cetinic, T

E. Cetinic, T. Lipic, S. Grgic, Fine-tuning Convolutional Neural Networks for fine art classification, Expert Systems with Applications (2018)

work page 2018

[14] [14]

Y. Hong, J. Kim, Art painting detection and identification based on deep learning and image local features, Multimedia Tools and Applications (2019)

work page 2019

[15] [15]

M. V. Conde, K. Turgutlu, CLIP-Art: Contrastive Pre-Training for Fine-Grained Art Classification, in: CVPR Workshops, 2021

work page 2021

[16] [16]

Zhang, C

C. Zhang, C. Kaeser-Chen, G. Vesom, J. Choi, M. Kessler, S. Belongie, The iMet Collection 2019 Challenge Dataset, 2019

work page 2019

[17] [17]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, arXiv preprint arXiv:2001.08361 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2001

[18] [18]

Grill, F

J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning, Advances in neural information processing systems (2020)

work page 2020

[19] [19]

Pérez-García, H

F. Pérez-García, H. Sharma, S. Bond-Taylor, K. Bouzid, V. Salvatelli, M. Ilse, S. Bannur, D. C. Castro, A. Schwaighofer, M. P. Lungren, M. T. Wetscherek, N. Codella, S. L. Hyland, J. Alvarez-Valle, O. Oktay, Exploring scalable medical image encoders beyond text supervision, Nature Machine Intelligence (2025)

work page 2025

[20] [20]

G. M. Çökmez, Y. Zhang, C. Schroers, T. O. Aydin, CLIP-Fusion: A Spatio-Temporal Quality Metric for Frame Interpolation, in: WACV, 2025

work page 2025

[21] [21]

G. Wu, J. Chen, W. Zhang, R. Wang, Feature Adaptation with CLIP for Few-shot Classification, in: Proceedings of the 5th ACM International Conference on Multimedia in Asia, 2023

work page 2023

[22] [22]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: CVPR, 2009

work page 2009

[23] [23]

Douze, A

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The Faiss Library, IEEE Transactions on Big Data (2025)

work page 2025

[24] [24]

Srivastava, G

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, The journal of machine learning research (2014)

work page 2014

[25] [25]

Loshchilov, F

I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, in: ICLR, 2019

work page 2019

[26] [26]

D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, CoRR (2014)

work page 2014

[27] [27]

C. Jose, T. Moutakanni, D. Kang, F. Baldassarre, T. Darcet, H. Xu, D. Li, M. Szafraniec, M. Rama- monjisoa, M. Oquab, O. Siméoni, H. V. Vo, P. Labatut, P. Bojanowski, DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment, in: CVPR, 2025

work page 2025