Large-scale Classification of Fine-Art Paintings: Learning The Right Metric on The Right Feature
read the original abstract
In the past few years, the number of fine-art collections that are digitized and publicly available has been growing rapidly. With the availability of such large collections of digitized artworks comes the need to develop multimedia systems to archive and retrieve this pool of data. Measuring the visual similarity between artistic items is an essential step for such multimedia systems, which can benefit more high-level multimedia tasks. In order to model this similarity between paintings, we should extract the appropriate visual features for paintings and find out the best approach to learn the similarity metric based on these features. We investigate a comprehensive list of visual features and metric learning approaches to learn an optimized similarity measure between paintings. We develop a machine that is able to make aesthetic-related semantic-level judgments, such as predicting a painting's style, genre, and artist, as well as providing similarity measures optimized based on the knowledge available in the domain of art historical interpretation. Our experiments show the value of using this similarity measure for the aforementioned prediction tasks.
This paper has not been read by Pith yet.
Forward citations
Cited by 10 Pith papers
-
NetTailor: Tuning the Architecture, Not Just the Weights
NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...
-
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
-
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
-
Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition
Insert In Style is a zero-shot framework that disentangles identity, style, and composition via multi-stage training, masked attention, and prior preservation to enable harmonious cross-domain object insertion in images.
-
The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Text-to-image diffusion models exhibit varying degrees of emergent content-style separation in art generation, with content tokens primarily influencing object regions and style tokens affecting backgrounds and textures.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
Linking Art through Human Poses
Human pose similarity matching with spatial verification outperforms standard content-based image retrieval for discovering composition transfers in art on a manually annotated dataset.
-
Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks i...
-
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
Empirical study shows bidirectional but sensitive relationship between compositionality and long-caption understanding in VLMs, promoted by high-quality grounded data and affected by architectural choices like frozen ...
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.