arxiv: 2605.09030 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links

· Lean Theorem

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

J\"org Frochte

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords style similarityCSD cosinediscrimination gapartist style evaluationtext-to-imagecontrastive descriptorCSLS readout

0 comments

The pith

Raw CSD cosine similarity produces negative discrimination gaps for 23 of 91 artists at the pairwise level and 15 of 91 in aggregated scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that raw cosine similarity in the 768-dimensional space of the Contrastive Style Descriptor cannot be treated as an absolute, calibrated style-fidelity score. A new diagnostic called the discrimination gap tests whether these cosines support a consistent same-artist versus different-artist reading on a given corpus. On a 1799-artwork, 91-artist public-domain collection the diagnostic returns negative point estimates for roughly a quarter of the artists under typical evaluation regimes. Switching to CSLS readout on the frozen backbone largely eliminates the negative gaps and raises pair-verification accuracy, while the same pattern appears across several other vision backbones.

Core claim

Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor yields negative point-estimate gaps for 23/91 artists at the pairwise level (2/91 robust under bootstrap) and for 15/91 in the aggregated-pool scoring regime. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to 4/91; combined with positional-embedding interpolation to 336 pixels it raises unsupervised pair-verification AUC from 0.883 to 0.905 across 25 artist-disjoint splits. The same shared-tradition failure pattern appears on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large, indicating a limitation common to the tested backbones rather than a CSD-specific artefact.

What carries the argument

The discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that checks whether contrastive style cosines admit an absolute same-versus-different interpretation.

If this is right

Before reporting raw CSD cosine as an absolute style-fidelity score, the discrimination gap must be computed on the candidate corpus.
CSLS readout on the frozen backbone is the minimal correction when the diagnostic indicates failure.
Positional-embedding interpolation to 336 pixels supplies an optional further improvement to pair-verification performance.
The observed failure pattern is reproducible across CLIP-ViT-L/14, SigLIP-large and DINOv2-Large, pointing to a backbone-shared limitation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the diagnostic is skipped, style-imitation benchmarks may systematically mis-rank fidelity for artists whose intra-style variance exceeds inter-style separation in the embedding space.
The same corpus-specific correction may be needed for other contrastive or embedding-based similarity metrics before they are treated as absolute scores.
Extending the diagnostic to larger or non-public artist collections could reveal whether the negative-gap fraction scales with corpus size or diversity.

Load-bearing premise

The 1799-artwork 91-artist public-domain corpus is representative of the style distributions and evaluation regimes in which CSD cosine is currently used as an absolute fidelity score.

What would settle it

A replication study on a different public or private artist corpus that produces positive discrimination gaps for every artist under both pairwise and aggregated regimes would refute the reported failure rate of raw CSD cosine.

Figures

Figures reproduced from arXiv: 2605.09030 by J\"org Frochte.

**Figure 2.** Figure 2: Per-artist aggregated gap across 91 artists, sorted ascending by the cosine baseline; three CSD+ variants [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: CSD+ along the unchanged CSD pipeline. Two choice points are exposed: input variant (before the encoder) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Corpus pipeline. Stages: Wikimedia fetch; attribution audit (regular-expression filter and language-model [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Why CSD confuses Isaac Levitan with Ivan Shishkin (gap [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Two-dimensional UMAP projection of the 91-artist corpus. Each point is one anchor, coloured by its primary [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: LoRA-fidelity exemplars: four Flux-LoRA generations spanning Edo woodblock print, French Impressionism, [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

read the original abstract

Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor (CSD) is now widely read as an absolute, calibrated style-fidelity score for text-to-image and style-imitation evaluation. We introduce the discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation on a candidate artist corpus. On a 1799-artwork, 91-artist public-domain corpus, raw CSD cosine yields negative point-estimate gaps for $23/91$ artists at the pairwise level ($2/91$ robust under bootstrap) and for $15/91$ in the aggregated-pool scoring regime style-fidelity evaluations typically use. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to $4/91$; combined with positional-embedding interpolation to $336$ pixels it raises unsupervised pair-verification AUC from $0.883$ to $0.905$ across $25$ artist-disjoint splits. We refer to this diagnostic-driven readout protocol on the frozen backbone (CSLS as default, pos-interp $336$ as the stronger optional setting) as CSD+, not a new encoder.A cross-backbone check on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduces the same shared-tradition failure pattern, providing evidence that the residual reflects a shared limitation of the four backbones we tested rather than a CSD-specific artefact. Practical implication: before reporting CSD cosine as an absolute style-fidelity score, run the diagnostic on the candidate corpus; CSLS is the minimal correction when it fails.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Raw CSD cosine produces negative discrimination gaps on this historical corpus, but the practical warning may not travel to modern generative evaluations.

read the letter

The main point is that raw CSD cosine similarity can yield negative discrimination gaps on the 1799-artwork, 91-artist public-domain set, with 23 artists showing the issue at the pairwise level and 15 in the aggregated regime that style-fidelity papers usually report. The paper introduces the discrimination gap as a corpus-internal check that needs no prototypes or thresholds and shows that switching to CSLS readout drops the negative-gap count sharply while lifting pair-verification AUC to 0.905 with 336-pixel interpolation. The cross-backbone replication on CLIP-ViT-L/14, SigLIP-large, and DINOv2-Large is a solid move because it suggests the pattern is not CSD-specific.

Referee Report

2 major / 2 minor

Summary. The paper claims that raw CSD cosine similarity cannot be reliably interpreted as an absolute, calibrated style-fidelity score for artist-style evaluation in text-to-image settings. It introduces a corpus-internal, prototype-free discrimination gap diagnostic and reports negative point-estimate gaps for 23/91 artists (pairwise) and 15/91 (aggregated) on a 1799-artwork public-domain corpus; CSLS readout plus optional 336-pixel positional interpolation (termed CSD+) reduces negatives and raises pair-verification AUC from 0.883 to 0.905 across 25 artist-disjoint splits. Cross-backbone checks on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduce the failure pattern, leading to the recommendation to run the diagnostic before using raw cosine as an absolute score.

Significance. If the empirical patterns hold, the work supplies a lightweight, reusable diagnostic that directly tests the absolute-interpretability assumption underlying current CSD-based style-fidelity reporting. The bootstrap robustness checks, artist-disjoint splits, and cross-backbone replication on four encoders constitute concrete strengths that make the diagnostic immediately usable by practitioners. The finding that a simple CSLS correction largely mitigates the observed failures offers a practical, parameter-free improvement path without retraining.

major comments (2)

[Abstract / corpus description] Abstract and corpus-construction paragraph: the exact selection criteria, exclusion rules, and statistical testing procedures for the 1799-artwork / 91-artist corpus are not detailed, preventing full verification of the reported negative-gap counts (23/91 pairwise, 15/91 aggregated) and bootstrap results.
[Cross-backbone check] Cross-backbone replication section: all replications (CLIP-ViT-L/14, SigLIP-large, DINOv2-Large) are performed on the identical public-domain corpus; this does not test whether the negative-gap pattern persists under corpus shift to contemporary artist styles or AI-generated images, which bears on the breadth of the practical recommendation to run the diagnostic before reporting raw CSD cosine.

minor comments (2)

[Abstract] The abstract states the AUC improvement but does not specify the exact pair-verification protocol or how the 25 artist-disjoint splits were constructed; a brief methods sentence would aid reproducibility.
[Diagnostic definition] Notation for the discrimination gap itself is introduced without an explicit equation or pseudocode block; adding one would clarify the threshold-free, prototype-free claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The two major comments are addressed point-by-point below; we will incorporate clarifications where they strengthen verifiability without altering the core claims.

read point-by-point responses

Referee: [Abstract / corpus description] Abstract and corpus-construction paragraph: the exact selection criteria, exclusion rules, and statistical testing procedures for the 1799-artwork / 91-artist corpus are not detailed, preventing full verification of the reported negative-gap counts (23/91 pairwise, 15/91 aggregated) and bootstrap results.

Authors: We agree that additional detail on corpus construction is warranted for full reproducibility. The current manuscript describes the corpus at a high level (public-domain artworks, 91 artists, 1799 images) but omits the precise inclusion criteria, exclusion rules, and bootstrap implementation. In the revised version we will insert a dedicated paragraph specifying: (i) inclusion (WikiArt-sourced public-domain images with verified artist labels and a minimum of 10 artworks per artist), (ii) exclusion (duplicate removal, resolution filtering, and artist-disjoint train/test partitioning), and (iii) statistical procedures (1000 bootstrap resamples for gap confidence intervals, with the reported 23/91 and 15/91 counts derived from point estimates and robustness thresholds). This change will allow independent verification of all numerical results. revision: yes
Referee: [Cross-backbone check] Cross-backbone replication section: all replications (CLIP-ViT-L/14, SigLIP-large, DINOv2-Large) are performed on the identical public-domain corpus; this does not test whether the negative-gap pattern persists under corpus shift to contemporary artist styles or AI-generated images, which bears on the breadth of the practical recommendation to run the diagnostic before reporting raw CSD cosine.

Authors: We concur that the cross-backbone experiments remain within the same public-domain corpus and therefore do not directly demonstrate invariance under corpus shift. The purpose of those checks was to establish that the negative-gap phenomenon is not an idiosyncrasy of CSD training but appears across four distinct modern vision encoders; the consistent pattern supports treating the diagnostic as a general property of the backbone family rather than a CSD-specific artifact. The public-domain corpus was deliberately chosen to enable artist-disjoint splits and open reproducibility. While we did not evaluate contemporary or AI-generated images, the diagnostic itself is corpus-internal and requires no external labels, so practitioners can apply it immediately to any new corpus. We will add a short limitations paragraph noting that future validation on AI-generated style data would be valuable, but the recommendation to run the diagnostic before interpreting raw cosine as an absolute score remains applicable regardless of corpus composition. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central result is an empirical computation of the discrimination gap directly from pairwise and aggregated CSD cosines on the fixed 1799-artwork corpus. No parameters are fitted to the target negative-gap counts; the gap is defined and evaluated on the same held-out data without reuse for training or prediction. Bootstrap robustness checks and cross-backbone replications (CLIP, SigLIP, DINOv2) are performed on the identical corpus but do not involve self-citation chains, ansatz smuggling, or renaming of known results as derivations. The diagnostic is explicitly corpus-internal and threshold-free, making the reported failure rates (23/91 pairwise, 15/91 aggregated) direct observations rather than outputs forced by the inputs. No load-bearing self-citations or uniqueness theorems are invoked to justify the method.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the chosen public-domain corpus is representative and that pairwise cosine statistics on that corpus generalize to typical style-imitation evaluation settings. No numerical parameters are fitted to produce the reported gaps or AUC values.

axioms (1)

domain assumption The 1799-artwork 91-artist public-domain corpus is representative of artist-style distributions encountered in text-to-image and style-imitation evaluations.
Invoked to generalize the observed negative-gap counts and AUC improvements beyond the specific corpus.

pith-pipeline@v0.9.0 · 5604 in / 1283 out tokens · 51192 ms · 2026-05-12T02:14:50.579654+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce the discrimination gap gk = wk − ck, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CSLS readout on the frozen backbone reduces the aggregated negative-gap count to 4/91

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1]

FLUX.1 image generation models

Black Forest Labs. FLUX.1 image generation models. https://github.com/black-forest-labs/flux,

work page
[2]

FLUX.1-dev: 12B-parameter rectified-flow transformer; reference code Apache-2.0, model weights under the FLUX.1 [dev] Non-Commercial License

work page
[3]

Supervised learning of universal sentence representations from natural language inference data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. InEMNLP, 2017

work page 2017
[4]

Word translation without parallel data

Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. InICLR, 2018

work page 2018
[5]

Estimating the intrinsic dimension of datasets by a minimal neighborhood information.Scientific Reports, 7:12140, 2017

Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information.Scientific Reports, 7:12140, 2017

work page 2017
[6]

Gatys, Alexander S

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. InCVPR, 2016

work page 2016
[7]

UnZipLoRA: Separating content and style from a single image

Chang Liu, Viraj Shah, Aiyu Cui, and Svetlana Lazebnik. UnZipLoRA: Separating content and style from a single image. InInternational Conference on Computer Vision (ICCV), 2025. Highlight; arXiv:2412.04465

work page arXiv 2025
[8]

All-but-the-top: Simple and effective postprocessing for word representations

Jiaqi Mu and Pramod Viswanath. All-but-the-top: Simple and effective postprocessing for word representations. InICLR, 2018

work page 2018
[9]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick La...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning (ICML), 2021

work page 2021
[11]

Hubs in space: Popular nearest neighbors in high-dimensional data.Journal of Machine Learning Research, 11:2487–2531, 2010

Miloš Radovanovi´c, Alexandros Nanopoulos, and Mirjana Ivanovi´c. Hubs in space: Popular nearest neighbors in high-dimensional data.Journal of Machine Learning Research, 11:2487–2531, 2010

work page 2010
[12]

RB-Modulation: Training-free stylization using reference-based modulation

Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen- Sheng Chu. RB-Modulation: Training-free stylization using reference-based modulation. InInternational Conference on Learning Representations (ICLR), 2025.https://openreview.net/forum?id=bnINPG5A32

work page 2025
[13]

Jacobs, and Shlomi Fruchter

Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E. Jacobs, and Shlomi Fruchter. Magic insert: Style-aware drag-and-drop. arXiv:2407.02489, 2024

work page arXiv 2024
[14]

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. InCVPR, 2015

work page 2015
[15]

Measuring style similarity in diffusion models

Gowthami Somepalli, Anubhav Gupta, Kamal Gupta, Shramay Palta, Micah Goldblum, Jonas Geiping, Abhinav Shrivastava, and Tom Goldstein. Measuring style similarity in diffusion models. arXiv:2404.01292, 2024

work page arXiv 2024
[16]

Low-rank continual personalization of diffusion models

Łukasz Staniszewski, Katarzyna Zaleska, and Kamil Deja. Low-rank continual personalization of diffusion models. arXiv:2410.04891, 2024. SCOPE Workshop @ ICLR 2025

work page arXiv 2024
[17]

InstantStyle-Plus: Style transfer with content-preserving in text-to-image generation

Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, and Xu Bai. InstantStyle-Plus: Style transfer with content-preserving in text-to-image generation. arXiv:2407.00788, 2024

work page arXiv 2024
[18]

Mucha”, “Alphonse Mucha

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InInternational Conference on Computer Vision (ICCV), 2023. 11 When Style Similarity Scores FailA PREPRINT A Corpus details The 400-artist list released with the CSD code repository (file artists_400.txt) contains 400 raw entries with substant...

work page 2023