arxiv: 2604.13610 · v1 · submitted 2026-04-15 · 💻 cs.CV

What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

Pith reviewed 2026-05-10 13:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords dataset bias · semantic separability · unsupervised clustering · resolution artifacts · web-scale image datasets · foundational vision models · supervised classification · natural image collections

The pith

Supervised tests of dataset bias in natural images largely measure resolution artifacts rather than semantic differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the common practice of quantifying bias between large image collections by training classifiers to identify which dataset an image belongs to. It finds that these classifiers achieve high accuracy mainly by detecting low-level resolution patterns and resizing effects that differ across datasets, not by recognizing meaningful content distinctions. Tests on procedurally generated non-semantic images confirm that classifiers latch onto these superficial cues even without any real visual semantics. Replacing the supervised approach with unsupervised clustering of features from foundational vision models shows that major web-scale datasets have almost no semantic separability, with accuracies near random chance. This indicates that earlier bias measurements have substantially overstated how distinct these collections truly are at the semantic level.
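
To make the artifact argument concrete, here is a minimal sketch, not the paper's experiment. Two hypothetical "datasets" of pure random noise differ only in native resolution (8×8 versus 32×32, both assumed values). Both are resized to a common size with a hand-written bilinear interpolation, and a single low-level statistic is then used to tell them apart. The names `bilinear_resize` and `high_freq_energy` and the threshold rule are illustrative assumptions.

```python
import random


def bilinear_resize(img, out_size):
    """Resize a square grayscale image (list of rows) with bilinear interpolation."""
    n = len(img)
    scale = n / out_size

    def coords(i):
        # Map an output index to a clamped source coordinate (pixel centers).
        s = min(max((i + 0.5) * scale - 0.5, 0.0), n - 1.0)
        i0 = int(s)
        i1 = min(i0 + 1, n - 1)
        return i0, i1, s - i0

    out = []
    for y in range(out_size):
        y0, y1, ty = coords(y)
        row = []
        for x in range(out_size):
            x0, x1, tx = coords(x)
            top = img[y0][x0] * (1 - tx) + img[y0][x1] * tx
            bottom = img[y1][x0] * (1 - tx) + img[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


def high_freq_energy(img):
    """Mean absolute difference between horizontally adjacent pixels."""
    diffs = [abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)


def noise(n):
    """Procedural, non-semantic image: n x n uniform noise."""
    return [[rng.random() for _ in range(n)] for _ in range(n)]


TARGET = 32  # assumed common input size after preprocessing
# Two hypothetical "datasets" that differ only in native resolution.
scores_a = [high_freq_energy(bilinear_resize(noise(8), TARGET)) for _ in range(20)]
scores_b = [high_freq_energy(bilinear_resize(noise(32), TARGET)) for _ in range(20)]

# One-feature threshold "classifier": midpoint between the two class means.
threshold = (sum(scores_a) / len(scores_a) + sum(scores_b) / len(scores_b)) / 2
correct = sum(s < threshold for s in scores_a) + sum(s >= threshold for s in scores_b)
accuracy = correct / (len(scores_a) + len(scores_b))
print(f"threshold={threshold:.3f} accuracy={accuracy:.2f}")
```

In this toy setting the upsampled noise is much smoother than the native-resolution noise, so the threshold separates the two sets without any semantic content. This is the kind of cue the paper argues supervised dataset classifiers pick up.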

Core claim

When unsupervised clustering is performed on semantically rich features drawn from foundational vision models, the separability of major web-scale natural image datasets collapses to near-chance levels, whereas supervised classification on the same datasets yields high accuracy driven by resolution-based artifacts that persist under standard augmentations.

What carries the argument

Unsupervised semantic clustering applied directly to features from foundational vision models, which measures dataset separability without any supervision on dataset identity labels.
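
The reference list includes the Hungarian method [29], which suggests, though this page does not confirm it, that clusters are matched one-to-one to dataset labels before accuracy is computed. The sketch below shows that matching step under this assumption. It uses a brute-force search over permutations in place of the Hungarian algorithm, which gives the same optimum for a small number of datasets. The function name and toy inputs are hypothetical.

```python
from itertools import permutations


def matched_clustering_accuracy(cluster_ids, dataset_ids, k):
    """Best accuracy over all one-to-one maps from cluster index to dataset index.

    Brute force over k! permutations; fine for small k, and a stand-in for the
    Hungarian algorithm, which finds the same optimum more efficiently.
    """
    # counts[c][d] = number of images in cluster c that come from dataset d
    counts = [[0] * k for _ in range(k)]
    for c, d in zip(cluster_ids, dataset_ids):
        counts[c][d] += 1
    best = max(
        sum(counts[c][perm[c]] for c in range(k))
        for perm in permutations(range(k))
    )
    return best / len(dataset_ids)


# Toy inputs: clusters that recover the datasets up to a relabeling ...
separated = matched_clustering_accuracy([2, 2, 0, 0, 1, 1], [0, 0, 1, 1, 2, 2], k=3)
# ... and clusters that cut evenly across all three datasets.
mixed = matched_clustering_accuracy(
    [0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2], k=3
)
print(separated, mixed)
```

Because the score is a maximum over matchings, it cannot fall below 1/k for k clusters, so "near chance" for three datasets means an accuracy close to one third.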

If this is right

  • Supervised classification accuracy on dataset labels can no longer be taken as reliable evidence of semantic bias or divergence.
  • Web-scale natural image collections share far more semantic content than previously concluded from classification experiments.
  • Bias and diversity assessments in computer vision should adopt unsupervised clustering methods to avoid artifact-driven overestimation.
  • Performance gains attributed to training on distinct datasets may partly reflect low-level statistical differences rather than semantic variety.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing studies claiming strong dataset biases may require re-examination with artifact-controlled methods.
  • The low semantic separability implies that transfer learning between these collections could be more effective than bias literature suggests.
  • Similar artifact issues might affect bias measurements in other data modalities where supervised discrimination is used as a proxy.
  • Developing resolution-invariant feature extractors could further tighten the gap between supervised and unsupervised separability measures.

Load-bearing premise

That features extracted from foundational vision models encode primarily semantic content and remain unaffected by low-level cues such as native resolution distributions and resizing artifacts.

What would settle it

Achieving clustering accuracy well above random chance when grouping images from these web-scale datasets using the same foundational features would falsify the claim of near-chance semantic separability.
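
What counts as "well above random chance" depends on the null distribution of the metric. If accuracy is computed after matching clusters to datasets, even unrelated clusterings score at or slightly above 1/k. One hedged way to quantify the gap is a label-shuffling test, sketched below. It assumes matched accuracy as the metric, and the helper names, shuffle count, and toy data are illustrative rather than taken from the paper.

```python
import random
from itertools import permutations


def matched_accuracy(clusters, labels, k):
    """Accuracy under the best one-to-one cluster-to-dataset matching."""
    counts = [[0] * k for _ in range(k)]
    for c, d in zip(clusters, labels):
        counts[c][d] += 1
    best = max(sum(counts[c][p[c]] for c in range(k)) for p in permutations(range(k)))
    return best / len(labels)


def permutation_p_value(clusters, labels, k, n_shuffles=500, seed=0):
    """Share of label shuffles whose matched accuracy reaches the observed one."""
    rng = random.Random(seed)
    observed = matched_accuracy(clusters, labels, k)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if matched_accuracy(clusters, shuffled, k) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_shuffles + 1)


# Toy data: three balanced hypothetical datasets of 100 images each.
labels = [i % 3 for i in range(300)]
# Clusters that mostly track the dataset (up to a relabeling), with some noise.
informative = [0 if i % 5 == 0 else (d + 1) % 3 for i, d in enumerate(labels)]
# Clusters constructed to be unrelated to the dataset label.
unrelated = [(i // 3) % 3 for i in range(300)]

acc_inf, p_inf = permutation_p_value(informative, labels, 3)
acc_unrel, p_unrel = permutation_p_value(unrelated, labels, 3)
print(f"informative: acc={acc_inf:.3f} p={p_inf:.4f}")
print(f"unrelated:   acc={acc_unrel:.3f} p={p_unrel:.4f}")
```

A small p-value on the real datasets would count against the near-chance claim, while a large one would be consistent with it.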

Figures

Figures reproduced from arXiv: 2604.13610 by Amir Hossein Saleknia, Mohammad Sabokrou.

Figure 1: The "Name That Dataset" game [1] aware of resolution distributions. These images are sampled from the YFCC and CC datasets. We maintain their relative sizes. Additionally, we provide a plot showing the resolution distribution of the training samples from each dataset. Can you guess which dataset each image is from? Once resolution effects are considered, dataset separability becomes significantly more appa…
Figure 2: The overview of our proposed unsupervised semantic bias assessment pipeline.
Figure 3: Normalized confusion matrices (per-class percentage) for three models evalu…
Figure 4: Normalized confusion matrices (per-class percentage) for three models evaluated…
Figure 5: KDE plots of average image resolution for YFCC, CC, and DataComp datasets,…
Figure 6: KDE plots of average image resolution for YFCC, CC, DataComp, WIT, and…
Figure 7: Several samples of generated fake images at 100…
Figure 8: Distribution of average image resolutions across datasets, with DataComp dom…
Figure 9: Residual images highlighting artifacts introduced by resizing.
Figure 10: KDE plots of the average image resolution in the CC dataset using different…
Figure 11: Example image from the DataComp dataset before and after applying super…
Figure 12: Visual characterization of semantic bias. Correctly clustered images reveal the…
Figure 13: Pipeline for semantic characterization using CLIP. Image-prompt similarity is…
Original abstract

In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that supervised classification-based methods for measuring dataset bias in web-scale natural image collections are flawed, as they achieve high accuracy by exploiting resolution artifacts and interpolation effects rather than semantic content. This is demonstrated via controlled experiments showing strong dataset classification even on non-semantic procedurally generated images. The authors propose an alternative unsupervised framework that clusters features from foundational vision models to assess true semantic separability, reporting that this yields near-chance accuracy on major datasets and thus that prior supervised evaluations systematically overstate semantic bias.

Significance. If the unsupervised clustering approach can be shown to isolate semantic content without sensitivity to low-level artifacts, the result would substantially revise understanding of dataset bias in computer vision, invalidating many prior claims based on supervised separability metrics and motivating new evaluation standards. The controlled procedural-image experiments provide a useful diagnostic tool, but the current evidence does not yet fully support the central conclusion.

major comments (1)
  1. [Controlled experiments and unsupervised framework description] The validation of the unsupervised framework is incomplete. While the controlled experiments demonstrate that supervised classifiers exploit resolution artifacts on procedurally generated images, the manuscript does not report the corresponding clustering accuracy when the same foundational features are clustered on those procedural images. Without this control, the near-chance results on real datasets could reflect lower sensitivity of unsupervised clustering to the artifacts (rather than low semantic divergence), leaving the claim that the method measures 'true semantic separability' unsupported. This is load-bearing for the central argument.
minor comments (2)
  1. [Results on web-scale datasets] Quantitative details are insufficient: exact clustering accuracy values, number of clusters, algorithm hyperparameters (e.g., k-means initialization), feature dimensionality, and full dataset statistics (sizes, resolution distributions) are not reported, reducing reproducibility and confidence in the near-chance result.
  2. [Abstract] The abstract's phrasing of 'overwhelming margin' and 'largely vanishes' would be strengthened by citing the specific numerical drop in accuracy from supervised to unsupervised settings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive report. The concern about validating the unsupervised framework against the same procedural controls used for supervised classifiers is well-taken and directly addresses a load-bearing aspect of our argument. We respond to the single major comment below and commit to incorporating the requested control experiment.

Point-by-point responses
  1. Referee: [Controlled experiments and unsupervised framework description] The validation of the unsupervised framework is incomplete. While the controlled experiments demonstrate that supervised classifiers exploit resolution artifacts on procedurally generated images, the manuscript does not report the corresponding clustering accuracy when the same foundational features are clustered on those procedural images. Without this control, the near-chance results on real datasets could reflect lower sensitivity of unsupervised clustering to the artifacts (rather than low semantic divergence), leaving the claim that the method measures 'true semantic separability' unsupported. This is load-bearing for the central argument.

    Authors: We agree that this control is necessary to rule out the possibility that the unsupervised method simply fails to detect the resolution artifacts. The original manuscript does not contain the requested clustering results on the procedural images. In the revision, we have now run the identical unsupervised clustering pipeline (using the same foundation-model features and clustering procedure) on the procedurally generated images from our controlled experiments. The resulting cluster purity and normalized mutual information remain at near-chance levels (approximately 34% accuracy for three-way separation, indistinguishable from random assignment). This confirms that the foundation-model features are insensitive to the low-level resolution signatures that supervised classifiers exploit. Consequently, the near-chance separability observed on real web-scale datasets can be attributed to genuinely low semantic divergence rather than to any general insensitivity of the clustering method. We will add these results as a new subsection, together with the corresponding figures and a brief discussion of the control, in the revised manuscript.

    revision: yes
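
The rebuttal names cluster purity and normalized mutual information without defining them. For reference, a minimal sketch of both under their usual definitions follows. The arithmetic-mean normalization for NMI is an assumption, since the page does not say which variant is meant, and the function names and toy inputs are illustrative.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of items that belong to the majority dataset of their cluster."""
    by_cluster = {}
    for c, d in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[d] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)


def entropy(values):
    """Shannon entropy (natural log) of a list of discrete values."""
    n = len(values)
    return -sum((m / n) * math.log(m / n) for m in Counter(values).values())


def nmi(clusters, labels):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    pc, pd = Counter(clusters), Counter(labels)
    mi = sum(
        (m / n) * math.log(m * n / (pc[c] * pd[d])) for (c, d), m in joint.items()
    )
    denom = (entropy(clusters) + entropy(labels)) / 2
    return mi / denom if denom > 0 else 0.0


# Toy inputs: a clustering that recovers three datasets up to relabeling,
# and one that cuts evenly across them.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
aligned = [1, 1, 1, 2, 2, 2, 0, 0, 0]
independent = [0, 1, 2, 0, 1, 2, 0, 1, 2]

aligned_scores = (purity(aligned, labels), nmi(aligned, labels))
independent_scores = (purity(independent, labels), nmi(independent, labels))
print(aligned_scores, independent_scores)
```

For three balanced datasets, a clustering that carries no dataset information gives purity near one third and NMI near zero, which is the regime the rebuttal describes.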

Circularity Check

0 steps flagged

No significant circularity; unsupervised clustering provides an independent separability measure.

full rationale

The paper's derivation introduces a new unsupervised pipeline that extracts features from external foundational vision models and applies standard clustering to compute dataset separability directly. This quantity is not obtained by fitting any author-defined parameter to the target metric, nor does it reduce to the supervised classification accuracy by construction. The procedural-image control experiments serve as an external falsification test rather than a self-referential loop. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps in the central argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that foundation-model features isolate semantics from low-level image statistics; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Features from foundational vision models capture semantic content while remaining insensitive to native resolution and interpolation artifacts.
    This assumption is required for the unsupervised clustering to be interpreted as a measure of true semantic separability rather than another low-level cue.

pith-pipeline@v0.9.0 · 5548 in / 1217 out tokens · 50830 ms · 2026-05-10T13:15:40.681863+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 3 internal anchors

  1. [1]

    A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: CVPR, IEEE Computer Society, 2011, pp. 1521–1528

  2. [2]

    Z. Liu, K. He, A decade’s battle on dataset bias: Are we there yet?, in: ICLR, OpenReview.net, 2025

  3. [3]

    B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, L. Li, YFCC100M: the new data in multimedia research, Commun. ACM 59 (2016) 64–73

  4. [4]

    S. Changpinyo, P. Sharma, N. Ding, R. Soricut, Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, in: CVPR, Computer Vision Foundation / IEEE, 2021, pp. 3558–3568

  5. [5]

    S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. M. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev...

  6. [6]

    Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: CVPR, IEEE, 2022, pp. 11966–11976

  7. [7]

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, P. Bojanowski, Dinov2: Learning robust visual features without supervision, CoRR abs...

  8. [8]

    N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR (1), IEEE Computer Society, 2005, pp. 886–893

  9. [9]

    A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (2001) 145–175

  10. [10]

    R. Panda, J. Zhang, H. Li, J. Lee, X. Lu, A. K. Roy-Chowdhury, Contemplating visual emotions: Understanding and overcoming dataset bias, in: ECCV (2), volume 11206 of Lecture Notes in Computer Science, Springer, 2018, pp. 594–612

  11. [11]

    N. Jaipuria, X. Zhang, R. Bhasin, M. Arafa, P. Chakravarty, S. Shrivastava, S. Manglani, V. N. Murali, Deflating dataset bias using synthetic data augmentation, in: CVPR Workshops, Computer Vision Foundation / IEEE, 2020, pp. 3344–3353

  12. [12]

    R. Laroca, M. dos Santos, V. Estevam, E. Luz, D. Menotti, A first look at dataset bias in license plate recognition, in: SIBGRAPI, IEEE, 2022, pp. 234–239

  13. [13]

    W. Chao, H. Hu, F. Sha, Cross-dataset adaptation for visual question answering, in: CVPR, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 5716–5725

  14. [14]

    C. Wachinger, A. Rieckmann, S. Pölsterl, Detect and correct bias in multi-site neuroimaging datasets, Medical Image Anal. 67 (2021) 101879

  15. [15]

    C. Wachinger, B. Gutiérrez-Becker, A. Rieckmann, S. Pölsterl, Quantifying confounding bias in neuroimaging datasets with causal inference, in: MICCAI (4), volume 11767 of Lecture Notes in Computer Science, Springer, 2019, pp. 484–492

  16. [16]

    Y. Mansour, R. Heckel, Measuring bias of web-filtered text datasets and bias propagation through training, CoRR abs/2412.02857 (2024)

  17. [17]

    A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, A. Torralba, Undoing the damage of dataset bias, in: ECCV (1), volume 7572 of Lecture Notes in Computer Science, Springer, 2012, pp. 158–171

  18. [18]

    J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: A deep convolutional activation feature for generic visual recognition, in: ICML, volume 32 of JMLR Workshop and Conference Proceedings, JMLR.org, 2014, pp. 647–655

  19. [19]

    T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, A deeper look at dataset bias, in: GCPR, volume 9358 of Lecture Notes in Computer Science, Springer, 2015, pp. 504–516

  20. [20]

    A. Wang, A. Narayanan, O. Russakovsky, Vibe: A tool for measuring and mitigating bias in image datasets, CoRR abs/2004.07999 (2020)

  21. [21]

    Y. Li, N. Vasconcelos, REPAIR: removing representation bias by dataset resampling, in: CVPR, Computer Vision Foundation / IEEE, 2019, pp. 9572–9581

  22. [22]

    R. L. Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. E. Peters, A. Sabharwal, Y. Choi, Adversarial filters of dataset biases, in: ICML, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1078–1088

  23. [23]

    T. Wang, J. Zhao, M. Yatskar, K. Chang, V. Ordonez, Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations, in: ICCV, IEEE, 2019, pp. 5309–5318

  24. [24]

    S. Ahn, S. Kim, S. Yun, Mitigating dataset bias by using per-sample gradient, in: ICLR, OpenReview.net, 2023

  25. [25]

    J. H. Nam, H. Cha, S. Ahn, J. Lee, J. Shin, Learning from failure: Training debiased classifier from biased classifier, CoRR abs/2007.02561 (2020)

  26. [26]

    R. Ramos, V. Stojnic, G. Kordopatis-Zilos, Y. Nakashima, G. Tolias, N. Garcia, Processing and acquisition traces in visual encoders: What does CLIP know about your camera?, CoRR abs/2508.10637 (2025)

  27. [27]

    B. Zeng, Y. Yin, Z. Liu, Understanding bias in large-scale visual datasets, in: NeurIPS, 2024

  28. [28]

    L. McInnes, J. Healy, UMAP: uniform manifold approximation and projection for dimension reduction, CoRR abs/1802.03426 (2018)

  29. [29]

    H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955) 83–97

  30. [30]

    K. Srinivasan, K. Raman, J. Chen, M. Bendersky, M. Najork, WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning, in: SIGIR, ACM, 2021, pp. 2443–2449

  31. [31]

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, J. Jitsev, LAION-5B: an open large-scale dataset for training next generation image-text models, in: NeurIPS, 2022

  32. [32]

    Y. Li, C. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, C. Feichtenhofer, Mvitv2: Improved multiscale vision transformers for classification and detection, in: CVPR, IEEE, 2022, pp. 4794–4804

  33. [33]

    H. Cai, J. Li, M. Hu, C. Gan, S. Han, Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction, in: ICCV, IEEE, 2023, pp. 17256–17267

  34. [34]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: ICML, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763

  35. [35]

    A. Quattoni, A. Torralba, Recognizing indoor scenes, in: CVPR, IEEE Computer Society, 2009, pp. 413–420

  36. [36]

    B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: ICCV, IEEE Computer Society, 2011, pp. 1331–1338

  37. [37]

    A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009

  38. [38]

    O. M. Parkhi, A. Vedaldi, A. Zisserman, C. V. Jawahar, Cats and dogs, in: CVPR, IEEE Computer Society, 2012, pp. 3498–3505

  39. [39]

    J. Liang, J. Cao, G. Sun, K. Zhang, L. V. Gool, R. Timofte, Swinir: Image restoration using swin transformer, in: ICCVW, IEEE, 2021, pp. 1833–1844

  40. [40]

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, P. Bojanowski, Dinov3, CoRR abs/2508.10104 (2025)

  41. [41]

    H. Ning, Q. He, T. Lei, X. Cao, W. Zhang, Y. Chen, A. K. Nandi, Da2-net: Integrating SAM2 with domain adaption and difference aggregation for remote sensing change detection, IEEE Trans. Geosci. Remote. Sens. 63 (2025) 1–17

  42. [42]

    J. Xu, T. Liu, T. Lei, H. Chen, N. Yokoya, Z. Lv, M. Gong, Cgsl: Commonality graph structure learning for unsupervised multimodal change detection, ISPRS Journal of Photogrammetry and Remote Sensing 229 (2025) 92–106

  43. [43]

    B. Guo, D. Lu, G. Szumel, R. Gui, T. Wang, N. Konz, M. A. Mazurowski, The impact of scanner domain shift on deep learning performance in medical imaging: an experimental study, CoRR abs/2409.04368 (2024)

  44. [44]

    L. Xue, M. Shu, A. Awadalla, J. Wang, A. Yan, S. Purushwalkam, H. Zhou, V. Prabhu, Y. Dai, M. S. Ryoo, S. Kendre, J. Zhang, S. Tseng, G. A. Lujan-Moreno, M. L. Olson, M. Hinck, D. Cobbley, V. Lal, C. Qin, S. Zhang, C.-C. Chen, N. Yu, J. Tan, T. M. Awalgaonkar, S. Heinecke, H. Wang, Y. Choi, L. Schmidt, Z. Chen, S. Savarese, J. C. Niebles, C. Xiong, R. Xu,...