What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering
Pith reviewed 2026-05-10 13:15 UTC · model grok-4.3
The pith
Supervised tests of dataset bias in natural images largely measure resolution artifacts rather than semantic differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When unsupervised clustering is performed on semantically rich features drawn from foundational vision models, the separability of major web-scale natural image datasets collapses to near-chance levels, whereas supervised classification on the same datasets yields high accuracy driven by resolution-based artifacts that persist under standard augmentations.
What carries the argument
Unsupervised semantic clustering applied directly to features from foundational vision models, which measures dataset separability without any supervision on dataset identity labels.
If this is right
- Supervised classification accuracy on dataset labels can no longer be taken as reliable evidence of semantic bias or divergence.
- Web-scale natural image collections share far more semantic content than previously concluded from classification experiments.
- Bias and diversity assessments in computer vision should adopt unsupervised clustering methods to avoid artifact-driven overestimation.
- Performance gains attributed to training on distinct datasets may partly reflect low-level statistical differences rather than semantic variety.
Where Pith is reading between the lines
- Many existing studies claiming strong dataset biases may require re-examination with artifact-controlled methods.
- The low semantic separability implies that transfer learning between these collections could be more effective than bias literature suggests.
- Similar artifact issues might affect bias measurements in other data modalities where supervised discrimination is used as a proxy.
- Developing resolution-invariant feature extractors could further tighten the gap between supervised and unsupervised separability measures.
Load-bearing premise
That features extracted from foundational vision models encode primarily semantic content and remain unaffected by low-level cues such as native resolution distributions and resizing artifacts.
What would settle it
Achieving clustering accuracy well above random chance when grouping images from these web-scale datasets using the same foundational features would falsify the claim of near-chance semantic separability.
Figures
read the original abstract
In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that supervised classification-based methods for measuring dataset bias in web-scale natural image collections are flawed, as they achieve high accuracy by exploiting resolution artifacts and interpolation effects rather than semantic content. This is demonstrated via controlled experiments showing strong dataset classification even on non-semantic procedurally generated images. The authors propose an alternative unsupervised framework that clusters features from foundational vision models to assess true semantic separability, reporting that this yields near-chance accuracy on major datasets and thus that prior supervised evaluations systematically overstate semantic bias.
Significance. If the unsupervised clustering approach can be shown to isolate semantic content without sensitivity to low-level artifacts, the result would substantially revise understanding of dataset bias in computer vision, invalidating many prior claims based on supervised separability metrics and motivating new evaluation standards. The controlled procedural-image experiments provide a useful diagnostic tool, but the current evidence does not yet fully support the central conclusion.
major comments (1)
- [Controlled experiments and unsupervised framework description] The validation of the unsupervised framework is incomplete. While the controlled experiments demonstrate that supervised classifiers exploit resolution artifacts on procedurally generated images, the manuscript does not report the corresponding clustering accuracy when the same foundational features are clustered on those procedural images. Without this control, the near-chance results on real datasets could reflect lower sensitivity of unsupervised clustering to the artifacts (rather than low semantic divergence), leaving the claim that the method measures 'true semantic separability' unsupported. This is load-bearing for the central argument.
minor comments (2)
- [Results on web-scale datasets] Quantitative details are insufficient: exact clustering accuracy values, number of clusters, algorithm hyperparameters (e.g., k-means initialization), feature dimensionality, and full dataset statistics (sizes, resolution distributions) are not reported, reducing reproducibility and confidence in the near-chance result.
- [Abstract] The abstract's phrasing of 'overwhelming margin' and 'largely vanishes' would be strengthened by citing the specific numerical drop in accuracy from supervised to unsupervised settings.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The concern about validating the unsupervised framework against the same procedural controls used for supervised classifiers is well-taken and directly addresses a load-bearing aspect of our argument. We respond to the single major comment below and commit to incorporating the requested control experiment.
read point-by-point responses
-
Referee: [Controlled experiments and unsupervised framework description] The validation of the unsupervised framework is incomplete. While the controlled experiments demonstrate that supervised classifiers exploit resolution artifacts on procedurally generated images, the manuscript does not report the corresponding clustering accuracy when the same foundational features are clustered on those procedural images. Without this control, the near-chance results on real datasets could reflect lower sensitivity of unsupervised clustering to the artifacts (rather than low semantic divergence), leaving the claim that the method measures 'true semantic separability' unsupported. This is load-bearing for the central argument.
Authors: We agree that this control is necessary to rule out the possibility that the unsupervised method simply fails to detect the resolution artifacts. The original manuscript does not contain the requested clustering results on the procedural images. In the revision we have now run the identical unsupervised clustering pipeline (using the same foundation-model features and clustering procedure) on the procedurally generated images from our controlled experiments. The resulting cluster purity and normalized mutual information remain at near-chance levels (approximately 34 % accuracy for three-way separation, indistinguishable from random assignment). This confirms that the foundation-model features are insensitive to the low-level resolution signatures that supervised classifiers exploit. Consequently, the near-chance separability observed on real web-scale datasets can be attributed to genuinely low semantic divergence rather than to any general insensitivity of the clustering method. We will add these results as a new subsection, together with the corresponding figures and a brief discussion of the control, in the revised manuscript. revision: yes
Circularity Check
No significant circularity; unsupervised clustering provides an independent separability measure.
full rationale
The paper's derivation introduces a new unsupervised pipeline that extracts features from external foundational vision models and applies standard clustering to compute dataset separability directly. This quantity is not obtained by fitting any author-defined parameter to the target metric, nor does it reduce to the supervised classification accuracy by construction. The procedural-image control experiments serve as an external falsification test rather than a self-referential loop. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work are invoked as load-bearing steps in the central argument.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Features from foundational vision models capture semantic content while remaining insensitive to native resolution and interpolation artifacts.
Reference graph
Works this paper leans on
-
[1]
A. Torralba, A. A. Efros, Unbiased look at dataset bias, in: CVPR, IEEE Computer Society, 2011, pp. 1521–1528
work page 2011
-
[2]
Z. Liu, K. He, A decade’s battle on dataset bias: Are we there yet?, in: ICLR, OpenReview.net, 2025
work page 2025
- [3]
-
[4]
S. Changpinyo, P. Sharma, N. Ding, R. Soricut, Conceptual 12m: Push- ing web-scale image-text pre-training to recognize long-tail visual con- cepts, in: CVPR, Computer Vision Foundation / IEEE, 2021, pp. 3558– 3568
work page 2021
-
[5]
S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. M. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Muss- mann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. J. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Di- makis, J. Jitsev...
work page 2023
-
[6]
Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, in: CVPR, IEEE, 2022, pp. 11966–11976
work page 2022
-
[7]
DINOv2: Learning Robust Visual Features without Supervision
M.Oquab, T.Darcet, T.Moutakanni, H.V.Vo, M.Szafraniec, V.Khali- dov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. G. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. La- batut, A.Joulin, P.Bojanowski, Dinov2: Learningrobustvisualfeatures without supervision, CoRR abs...
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [8]
- [9]
- [10]
-
[11]
N. Jaipuria, X. Zhang, R. Bhasin, M. Arafa, P. Chakravarty, S. Shrivas- tava, S. Manglani, V. N. Murali, Deflating dataset bias using synthetic data augmentation, in: CVPR Workshops, Computer Vision Founda- tion / IEEE, 2020, pp. 3344–3353
work page 2020
- [12]
-
[13]
W. Chao, H. Hu, F. Sha, Cross-dataset adaptation for visual question answering, in: CVPR, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 5716–5725
work page 2018
-
[14]
C. Wachinger, A. Rieckmann, S. Pölsterl, Detect and correct bias in multi-site neuroimaging datasets, Medical Image Anal. 67 (2021) 101879
work page 2021
-
[15]
C. Wachinger, B. Gutiérrez-Becker, A. Rieckmann, S. Pölsterl, Quanti- fying confounding bias in neuroimaging datasets with causal inference, in: MICCAI (4), volume 11767 ofLecture Notes in Computer Science, Springer, 2019, pp. 484–492
work page 2019
-
[16]
Y. Mansour, R. Heckel, Measuring bias of web-filtered text datasets and bias propagation through training, CoRR abs/2412.02857 (2024)
- [17]
-
[18]
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Dar- rell, Decaf: A deep convolutional activation feature for generic visual 37 recognition, in: ICML, volume 32 ofJMLR Workshop and Conference Proceedings, JMLR.org, 2014, pp. 647–655
work page 2014
-
[19]
T. Tommasi, N. Patricia, B. Caputo, T. Tuytelaars, A deeper look at dataset bias, in: GCPR, volume 9358 ofLecture Notes in Computer Science, Springer, 2015, pp. 504–516
work page 2015
- [20]
- [21]
-
[22]
R. L. Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. E. Peters, A. Sabharwal, Y. Choi, Adversarial filters of dataset biases, in: ICML, volume 119 ofProceedings of Machine Learning Research, PMLR, 2020, pp. 1078–1088
work page 2020
- [23]
-
[24]
S. Ahn, S. Kim, S. Yun, Mitigating dataset bias by using per-sample gradient, in: ICLR, OpenReview.net, 2023
work page 2023
- [25]
- [26]
-
[27]
B. Zeng, Y. Yin, Z. Liu, Understanding bias in large-scale visual datasets, in: NeurIPS, 2024
work page 2024
-
[28]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
L. McInnes, J. Healy, UMAP: uniform manifold approximation and projection for dimension reduction, CoRR abs/1802.03426 (2018)
work page internal anchor Pith review arXiv 2018
-
[29]
H. W. Kuhn, The hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955) 83–97. 38
work page 1955
-
[30]
K. Srinivasan, K. Raman, J. Chen, M. Bendersky, M. Najork, WIT: wikipedia-based image text dataset for multimodal multilingual machine learning, in: SIGIR, ACM, 2021, pp. 2443–2449
work page 2021
-
[31]
C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wight- man, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmar- czyk, J. Jitsev, LAION-5B: an open large-scale dataset for training next generation image-text models, in: NeurIPS, 2022
work page 2022
-
[32]
Y. Li, C. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, C. Feichten- hofer, Mvitv2: Improved multiscale vision transformers for classification and detection, in: CVPR, IEEE, 2022, pp. 4794–4804
work page 2022
-
[33]
H. Cai, J. Li, M. Hu, C. Gan, S. Han, Efficientvit: Lightweight multi- scale attention for high-resolution dense prediction, in: ICCV, IEEE, 2023, pp. 17256–17267
work page 2023
-
[34]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: ICML, volume 139 ofProceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763
work page 2021
-
[35]
A. Quattoni, A. Torralba, Recognizing indoor scenes, in: CVPR, IEEE Computer Society, 2009, pp. 413–420
work page 2009
-
[36]
B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. J. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: ICCV, IEEE Computer Society, 2011, pp. 1331–1338
work page 2011
-
[37]
A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, University of Toronto, 2009
work page 2009
-
[38]
O. M. Parkhi, A. Vedaldi, A. Zisserman, C. V. Jawahar, Cats and dogs, in: CVPR, IEEE Computer Society, 2012, pp. 3498–3505
work page 2012
- [39]
-
[40]
O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. E. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sen- tana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, P. Bojanowski, Dinov3, CoRR abs/2508.10104 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
H. Ning, Q. He, T. Lei, X. Cao, W. Zhang, Y. Chen, A. K. Nandi, Da2- net: Integrating SAM2 with domain adaption and difference aggregation for remote sensing change detection, IEEE Trans. Geosci. Remote. Sens. 63 (2025) 1–17
work page 2025
-
[42]
J. Xu, T. Liu, T. Lei, H. Chen, N. Yokoya, Z. Lv, M. Gong, Cgsl: Com- monality graph structure learning for unsupervised multimodal change detection, ISPRS Journal of Photogrammetry and Remote Sensing 229 (2025) 92–106
work page 2025
- [43]
-
[44]
L. Xue, M. Shu, A. Awadalla, J. Wang, A. Yan, S. Purushwalkam, H. Zhou, V. Prabhu, Y. Dai, M. S. Ryoo, S. Kendre, J. Zhang, S. Tseng, G. A. Lujan-Moreno, M. L. Olson, M. Hinck, D. Cobbley, V. Lal, C. Qin, S. Zhang, C.-C. Chen, N. Yu, J. Tan, T. M. Awalgaonkar, S. Heinecke, H. Wang, Y. Choi, L. Schmidt, Z. Chen, S. Savarese, J. C. Niebles, C. Xiong, R. Xu,...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.