LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models
Pith reviewed 2026-05-22 00:48 UTC · model grok-4.3
The pith
LAION-C introduces six novel image distortions designed to remain out-of-distribution for web-scale vision models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LAION-C consists of six novel distortion types specifically designed to be out-of-distribution even for web-scale datasets such as LAION. Comprehensive evaluation shows these distortions pose significant challenges to state-of-the-art models including MLLMs such as Gemini and GPT-4o. Comparison with lab-quality human data reveals a paradigm shift from humans outperforming models to the best models now matching or outperforming the best human observers.
What carries the argument
The LAION-C benchmark: six novel distortion types that the authors designed to be absent from web-scale training distributions.
If this is right
- Models trained on web-scale data still require advances to handle truly novel distortions.
- Future robustness benchmarks must exclude distortions that already occur in internet-sourced training data.
- Direct comparisons with human observers on these tasks set a new reference point for generalization progress.
- Multimodal models are vulnerable to these distortions in much the same way as specialized vision models.
Where Pith is reading between the lines
- Scaling data volume alone may leave gaps in coverage for certain classes of image transformations.
- The benchmark could serve as a diagnostic tool to guide development of more robust training procedures.
- Analogous construction methods might be used to build OOD tests for video, audio, or multimodal inputs.
Load-bearing premise
The six novel distortion types are truly absent from the training distributions of web-scale datasets such as LAION and therefore constitute genuine OOD cases.
What would settle it
Locating any of the six distortion types in samples from LAION or similar web-scale collections would show they are not genuinely out-of-distribution.
original abstract
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LAION-C as an alternative to saturated ImageNet-era OOD benchmarks such as ImageNet-C. It consists of six novel distortion types explicitly designed to remain out-of-distribution even for web-scale training corpora like LAION. The work reports comprehensive evaluations of state-of-the-art vision models and multimodal LLMs (including Gemini and GPT-4o), presents results from a controlled human psychophysical study, and concludes that the best current models now match or surpass the best human observers on these distortions, indicating a paradigm shift in OOD generalization.
Significance. If the OOD status of the distortions holds, LAION-C supplies a much-needed, forward-looking benchmark that can track genuine robustness progress once older corruption sets have been absorbed into web-scale training data. The direct comparison to lab-quality human performance data is a particular strength, as it grounds model numbers against a reproducible human baseline rather than relying solely on relative model rankings. The work also usefully documents saturation on legacy benchmarks and the resulting need for new test distributions.
major comments (1)
- [Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, and the statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.
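To make the sampling question in this comment concrete, here is a minimal sketch of a zero-hit bound. It is not the authors' verification procedure. It assumes images are drawn uniformly at random from the corpus and that an inspector (human or automated) never misses a distorted image; both are idealizations. The function names `zero_hit_upper_bound` and `samples_needed` are hypothetical.

```python
import math


def zero_hit_upper_bound(n_inspected: int, alpha: float = 0.05) -> float:
    """Exact one-sided (1 - alpha) upper bound on the prevalence of a distortion
    when 0 of n_inspected uniformly sampled images show it.

    Assumption: simple random sampling and a detector with no false negatives.
    """
    if n_inspected <= 0:
        raise ValueError("n_inspected must be positive")
    # Solve (1 - p)^n = alpha for p.
    return 1.0 - alpha ** (1.0 / n_inspected)


def samples_needed(p_max: float, alpha: float = 0.05) -> int:
    """Smallest sample size for which zero hits bounds the prevalence by p_max."""
    if not 0.0 < p_max < 1.0:
        raise ValueError("p_max must lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log1p(-p_max))


if __name__ == "__main__":
    print(zero_hit_upper_bound(10_000))  # close to the "rule of three" value 3 / n
    print(samples_needed(1e-3))
```

Under these assumptions, a clean sample only bounds prevalence from above. It cannot show that a distortion is entirely absent from a corpus of billions of images, which is why the reported sample size and thresholds matter.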
minor comments (2)
- [Abstract and introduction] The phrase 'paradigm shift' is used to characterize the model-human performance reversal; a more measured formulation (e.g., 'reversal on this benchmark') would better reflect that the result is tied to the specific six distortions rather than a universal change in OOD behavior.
- [Evaluation tables] It would be helpful to report per-distortion human and model accuracies alongside aggregate scores so readers can see whether the model-human parity holds uniformly or is driven by particular corruption types.
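The per-distortion reporting requested in the last comment could be sketched as follows. This is an illustration only: the trial format, the function names, and the toy outcomes are hypothetical, and the aggregation rule (mean over intensity levels per distortion, then mean over distortions) is an assumption based on the truncated results caption quoted in the reference graph.

```python
from collections import defaultdict
from statistics import mean


def per_distortion_accuracy(trials):
    """trials: iterable of (distortion, intensity_level, correct) tuples.

    Returns {distortion: accuracy}, where each accuracy is the mean of the
    per-level accuracies, so every intensity level carries equal weight.
    """
    by_cell = defaultdict(list)
    for distortion, level, correct in trials:
        by_cell[(distortion, level)].append(1.0 if correct else 0.0)

    by_distortion = defaultdict(list)
    for (distortion, _level), outcomes in by_cell.items():
        by_distortion[distortion].append(mean(outcomes))

    return {d: mean(level_accs) for d, level_accs in by_distortion.items()}


def aggregate_accuracy(per_distortion):
    """Overall score: unweighted mean across distortions (assumed rule)."""
    return mean(list(per_distortion.values()))


# Toy outcomes, invented purely to exercise the functions.
toy_trials = [
    ("Glitched", 1, True), ("Glitched", 1, True),
    ("Glitched", 2, True), ("Glitched", 2, False),
    ("Vertical Lines", 1, False), ("Vertical Lines", 1, True),
    ("Vertical Lines", 2, False), ("Vertical Lines", 2, False),
]
toy_scores = per_distortion_accuracy(toy_trials)
toy_overall = aggregate_accuracy(toy_scores)
```

Running the same functions on human and model trials would give one row per observer, making it visible whether parity on the aggregate hides a gap on individual distortions.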
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of LAION-C's significance as a forward-looking OOD benchmark and for highlighting the value of the human psychophysical study. We address the major comment on the distortion design and verification section below.
point-by-point responses
-
Referee: [Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, and the statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.
Authors: We agree that the verification procedure merits a more detailed and transparent description to fully substantiate the OOD status of the distortions. The current manuscript emphasizes the design principles chosen to minimize overlap with web-scale data but does not elaborate sufficiently on the empirical checks. In the revised manuscript we will expand the 'Distortion Design and Verification' section to specify the similarity-search method employed, the number of LAION images inspected, the quantitative similarity thresholds applied, and the statistical sampling strategy used. These additions will provide the rigorous account requested and allow readers to better evaluate the likelihood of undetected training-data exposure.
revision: yes
Circularity Check
No circularity: empirical benchmark with independent test data
full rationale
The paper introduces LAION-C as a new benchmark consisting of six novel distortion types and reports empirical model and human performance numbers on it. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claims rest on experimental results from applying existing models to the new test set and a separate psychophysical study, with no reduction of any result to self-citation chains or inputs by construction. The OOD assumption is a design claim subject to external verification rather than a tautological step.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The six chosen distortion families do not appear at scale in LAION or similar web-scraped corpora.
Reference graph
Works this paper leans on
-
[1]
Are we done with ImageNet? arXiv preprint arXiv:2006.07159,
URL https://arxiv.org/abs/2006.07159. Biederman, I. and Ju, G. Surface versus edge-based determinants of visual recognition. Cognitive Psychology, 20(1):38–64,
-
[2]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
URL https://arxiv.org/abs/1704.04861. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR,
-
[3]
Jang, H. and Tong, F. Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks. Nature Communications, 15(1):1989,
-
[4]
URL https://arxiv.org/abs/1908.08016. Kellman, P. J. and Spelke, E. S. Perception of partly occluded objects in infancy. Cognitive Psychology, 15(4):483–524,
-
[5]
The role of ImageNet classes in Fréchet Inception Distance
URL https://arxiv.org/abs/2203.06026. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR,
-
[6]
Does CLIP's generalization performance mainly stem from high train-test similarity?, 2024a
URL https://arxiv.org/abs/2310.09562. Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization. In ICML 2024 Workshop on Foundation Models in the Wild,
-
[7]
Benchmarking robustness in object detection: Autonomous driving when winter is coming
URL https://openreview.net/forum?id=Bc2p8T4V32. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., Bethge, M., and Brendel, W. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484,
-
[8]
URL https://arxiv.org/abs/2208.06366. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML,
-
[9]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,
-
[10]
ResNet strikes back: An improved training procedure in timm,
URL https://arxiv.org/abs/2110.00476. Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences,
-
[11]
URL https://arxiv.org/abs/1605.07146. Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In CVPR,
-
[12]
A. Appendix. A.1. Related Work. OOD generalization ability of vision models. As deep learning has advanced to the point where models can reliably generalize to data that matches their training distribution or even exceed the quality of the original labels (Beyer et al., 2020), OOD-robu...
-
[13]
introduce a dataset of black and white sketches matching the labels and scale of the ImageNet validation set, called ImageNet-Sketch. A more subtle distribution shift, which still caused considerable drops in performance for ImageNet-trained models, was proposed by Recht et al. (2019). They collected ImageNetV2, a new test set for ImageNet that shoul...
-
[14]
has redefined what constitutes standard performance across many visual tasks. These improvements in performance partially stem from architectural innovations and parameter optimization, but were mostly powered by the effective leveraging of unprecedented dataset sizes (Zhai et al., 2022). However, because visual foundation models were trained on web-scale...
-
[15]
and were found to be the best available models for neuronal activity in the primate visual cortex (Yamins et al., 2014), even if not trained for this task. Today, there is a growing body of research dedicated to evaluating the adequacy of neural networks as behavioral models of human core object recognition (Doerig et al., 2023; Schrimpf et al., 2018; Wic...
-
[16]
This figure illustrates the icon layout as displayed to participants during the study
Interface presented to participants. This figure illustrates the icon layout as displayed to participants during the study. The grid is adapted from (Geirhos et al., 2018), while most of the categories and therefore symbols are different. Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, U...
-
[17]
Congratulations! You just earned some extra money!
To encourage responses rather than leaving selections blank, a message was displayed at the top of the screen 0.75 seconds before icon display time ended, prompting participants to make a choice. At the end of each block, if a participant surpassed the 90% accuracy threshold calibrated using internal baseline performance data, they received an encouraging ...
-
[18]
We analyze the agreement of error patterns between different families of vision models (see Tab
Humans and models make different mistakes. We analyze the agreement of error patterns between different families of vision models (see Tab. 11 for a complete list) and human observers. The error consistency (κ) could theoretically achieve a maximum value of 1, but in line with earlier work (Geirhos et al., 2021), the EC values range between 0.15 and 0.45...
-
[19]
Tile sizes at each level. A.4.2. Glitched. The original image undergoes an artistic digital corruption, with horizontal lines overlaying shifted image segments and color channel shifts. Here, a region refers to a randomly selected rectangular area of the image, and a shift denotes the horizontal displacement (left or right) of that region by a certain per...
-
[20]
Glitch parameters at each level. The implementation is inspired by GitHub user “totallynotchase” (T, 2020). A.4.3. Vertical Lines. The original image is transformed through a process of vertical deconstruction. It is first divided into multiple vertical sections, which are further subdivided along the y-axis into small segments called y-steps. In each of ...
-
[21]
Vertical sectioning and step sizes at each level. A.4.4. Geometric Shapes. The original image is overlaid with overlapping geometric figures such as squares, circles, and stars. This visual clutter introduces local noise that obscures the main object, like the Kaleidoscope corruption from (Kaufmann et al., 2019). The number of shapes for each intensity le...
-
[22]
Performance Divergence of Models on LAION-C and ImageNet-C (1k classes). The figure illustrates the scattered performance of models across the ImageNet-C and LAION-C datasets, where a Kendall’s τ coefficient of 0.66 and the shallow slope indicate a dispersed performance on LAION-C. To provide a clearer trend and to better visualize the dispersion, we suppl...
-
[23]
Numbers show top-1 accuracy in percent
LAION-C benchmark results. Numbers show top-1 accuracy in percent. ImageNet refers to model accuracy on the (uncorrupted) ImageNet validation set (values sourced from the timm leaderboard (Wightman, 2024)). For each corruption, we report mean top-1 accuracy across all intensity levels, with LAION-C as the overall benchmark metric (averaged across corrup...
-
[24]
Abbreviation | Full Model Name | Description
EVA-G-P14-560-M30M-IN22K | eva_giant_patch14_560.m30m_ft_in22k_in1k | EVA giant model, patch size 14, pre-trained with masked image modeling (MIM) on a Merged-30M dataset, fine-tuned on ImageNet-22k and ImageNet-1k (Fang et al., 2023).
EVA02-L-P14-448-MIM-M38M-IN22K | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k | EVA02 large mode...
-
[25]
and can be traced back to its original ImageNet class label. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No information is missing...
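Entry [18] above refers to error consistency (κ) between human and model observers. As a minimal sketch, and assuming the common definition of error consistency as Cohen's kappa computed over trial-level correct/incorrect outcomes, it can be calculated as below. The function name and input format are hypothetical, and the paper's own analysis code may differ.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa over per-trial correctness of two observers.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    trial, in the same trial order (assumed input format).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    a = [bool(x) for x in correct_a]
    b = [bool(x) for x in correct_b]
    n = len(a)

    acc_a = sum(a) / n
    acc_b = sum(b) / n

    # Observed agreement: both right or both wrong on the same trial.
    c_obs = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected from the two accuracies alone, if errors were independent.
    c_exp = acc_a * acc_b + (1.0 - acc_a) * (1.0 - acc_b)

    if c_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (c_obs - c_exp) / (1.0 - c_exp)
```

A value of 1 means the two observers err on exactly the same trials, while 0 means their overlap in errors is no larger than their accuracies alone would predict.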