
arxiv: 2506.16950 · v2 · pith:ZKL5V6FOnew · submitted 2025-06-20 · 💻 cs.CV · cs.LG

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Pith reviewed 2026-05-22 00:48 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords out-of-distribution robustness · image distortion benchmark · web-scale vision models · novel corruptions · human-model comparison · LAION dataset

The pith

LAION-C introduces six novel image distortions that remain out-of-distribution for web-scale vision models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces LAION-C as a replacement for older benchmarks like ImageNet-C, which no longer qualify as out-of-distribution tests because common corruptions appear in today's large web-scraped training sets. The authors design and verify six new distortion types that are absent from such data, then evaluate a range of current models including multimodal systems. A separate psychophysical study measures human performance on the same distortions. The results indicate that top models now reach or exceed the accuracy of the best human observers.

Core claim

LAION-C consists of six novel distortion types specifically designed to be out-of-distribution even for web-scale datasets such as LAION. Comprehensive evaluation shows these distortions pose significant challenges to state-of-the-art models including MLLMs such as Gemini and GPT-4o. Comparison with lab-quality human data reveals a paradigm shift from humans outperforming models to the best models now matching or outperforming the best human observers.

What carries the argument

The LAION-C benchmark consisting of six novel distortion types verified to be absent from web-scale training distributions.

If this is right

  • Models trained on web-scale data still require advances to handle truly novel distortions.
  • Future robustness benchmarks must exclude distortions that already occur in internet-sourced training data.
  • Direct comparisons with human observers on these tasks set a new reference point for generalization progress.
  • Multimodal models are vulnerable to these distortions in much the same way as specialized vision models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling data volume alone may leave gaps in coverage for certain classes of image transformations.
  • The benchmark could serve as a diagnostic tool to guide development of more robust training procedures.
  • Analogous construction methods might be used to build OOD tests for video, audio, or multimodal inputs.

Load-bearing premise

The six novel distortion types are truly absent from the training distributions of web-scale datasets such as LAION and therefore constitute genuine OOD cases.

What would settle it

Locating any of the six distortion types in samples from LAION or similar web-scale collections would show they are not genuinely out-of-distribution.

Figures

Figures reproduced from arXiv: 2506.16950 by Fanfei Li, Robert Geirhos, Roland S. Zimmermann, Thomas Klein, Wieland Brendel.

Figure 1: ImageNet-C corruptions are not out-of-distribution (OOD) for web-scale datasets like LAION-400M. Exemplary corrupted images from ImageNet-C (left) are similar to LAION-400M samples (right). Each row shows example corruptions and dataset images for one ImageNet-C corruption category (Noise, Blur, Weather, Digital). The presence of these distortions in web-scale datasets indicates the need for an OOD benchm…
Figure 2: LAION-C distortions, intended to be OOD even for web-scale datasets. This figure illustrates the six LAION-C distortions at five intensity levels. Following the standard experimental paradigm from psychophysics, our dataset spans from near-perfect to chance-level difficulties, thoroughly testing models and leaving room for future model improvements. Best viewed on screen. … highest intensity level, i.e. no m…
Figure 3: Performance Divergence of Models on LAION-C and ImageNet-C (16 classes). Evaluating models on the 16-class versions of ImageNet-C and LAION-C produces a plateaued performance on ImageNet-C, while LAION-C still yields a high variance across models.
Figure 4: LAION-C offers better resolution of model differences. We tested 9 models pre-trained on LAION2B, evaluating them across all intensity levels if applicable. LAION-C captures a broader variance in model performance, with a standard deviation of ∼27%, compared to an average of ∼10% in other common OOD datasets. Notably, LAION-C is tested on a 16-class basis, while other datasets typically use 200-1000 clas…
Figure 5: LAION-C poses a greater challenge to model robustness than ImageNet-C. We plot distortion intensity against each model’s average accuracy. Visual foundation models evaluated on ImageNet-C maintain high accuracy, with minimal drop across increasing intensity levels. On our LAION-C dataset, the models exhibit a sharper decline in accuracy, highlighting the benchmark’s effectiveness in measuring model robustn…
Figure 6: Human vs. machine accuracy on all distortions. For each LAION-C distortion, we plot the distortion intensity against the accuracy of the best human and the best model in this condition (for average human performance, see …
Figure 7: Interface presented to participants. This figure illustrates the icon layout as displayed to participants during the study. The grid is adapted from (Geirhos et al., 2018), while most of the categories and therefore symbols are different. … Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, United States) using a 12-core desktop computer (AMD …
Figure 8: Humans and models make different mistakes. We analyze the agreement of error patterns between different families of vision models (see Tab. 11 for a complete list) and human observers. The error consistency (κ) could theoretically achieve a maximum value of 1, but in line with earlier work (Geirhos et al., 2021), the EC values range between 0.15 and 0.45, indicating that behavioral differences between huma…
Figure 9: LAION-C can be solved. For every distortion, we plot the accuracy of our reference model (ViT-H-P14-336-CLIP-LAION-IN12K) before and after fine-tuning, in comparison to the best human participant for reference. Most distortions can be learned perfectly, only the Stickers and Mosaic distortions might have been too difficult at the highest intensity levels. Further performance gains might be possible with mo…
Figure 10: Performance Divergence of Models on LAION-C and ImageNet-C (1k classes). The figure illustrates the scattered performance of models across the ImageNet-C and LAION-C dataset, where a Kendall’s τ coefficient of 0.66 and the shallow slope indicate a dispersed performance on LAION-C. To provide a clearer trend and to better visualize the dispersion, we supplement the suite of models with additional top-perf…
Figure 11: Model performance on LAION-C. Analogous to …
Figure 12: Visual Reasoning in Gemini. We provide examples of visual reasoning in Gemini-1.5-Pro, consisting of a LAION-C sample, the reasons for classification that Gemini provided and meta-information (like the final label, the ground-truth label and corruption details). In line with our findings about Error Consistency (see …
Figure 13: Visual Reasoning in GPT. Figure analogous to …
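
The error consistency (κ) mentioned in the Figure 8 caption can be illustrated with a minimal sketch. It assumes the Cohen's-kappa-style definition from the error-consistency literature, in which the observed agreement between two observers on which trials they get right or wrong is compared with the agreement expected by chance given their accuracies. The function name and the toy response vectors are hypothetical and are not taken from the paper.

```python
def error_consistency(correct_a, correct_b):
    """Kappa-style error consistency between two observers.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    trial, True where the observer classified that trial correctly.
    Assumption: both observers saw the same trials in the same order.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(correct_a)
    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Observed agreement: both correct or both wrong on the same trial.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Agreement expected by chance for independent observers
    # with these two accuracies.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if c_exp == 1.0:
        # Both observers are always right (or always wrong): kappa is undefined.
        return float("nan")
    return (c_obs - c_exp) / (1 - c_exp)


# Toy data (hypothetical): two observers with 50% accuracy each.
same_errors = error_consistency([True, True, False, False],
                                [True, True, False, False])
chance_level = error_consistency([True, True, False, False],
                                 [True, False, True, False])
```

With this definition, identical error patterns give κ = 1 and agreement that is no better than chance gives κ = 0, which is the scale on which the 0.15 to 0.45 range quoted in the caption would be read.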
Original abstract

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces LAION-C as an alternative to saturated ImageNet-era OOD benchmarks such as ImageNet-C. It consists of six novel distortion types explicitly designed to remain out-of-distribution even for web-scale training corpora like LAION. The work reports comprehensive evaluations of state-of-the-art vision models and multimodal LLMs (including Gemini and GPT-4o), presents results from a controlled human psychophysical study, and concludes that the best current models now match or surpass the best human observers on these distortions, indicating a paradigm shift in OOD generalization.

Significance. If the OOD status of the distortions holds, LAION-C supplies a much-needed, forward-looking benchmark that can track genuine robustness progress once older corruption sets have been absorbed into web-scale training data. The direct comparison to lab-quality human performance data is a particular strength, as it grounds model numbers against a reproducible human baseline rather than relying solely on relative model rankings. The work also usefully documents saturation on legacy benchmarks and the resulting need for new test distributions.

major comments (1)
  1. [Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, or statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.
minor comments (2)
  1. [Abstract and introduction] The phrase 'paradigm shift' is used to characterize the model-human performance reversal; a more measured formulation (e.g., 'reversal on this benchmark') would better reflect that the result is tied to the specific six distortions rather than a universal change in OOD behavior.
  2. [Evaluation tables] It would be helpful to report per-distortion human and model accuracies alongside aggregate scores so readers can see whether the model-human parity holds uniformly or is driven by particular corruption types.
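
The second minor comment can be made concrete with a small sketch. It assumes that each distortion's score is the mean top-1 accuracy over its intensity levels and that the aggregate score is the mean over distortions; the helper names and all numbers below are hypothetical and only illustrate how an aggregate tie can hide per-distortion differences.

```python
from statistics import mean


def summarize(acc_by_distortion):
    """Return (per-distortion mean accuracy, overall mean across distortions).

    acc_by_distortion: {distortion name: [accuracy at each intensity level]}
    """
    per_distortion = {
        name: mean(levels) for name, levels in acc_by_distortion.items()
    }
    overall = mean(list(per_distortion.values()))
    return per_distortion, overall


def parity_gaps(model_acc, human_acc):
    """Model-minus-human accuracy gap for each distortion."""
    model_pd, _ = summarize(model_acc)
    human_pd, _ = summarize(human_acc)
    return {name: model_pd[name] - human_pd[name] for name in model_pd}


# Hypothetical accuracies at three intensity levels (not results from the paper).
model = {"Mosaic": [0.9, 0.7, 0.5], "Stickers": [0.8, 0.6, 0.4]}
human = {"Mosaic": [0.8, 0.6, 0.4], "Stickers": [0.9, 0.7, 0.5]}

_, model_overall = summarize(model)
_, human_overall = summarize(human)
gaps = parity_gaps(model, human)
```

In this toy case the two aggregate scores coincide while the per-distortion gaps have opposite signs, which is the situation the referee asks the tables to expose.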

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of LAION-C's significance as a forward-looking OOD benchmark and for highlighting the value of the human psychophysical study. We address the major comment on the distortion design and verification section below.

Point-by-point responses
  1. Referee: [Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, or statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.

    Authors: We agree that the verification procedure merits a more detailed and transparent description to fully substantiate the OOD status of the distortions. The current manuscript emphasizes the design principles chosen to minimize overlap with web-scale data but does not elaborate sufficiently on the empirical checks. In the revised manuscript we will expand the 'Distortion Design and Verification' section to specify the similarity-search method employed, the number of LAION images inspected, the quantitative similarity thresholds applied, and the statistical sampling strategy used. These additions will provide the rigorous account requested and allow readers to better evaluate the likelihood of undetected training-data exposure.

    Revision: yes
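
One way to read the "statistical sampling strategy" requested above is as a zero-hit bound: if a uniform random sample of images is inspected and none shows a given distortion, how prevalent could that distortion still be? The sketch below is not the authors' procedure. It assumes independent uniform sampling and a detector that never misses a distorted image, and the function names are hypothetical.

```python
import math


def zero_hit_upper_bound(n_inspected, confidence=0.95):
    """Upper confidence bound on prevalence after 0 hits in n_inspected samples.

    Solves (1 - p) ** n_inspected = 1 - confidence for p.
    """
    if n_inspected <= 0:
        raise ValueError("n_inspected must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_inspected)


def samples_needed(max_prevalence, confidence=0.95):
    """Smallest sample size for which 0 hits bounds prevalence by max_prevalence."""
    if not 0.0 < max_prevalence < 1.0:
        raise ValueError("max_prevalence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log(1.0 - max_prevalence))


# Example: zero hits among 1,000 randomly sampled images.
bound_1k = zero_hit_upper_bound(1000)
# Example: sample size needed to bound prevalence at 0.1% with 95% confidence.
n_for_01_percent = samples_needed(0.001)
```

For a 95% confidence level the bound is close to the familiar rule of three, roughly 3 divided by the number of inspected samples. Such a bound only limits how common a distortion could be; it cannot show that a distortion is entirely absent from a corpus of billions of images.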

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent test data


The paper introduces LAION-C as a new benchmark consisting of six novel distortion types and reports empirical model and human performance numbers on it. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claims rest on experimental results from applying existing models to the new test set and a separate psychophysical study, with no reduction of any result to self-citation chains or inputs by construction. The OOD assumption is a design claim subject to external verification rather than a tautological step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is an empirical benchmark rather than a theoretical derivation, so the ledger contains no fitted parameters or invented physical entities. The main background assumptions are standard computer-vision notions of distribution shift and the validity of the chosen distortion generation procedures.

axioms (1)
  • Domain assumption: The six chosen distortion families do not appear at scale in LAION or similar web-scraped corpora.
    Invoked when claiming the corruptions remain OOD for web-scale models.


