LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Fanfei Li; Robert Geirhos; Roland S. Zimmermann; Thomas Klein; Wieland Brendel

arXiv: 2506.16950 · v2 · pith:ZKL5V6FOnew · submitted 2025-06-20 · cs.CV · cs.LG

Pith reviewed 2026-05-22 00:48 UTC · model grok-4.3

keywords: out-of-distribution robustness · image distortion benchmark · web-scale vision models · novel corruptions · human-model comparison · LAION dataset

The pith

LAION-C introduces six novel image distortions that remain out-of-distribution for web-scale vision models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces LAION-C as a replacement for older benchmarks like ImageNet-C, which no longer qualify as out-of-distribution tests because common corruptions appear in today's large web-scraped training sets. The authors design and verify six new distortion types that are absent from such data, then evaluate a range of current models including multimodal systems. A separate psychophysical study measures human performance on the same distortions. The results indicate that top models now reach or exceed the accuracy of the best human observers.

Core claim

LAION-C consists of six novel distortion types specifically designed to be out-of-distribution even for web-scale datasets such as LAION. Comprehensive evaluation shows these distortions pose significant challenges to state-of-the-art models including MLLMs such as Gemini and GPT-4o. Comparison with lab-quality human data reveals a paradigm shift from humans outperforming models to the best models now matching or outperforming the best human observers.

What carries the argument

The LAION-C benchmark consisting of six novel distortion types verified to be absent from web-scale training distributions.

If this is right

Models trained on web-scale data still require advances to handle truly novel distortions.
Future robustness benchmarks must exclude distortions that already occur in internet-sourced training data.
Direct comparisons with human observers on these tasks set a new reference point for generalization progress.
Multimodal models exhibit similar vulnerabilities to these distortions as specialized vision models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scaling data volume alone may leave gaps in coverage for certain classes of image transformations.
The benchmark could serve as a diagnostic tool to guide development of more robust training procedures.
Analogous construction methods might be used to build OOD tests for video, audio, or multimodal inputs.

Load-bearing premise

The six novel distortion types are truly absent from the training distributions of web-scale datasets such as LAION and therefore constitute genuine OOD cases.

What would settle it

Locating any of the six distortion types in samples from LAION or similar web-scale collections would show they are not genuinely out-of-distribution.

Figures

Figures reproduced from arXiv: 2506.16950 by Fanfei Li, Robert Geirhos, Roland S. Zimmermann, Thomas Klein, Wieland Brendel.

**Figure 1.** ImageNet-C corruptions are not out-of-distribution (OOD) for web-scale datasets like LAION-400M. Exemplary corrupted images from ImageNet-C (left) are similar to LAION-400M samples (right). Each row shows example corruptions and dataset images for one ImageNet-C corruption category (Noise, Blur, Weather, Digital). The presence of these distortions in web-scale datasets indicates the need for an OOD benchm…

**Figure 2.** LAION-C distortions, intended to be OOD even for web-scale datasets. This figure illustrates the six LAION-C distortions at five intensity levels. Following the standard experimental paradigm from psychophysics, our dataset spans from near-perfect to chance-level difficulties, thoroughly testing models and leaving room for future model improvements. Best viewed on screen.

**Figure 3.** Performance divergence of models on LAION-C and ImageNet-C (16 classes). Evaluating models on the 16-class versions of ImageNet-C and LAION-C produces a plateaued performance on ImageNet-C, while LAION-C still yields a high variance across models.

**Figure 4.** LAION-C offers better resolution of model differences. We tested 9 models pre-trained on LAION2B, evaluating them across all intensity levels if applicable. LAION-C captures a broader variance in model performance, with a standard deviation of ∼27%, compared to an average of ∼10% in other common OOD datasets. Notably, LAION-C is tested on a 16-class basis, while other datasets typically use 200-1000 clas…

**Figure 5.** LAION-C poses a greater challenge to model robustness than ImageNet-C. We plot distortion intensity against each model’s average accuracy. Visual foundation models evaluated on ImageNet-C maintain high accuracy, with minimal drop across increasing intensity levels. On our LAION-C dataset, the models exhibit a sharper decline in accuracy, highlighting the benchmark’s effectiveness in measuring model robustn…

**Figure 6.** Human vs. machine accuracy on all distortions. For each LAION-C distortion, we plot the distortion intensity against the accuracy of the best human and the best model in this condition (for average human performance, see …

**Figure 7.** Interface presented to participants. This figure illustrates the icon layout as displayed to participants during the study. The grid is adapted from (Geirhos et al., 2018), while most of the categories and therefore symbols are different. Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, United States) using a 12-core desktop computer (AMD …

**Figure 8.** Humans and models make different mistakes. We analyze the agreement of error patterns between different families of vision models (see Tab. 11 for a complete list) and human observers. The error consistency (κ) could theoretically achieve a maximum value of 1, but in line with earlier work (Geirhos et al., 2021), the EC values range between 0.15 and 0.45, indicating that behavioral differences between huma…

**Figure 9.** LAION-C can be solved. For every distortion, we plot the accuracy of our reference model (ViT-H-P14-336-CLIP-LAIONIN12K) before and after fine-tuning, in comparison to the best human participant for reference. Most distortions can be learned perfectly, only the Stickers and Mosaic distortions might have been too difficult at the highest intensity levels. Further performance gains might be possible with mo…

**Figure 10.** Performance divergence of models on LAION-C and ImageNet-C (1k classes). The figure illustrates the scattered performance of models across the ImageNet-C and LAION-C dataset, where a Kendall’s τ coefficient of 0.66 and the shallow slope indicate a dispersed performance on LAION-C. To provide a clearer trend and to better visualize the dispersion, we supplement the suite of models with additional top-perf…

**Figure 11.** Model performance on LAION-C. Analogous to …

**Figure 12.** Visual Reasoning in Gemini. We provide examples of visual reasoning in Gemini-1.5-Pro, consisting of a LAION-C sample, the reasons for classification that Gemini provided and meta-information (like the final label, the ground-truth label and corruption details). In line with our findings about Error Consistency (see …

**Figure 13.** Visual Reasoning in GPT. Figure analogous to …
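Figure 8 reports error consistency (κ) between humans and models. The paper's exact computation is not shown on this page; the sketch below assumes the usual Cohen's-kappa-style definition, in which the observed agreement on correct/incorrect trials is compared with the agreement expected from the two accuracies alone. The function name and inputs are illustrative, not taken from the paper.

```python
import math


def error_consistency(correct_a, correct_b):
    """Kappa-style agreement between two observers' correct/incorrect patterns.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    trial, True when that observer classified the trial correctly.
    Assumed definition: kappa = (c_obs - c_exp) / (1 - c_exp).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")

    n = len(correct_a)
    acc_a = sum(bool(x) for x in correct_a) / n
    acc_b = sum(bool(x) for x in correct_b) / n

    # Observed agreement: both right or both wrong on the same trial.
    c_obs = sum(bool(a) == bool(b) for a, b in zip(correct_a, correct_b)) / n

    # Agreement expected by chance if errors were independent.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if math.isclose(c_exp, 1.0):
        # Both observers are always right (or always wrong): kappa is undefined.
        return float("nan")
    return (c_obs - c_exp) / (1 - c_exp)
```

Under this definition, a value of 1 means the two observers err on exactly the same trials, 0 means their errors overlap no more than chance predicts, and negative values mean they overlap less than chance.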

Original abstract

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LAION-C adds six new distortion types and human baselines that could track real OOD progress, but the paradigm-shift claim depends on whether those distortions are genuinely absent from LAION-scale training data.

Hi colleague,

The punchline on this one is that LAION-C introduces six novel corruption families that the authors argue stay OOD for current web-scale datasets, paired with model tests and a human baseline that suggests top models now match or exceed human performance on these distortions.

What stands out as new is the deliberate design of distortions to sidestep the artifacts already baked into LAION and similar collections. Previous benchmarks like ImageNet-C have become less informative because models train on data that includes those exact issues. Here they evaluate a range of state-of-the-art systems, including multimodal ones, and run psychophysical experiments with humans to get comparable numbers. That setup lets them point to a shift where models are closing the gap.

They handle the evaluations comprehensively enough on the surface, and the human data provides a useful anchor. No free parameters or invented entities in the core claims; it's straightforward empirical work.

The softer area is confirming that the new distortions truly have no presence in the training distributions. With datasets in the billions, even careful checks might miss rare or similar instances, which could mean the observed performance reflects partial exposure rather than pure OOD robustness. That makes the paradigm-shift claim a bit provisional until the dataset and verification details are fully out and scrutinized.

This paper is for the robustness community working on large vision and vision-language models. Readers interested in benchmarks that keep pace with scaling will find it relevant. It has enough substance and timeliness to warrant a serious referee, though I'd expect questions on the OOD verification process.

Recommendation: send it out for review.

Referee Report

1 major / 2 minor

Summary. The paper introduces LAION-C as an alternative to saturated ImageNet-era OOD benchmarks such as ImageNet-C. It consists of six novel distortion types explicitly designed to remain out-of-distribution even for web-scale training corpora like LAION. The work reports comprehensive evaluations of state-of-the-art vision models and multimodal LLMs (including Gemini and GPT-4o), presents results from a controlled human psychophysical study, and concludes that the best current models now match or surpass the best human observers on these distortions, indicating a paradigm shift in OOD generalization.

Significance. If the OOD status of the distortions holds, LAION-C supplies a much-needed, forward-looking benchmark that can track genuine robustness progress once older corruption sets have been absorbed into web-scale training data. The direct comparison to lab-quality human performance data is a particular strength, as it grounds model numbers against a reproducible human baseline rather than relying solely on relative model rankings. The work also usefully documents saturation on legacy benchmarks and the resulting need for new test distributions.

major comments (1)

[Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, and the statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.
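One way to see why the number of inspected images matters: if a uniformly random sample of n images contains zero instances of a distortion, the true prevalence can still be as high as 1 − α^(1/n) at confidence 1 − α (roughly 3/n at 95%, the "rule of three"). The sketch below is a generic calculation under that random-sampling assumption, not the authors' verification procedure; the function names are hypothetical.

```python
def zero_hit_upper_bound(n_inspected, confidence=0.95):
    """Upper bound on prevalence after n_inspected random draws with zero hits.

    Assumes draws are independent and uniformly random over the corpus.
    Solves (1 - p)^n = alpha for p, with alpha = 1 - confidence.
    """
    if n_inspected <= 0:
        raise ValueError("n_inspected must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_inspected)


def max_undetected_images(n_inspected, corpus_size, confidence=0.95):
    """Largest number of affected images still compatible with zero observed hits."""
    return zero_hit_upper_bound(n_inspected, confidence) * corpus_size
```

For example, with a hypothetical 1,000 inspected images and a corpus of 400 million images (the scale of LAION-400M), `max_undetected_images(1_000, 400_000_000)` comes out at roughly 1.2 million images, so a clean manual sample bounds prevalence only loosely.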

minor comments (2)

[Abstract and introduction] The phrase 'paradigm shift' is used to characterize the model-human performance reversal; a more measured formulation (e.g., 'reversal on this benchmark') would better reflect that the result is tied to the specific six distortions rather than a universal change in OOD behavior.
[Evaluation tables] It would be helpful to report per-distortion human and model accuracies alongside aggregate scores so readers can see whether the model-human parity holds uniformly or is driven by particular corruption types.
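The second minor comment can be made concrete with a small aggregation sketch. It assumes accuracies are stored per distortion as one value per intensity level, averages over levels for each distortion, and then averages over distortions for the overall score. The data layout and function names are assumptions for illustration, and any numbers passed in are the caller's own, not results from the paper.

```python
from statistics import mean


def summarize(acc_by_distortion):
    """Return (per-distortion mean accuracy, overall mean across distortions).

    acc_by_distortion: dict mapping a distortion name to a list of accuracies,
    one per intensity level.
    """
    per_distortion = {
        name: mean(levels) for name, levels in acc_by_distortion.items()
    }
    overall = mean(list(per_distortion.values()))
    return per_distortion, overall


def model_minus_human(model_acc, human_acc):
    """Per-distortion accuracy gap (model minus human) for shared distortions."""
    model_per, _ = summarize(model_acc)
    human_per, _ = summarize(human_acc)
    return {
        name: model_per[name] - human_per[name]
        for name in model_per
        if name in human_per
    }
```

Reporting the output of `model_minus_human` next to the two overall scores shows whether an aggregate lead comes from every distortion or from a large gap on only a few.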

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of LAION-C's significance as a forward-looking OOD benchmark and for highlighting the value of the human psychophysical study. We address the major comment on the distortion design and verification section below.

Referee: [Distortion design and verification section] The central claim that LAION-C measures genuine OOD generalization (and therefore supports the reported paradigm shift) rests on the assertion that none of the six novel distortions appear in LAION or comparable web-scale corpora. The manuscript describes the design intent but provides insufficient detail on the verification procedure: specifically, the similarity-search method employed, the number of LAION images inspected, any quantitative similarity thresholds, and the statistical sampling strategy. Because exhaustive verification over billions of images is infeasible, a more rigorous and transparent account of the checks performed is required to rule out undetected overlap that could explain model performance through training-data exposure rather than improved generalization.

Authors: We agree that the verification procedure merits a more detailed and transparent description to fully substantiate the OOD status of the distortions. The current manuscript emphasizes the design principles chosen to minimize overlap with web-scale data but does not elaborate sufficiently on the empirical checks. In the revised manuscript we will expand the 'Distortion Design and Verification' section to specify the similarity-search method employed, the number of LAION images inspected, the quantitative similarity thresholds applied, and the statistical sampling strategy used. These additions will provide the rigorous account requested and allow readers to better evaluate the likelihood of undetected training-data exposure.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent test data

The paper introduces LAION-C as a new benchmark consisting of six novel distortion types and reports empirical model and human performance numbers on it. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claims rest on experimental results from applying existing models to the new test set and a separate psychophysical study, with no reduction of any result to self-citation chains or inputs by construction. The OOD assumption is a design claim subject to external verification rather than a tautological step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is an empirical benchmark rather than a theoretical derivation, so the ledger contains no fitted parameters or invented physical entities. The main background assumptions are standard computer-vision notions of distribution shift and the validity of the chosen distortion generation procedures.

axioms (1)

Domain assumption: The six chosen distortion families do not appear at scale in LAION or similar web-scraped corpora.
Invoked when claiming the corruptions remain OOD for web-scale models.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1]

Are we done with imagenet? arXiv preprint arXiv:2006.07159,

URL https://arxiv.org/abs/2006.07159. Biederman, I. and Ju, G. Surface versus edge-based determinants of visual recognition. Cognitive psychology, 20(1):38–64,

[2]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

URL https://arxiv.org/abs/1704.04861. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR,

[3]

and Tong, F

Jang, H. and Tong, F. Improved modeling of human vision by incorporating robustness to blur in convolutional neural networks. Nature Communications, 15(1):1989,

[4]

Kellman, P

URL https://arxiv.org/abs/1908.08016. Kellman, P. J. and Spelke, E. S. Perception of partly occluded objects in infancy. Cognitive psychology, 15(4):483–524,

[5]

The role of imagenet classes in Fréchet inception distance

URL https://arxiv.org/abs/2203.06026. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR,

[6]

Does clip's generalization performance mainly stem from high train-test similarity?, 2024a

URL https://arxiv.org/abs/2310.09562. Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization. In ICML 2024 Workshop on Foundation Models in the Wild,

[7]

Benchmarking robustness in object detection: Autonomous driving when winter is coming

URL https://openreview.net/forum?id=Bc2p8T4V32. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., Bethge, M., and Brendel, W. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484,

[8]

Radford, A., Kim, J

URL https://arxiv.org/abs/2208.06366. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML,

[9]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

[10]

Resnet strikes back: An improved training procedure in timm,

URL https://arxiv.org/abs/2110.00476. Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the national academy of sciences,

[11]

Wide Residual Networks

URL https://arxiv.org/abs/1605.07146. Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In CVPR,

[12]

Appendix A.1

A. Appendix A.1. Related Work OOD generalization ability of vision models. As deep learning has advanced to the point where models can reliably generalize to data that matches their training distribution or even exceed the quality of the original labels (Beyer et al., 2020), OOD-robu...

[13]

A more subtle distribution shift which still caused considerable drops in model performance for ImageNet-trained models, was proposed by Recht et al

introduce a dataset of black and white sketches matching the labels and scale of the ImageNet validation set, called ImageNet-Sketch. A more subtle distribution shift which still caused considerable drops in model performance for ImageNet-trained models, was proposed by Recht et al. (2019). They collected ImageNetV2, a new test set for ImageNet that shoul...

[14]

has redefined what constitutes standard performance across many visual tasks. These improvements in performance partially stem from architectural innovations and parameter optimization, but were mostly powered by the effective leveraging of unprecedented dataset sizes (Zhai et al., 2022). However, because visual foundation models were trained on web-scale...

[15]

and were found to be the best available models for neuronal activity in the primate visual cortex (Yamins et al., 2014), even if not trained for this task. Today, there is a growing body of research dedicated to evaluating the adequacy of neural networks as behavioral models of human core object recognition (Doerig et al., 2023; Schrimpf et al., 2018; Wic...

[16]

This figure illustrates the icon layout as displayed to participants during the study

Interface presented to participants. This figure illustrates the icon layout as displayed to participants during the study. The grid is adapted from (Geirhos et al., 2018), while most of the categories and therefore symbols are different. Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, U...

[17]

Congratulations! You just earned some extra money!

To encourage responses rather than leaving selections blank, a message was displayed at the top of the screen 0.75 seconds before icon display time ended, prompting participants to make a choice. At the end of each block, if a participant surpassed the 90% accuracy threshold calibrated using internal baseline performance data, they received an encouraging ...

[18]

We analyze the agreement of error patterns between different families of vision models (see Tab

Humans and models make different mistakes. We analyze the agreement of error patterns between different families of vision models (see Tab. 11 for a complete list) and human observers. The error consistency (κ) could theoretically achieve a maximum value of 1, but in line with earlier work (Geirhos et al., 2021), the EC values range between 0.15 and 0.45...

[19]

Tile sizes at each level. A.4.2. GLITCHED The original image undergoes an artistic digital corruption, with horizontal lines overlaying shifted image segments and color channel shifts. Here, a region refers to a randomly selected rectangular area of the image, and a shift denotes the horizontal displacement (left or right) of that region by a certain per...

[20]

totallynotchase

Glitch parameters at each level. The implementation is inspired by GitHub user “totallynotchase” (T, 2020). A.4.3. VERTICAL LINES The original image is transformed through a process of vertical deconstruction. It is first divided into multiple vertical sections, which are further subdivided along the y-axis into small segments called y-steps. In each of ...

[21]

Vertical sectioning and step sizes at each level. A.4.4. GEOMETRIC SHAPES The original image is overlaid with overlapping geometric figures such as squares, circles, and stars. This visual clutter introduces local noise that obscures the main object, like the Kaleidoscope corruption from (Kaufmann et al., 2019). The number of shapes for each intensity le...

[22]

Performance Divergence of Models on LAION-C and ImageNet-C (1k classes). The figure illustrates the scattered performance of models across the ImageNet-C and LAION-C dataset, where a Kendall’s τ coefficient of 0.66 and the shallow slope indicate a dispersed performance on LAION-C. To provide a clearer trend and to better visualize the dispersion, we suppl...

[23]

Numbers show top-1 accuracy in percent

LAION-C benchmark results. Numbers show top-1 accuracy in percent. ImageNet refers to model accuracy on the (uncorrupted) ImageNet validation set (values sourced from the timm leaderboard (Wightman, 2024)). For each corruption, we report mean top-1 accuracy across all intensity levels, with LAION-C as the overall benchmark metric (averaged across corrup...

[24]

GPT-4o gpt-4o-2024-08-06 At the time of writing, the most recent snapshot of OpenAI’s flagship model (OpenAI, 2024)

Abbreviation Full Model Name Description EVA-G-P14-560-M30M-IN22K evagiantpatch14560.m30mftin22kin1k EVA giant model, patch size 14, pre-trained with masked image modeling (MIM) on a Merged-30M dataset, fine-tuned on ImageNet-22k and ImageNet-1k (Fang et al., 2023). EVA02-L-P14-448-MIM-M38M-IN22K eva02largepatch14448.mimm38mftin22kin1k EVA02 large mode...

[25]

ImageNet

and can be traced back to its original ImageNet class label. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No information is missing...

[1] [1]

Are we done with imagenet?arXiv preprint arXiv:2006.07159,

URL https://arxiv.org/abs/ 2006.07159. Biederman, I. and Ju, G. Surface versus edge-based deter- minants of visual recognition. Cognitive psychology, 20 (1):38–64,

work page arXiv 2006

[2] [2]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

URL https://arxiv.org/abs/ 1704.04861. Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

and Tong, F

Jang, H. and Tong, F. Improved modeling of human vi- sion by incorporating robustness to blur in convolutional neural networks. Nature Communications, 15(1):1989,

work page 1989

[4] [4]

Kellman, P

URL https://arxiv.org/abs/ 1908.08016. Kellman, P. J. and Spelke, E. S. Perception of partly oc- cluded objects in infancy. Cognitive psychology, 15(4): 483–524,

work page arXiv 1908

[5] [5]

The role of imagenet classes in fr \’echet inception distance

URL https: //arxiv.org/abs/2203.06026. Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In CVPR,

work page arXiv

[6] [6]

Does clip's generalization performance mainly stem from high train-test similarity?, 2024 a

URL https://arxiv.org/abs/ 2310.09562. Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization. In ICML 2024 Workshop on F oundation Models in the Wild ,

work page arXiv 2024

[7] [7]

Benchmarking ro- bustness in object detection: Autonomous driving when win- ter is coming

URL https://openreview.net/forum? id=Bc2p8T4V32. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bring- mann, O., Ecker, A. S., Bethge, M., and Brendel, W. Benchmarking robustness in object detection: Au- tonomous driving when winter is coming.arXiv preprint arXiv:1907.07484,

work page arXiv 1907

[8] [8]

Radford, A., Kim, J

URL https: //arxiv.org/abs/2208.06366. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In ICML,

work page arXiv

[9] [9]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V . I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understand- ing across millions of tokens of context. arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Resnet strikes back: An im- proved training procedure in timm,

URL https://arxiv.org/abs/ 2110.00476. Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the national academy of sciences,

work page arXiv

[11] [11]

Wide Residual Networks

URL https://arxiv.org/ abs/1605.07146. Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scal- ing vision transformers. In CVPR,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Appendix A.1

12 LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models A. Appendix A.1. Related Work OOD generalization ability of vision models. As deep learning has advanced to the point where models can reliably generalize to data that matches their training distribution or even exceed the quality of the original labels (Beyer et al., 2020), OOD-robu...

work page 2020

[13] [13]

A more subtle distribution shift which still caused considerable drops in model performance for ImageNet-trained models, was proposed by Recht et al

introduce a dataset of black and white sketches matching the labels and scale of the ImageNet validation set, called ImageNet-Sketch. A more subtle distribution shift which still caused considerable drops in model performance for ImageNet-trained models, was proposed by Recht et al. (2019). They collected ImageNetV2, a new test set for ImageNet that shoul...

work page 2019

[14] [14]

has redefined what constitutes standard performance across many visual tasks. These improvements in performance partially stem from architectural innovations and parameter optimization, but were mostly powered by the effective leveraging of unprecedented dataset sizes (Zhai et al., 2022). However, because visual foundation models were trained on web-scale...

[15]

and were found to be the best available models for neuronal activity in the primate visual cortex (Yamins et al., 2014), even if not trained for this task. Today, there is a growing body of research dedicated to evaluating the adequacy of neural networks as behavioral models of human core object recognition (Doerig et al., 2023; Schrimpf et al., 2018; Wic...

[16]

This figure illustrates the icon layout as displayed to participants during the study

Interface presented to participants. This figure illustrates the icon layout as displayed to participants during the study. The grid is adapted from Geirhos et al. (2018), although most of the categories, and therefore the symbols, are different. Toolbox (Kleiner et al., 2007, version 3.0.12) in MATLAB (Release 2016a, The MathWorks, Inc., Natick, Massachusetts, U...

[17]

Congratulations! You just earned some extra money!

To encourage responses rather than leaving selections blank, a message was displayed at the top of the screen 0.75 seconds before the icon display time ended, prompting participants to make a choice. At the end of each block, if a participant surpassed the 90% accuracy threshold calibrated using internal baseline performance data, they received an encouraging ...

[18]

We analyze the agreement of error patterns between different families of vision models (see Tab

Humans and models make different mistakes. We analyze the agreement of error patterns between different families of vision models (see Tab. 11 for a complete list) and human observers. The error consistency (κ) could theoretically achieve a maximum value of 1, but in line with earlier work (Geirhos et al., 2021), the EC values range between 0.15 and 0.45...
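As a rough illustration of the error consistency measure mentioned above, the sketch below computes Cohen's κ on trial-wise correctness for two observers. It assumes the common definition, κ = (c_obs − c_exp) / (1 − c_exp), where c_obs is the fraction of trials on which both observers are right or both are wrong, and c_exp is the agreement expected from two independent observers with the same accuracies. The function name and the handling of the degenerate case are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa on the trial-wise correctness of two observers.

    Both inputs hold one boolean per trial (True = correct response),
    with trials in the same order for both observers.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("need two non-empty sequences of equal length")

    # Observed consistency: both right or both wrong on the same trial.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n

    # Consistency expected by chance from the two accuracies alone.
    p_a = sum(correct_a) / n
    p_b = sum(correct_b) / n
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    if c_exp == 1.0:
        # Assumption: kappa is undefined here (both observers always right,
        # or both always wrong); return 0.0 by convention.
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)
```

A value of 1 means the two observers err on exactly the same trials, 0 means their overlap is what accuracy alone would predict, and negative values mean they systematically err on different trials.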

[19]

Tile sizes at each level. A.4.2. GLITCHED The original image undergoes an artistic digital corruption, with horizontal lines overlaying shifted image segments and color channel shifts. Here, a region refers to a randomly selected rectangular area of the image, and a shift denotes the horizontal displacement (left or right) of that region by a certain per...
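The glitch description above can be approximated with a short NumPy sketch. This is not the authors' implementation: the function name, every parameter and default, the use of full-width bands instead of arbitrary rectangles, the wrap-around shifting, and the black overlay lines are all assumptions made for illustration.

```python
import numpy as np


def glitch(image: np.ndarray, n_regions: int = 8, max_shift_frac: float = 0.2,
           channel_shift: int = 4, n_lines: int = 5, seed: int = 0) -> np.ndarray:
    """Return a glitched copy of an H x W x 3 uint8 image (illustrative only)."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w, _ = out.shape
    max_shift = max(1, int(max_shift_frac * w))

    for _ in range(n_regions):
        # Simplification: each region is a full-width horizontal band.
        top = int(rng.integers(0, h))
        height = int(rng.integers(1, max(2, h // 8)))
        shift = int(rng.integers(-max_shift, max_shift + 1))
        # Displace the band left or right, wrapping around at the border.
        out[top:top + height] = np.roll(out[top:top + height], shift, axis=1)

    # Colour channel shift: move red and blue in opposite directions.
    out[..., 0] = np.roll(out[..., 0], channel_shift, axis=1)
    out[..., 2] = np.roll(out[..., 2], -channel_shift, axis=1)

    # Overlay a few one-pixel horizontal lines (drawn in black, an assumption).
    rows = rng.integers(0, h, size=n_lines)
    out[rows] = 0
    return out
```

Raising `n_regions`, `max_shift_frac`, or `channel_shift` would be one way to mimic increasing intensity levels, although the parameter values used in the paper are not reproduced here.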

[20]

totallynotchase

Glitch parameters at each level. The implementation is inspired by GitHub user “totallynotchase” (T, 2020). A.4.3. VERTICAL LINES The original image is transformed through a process of vertical deconstruction. It is first divided into multiple vertical sections, which are further subdivided along the y-axis into small segments called y-steps. In each of ...

[21]

Vertical sectioning and step sizes at each level. A.4.4. GEOMETRIC SHAPES The original image is overlaid with overlapping geometric figures such as squares, circles, and stars. This visual clutter introduces local noise that obscures the main object, similar to the Kaleidoscope corruption of Kaufmann et al. (2019). The number of shapes for each intensity le...
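The shape overlay described above might look roughly like the following NumPy sketch. It is a simplified stand-in rather than the paper's code: it draws only filled circles and squares (stars are omitted), and the function name, the size limit, the colour sampling, and the default shape count are hypothetical.

```python
import numpy as np


def overlay_shapes(image: np.ndarray, n_shapes: int = 20,
                   max_size_frac: float = 0.25, seed: int = 0) -> np.ndarray:
    """Overlay random filled circles and squares on an H x W x 3 uint8 image."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w, _ = out.shape
    yy, xx = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    max_radius = max(2, int(max_size_frac * min(h, w) / 2))

    for _ in range(n_shapes):
        cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
        radius = int(rng.integers(1, max_radius + 1))
        color = rng.integers(0, 256, size=3, dtype=np.uint8)
        if rng.random() < 0.5:
            # Filled circle around (cy, cx).
            mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        else:
            # Filled axis-aligned square around (cy, cx).
            mask = (np.abs(yy - cy) <= radius) & (np.abs(xx - cx) <= radius)
        # Later shapes paint over earlier ones, so shapes can overlap.
        out[mask] = color
    return out
```

In this sketch, intensity would be controlled mainly through `n_shapes`; the per-level counts from the paper's table are not reproduced here.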

[22]

Performance Divergence of Models on LAION-C and ImageNet-C (1k classes). The figure illustrates the scattered performance of models across the ImageNet-C and LAION-C datasets, where a Kendall’s τ coefficient of 0.66 and the shallow slope indicate a dispersed performance on LAION-C. To provide a clearer trend and to better visualize the dispersion, we suppl...
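For readers unfamiliar with the rank correlation used in this caption, the sketch below computes Kendall's τ between two lists of per-model accuracies. It implements the tau-a variant, which assumes there are no tied values; whether the paper uses tau-a or the tie-corrected tau-b is not stated in the excerpt, and the function name is illustrative.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a between paired scores, e.g. accuracy on two benchmarks."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    concordant = discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        sign = (x_i - x_j) * (y_i - y_j)
        if sign > 0:
            concordant += 1  # both benchmarks rank the pair the same way
        elif sign < 0:
            discordant += 1  # the benchmarks disagree on the pair's order
        # Tied pairs count toward neither (tau-a has no tie correction).

    n_pairs = n * (n - 1) // 2
    return (concordant - discordant) / n_pairs
```

A τ of 1 would mean both benchmarks rank all models identically; values below 1 indicate that some model pairs swap order between the two benchmarks.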

[23]

Numbers show top-1 accuracy in percent

LAION-C benchmark results. Numbers show top-1 accuracy in percent. ImageNet refers to model accuracy on the (uncorrupted) ImageNet validation set (values sourced from the timm leaderboard (Wightman, 2024)). For each corruption, we report mean top-1 accuracy across all intensity levels, with LAION-C as the overall benchmark metric (averaged across corrup...
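The aggregation described in this caption can be sketched as follows. The sketch assumes an unweighted mean over intensity levels per corruption, followed by an unweighted mean over corruptions for the overall score; the function name and the input layout (a dictionary from corruption name to a list of per-level accuracies) are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def aggregate_scores(acc_by_corruption: Dict[str, List[float]]) -> Dict[str, float]:
    """Average top-1 accuracy per corruption, plus an overall benchmark score.

    acc_by_corruption maps a corruption name to its top-1 accuracies
    (in percent), one entry per intensity level.
    """
    if not acc_by_corruption:
        raise ValueError("need at least one corruption")

    # Mean over intensity levels, computed separately for each corruption.
    scores = {name: mean(levels) for name, levels in acc_by_corruption.items()}

    # Overall metric: unweighted mean over corruptions (an assumption).
    overall = mean(list(scores.values()))
    scores["LAION-C"] = overall
    return scores
```

Averaging per corruption first gives every corruption equal weight in the overall score, regardless of how many intensity levels it has.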

[24]

GPT-4o gpt-4o-2024-08-06 At the time of writing, the most recent snapshot of OpenAI’s flagship model (OpenAI, 2024)

Abbreviation | Full Model Name | Description
EVA-G-P14-560-M30M-IN22K | eva_giant_patch14_560.m30m_ft_in22k_in1k | EVA giant model, patch size 14, pre-trained with masked image modeling (MIM) on a Merged-30M dataset, fine-tuned on ImageNet-22k and ImageNet-1k (Fang et al., 2023).
EVA02-L-P14-448-MIM-M38M-IN22K | eva02_large_patch14_448.mim_m38m_ft_in22k_in1k | EVA02 large mode...

[25]

ImageNet

and can be traced back to its original ImageNet class label. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No information is missing...
