pith. machine review for the scientific record.

arxiv: 2603.07462 · v2 · submitted 2026-03-08 · 💻 cs.AI

Recognition: no theorem link

Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 15:10 UTC · model grok-4.3

classification 💻 cs.AI
keywords: out-of-distribution · human-AI alignment · error alignment · perceptual difficulty · vision-language models · object recognition · CNN · ViT

The pith

A spectrum of human perceptual difficulty shows vision-language models align with humans most consistently across out-of-distribution conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that measures out-of-distribution degree by how much human accuracy drops on distorted images rather than by model training data or fixed distortion parameters. This produces a calibrated spectrum with four regimes of increasing perceptual challenge for object recognition tasks. Applying the spectrum to compare error patterns reveals that vision-language models match human failures most steadily from near to far OOD. Convolutional networks align better than vision transformers on easier near-OOD cases, while the ranking reverses on harder far-OOD cases. The result matters because it shows alignment is not uniform but depends on the specific level of human-like difficulty.

Core claim

By quantifying how far a set of stimuli deviates from an undistorted reference according to human accuracy, the method defines an OOD spectrum and identifies four regimes of perceptual challenge. This enables direct model-human error-alignment comparisons at matched difficulty levels, exposing architecture-specific patterns: vision-language models are the most consistently aligned, CNNs outperform ViTs on near-OOD, and ViTs outperform CNNs on far-OOD.

What carries the argument

The human-centred OOD spectrum, constructed by measuring stimulus deviation from an undistorted reference set via drops in human recognition accuracy, which partitions stimuli into four calibrated regimes of perceptual difficulty.
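
A minimal sketch of that construction in Python. The equal-count quartile binning is a stand-in assumption; per Figures 6 and 7, the paper's actual regime boundaries involve a logit transformation and BIC/AICc model selection, and the function and variable names here are illustrative, not from released code.

    import numpy as np

    def ood_spectrum(human_acc, reference_acc):
        """Order distortion conditions by the drop in human accuracy and
        bin them into four regimes of increasing perceptual difficulty.

        human_acc:     dict condition -> mean human accuracy on that condition
        reference_acc: mean human accuracy on the undistorted reference set
        """
        # OOD degree = how far human accuracy falls below the reference;
        # a larger drop places the condition further along the spectrum.
        degree = {c: reference_acc - a for c, a in human_acc.items()}

        # Quartile cut points are a placeholder for the paper's fitted
        # regime boundaries; they guarantee four regimes, nothing more.
        drops = np.array(sorted(degree.values()))
        edges = np.quantile(drops, [0.25, 0.50, 0.75])

        # Regime 0 = near-OOD ... regime 3 = far-OOD.
        regimes = {c: int(np.searchsorted(edges, d)) for c, d in degree.items()}
        return degree, regimes

A call such as ood_spectrum({"low-pass-sigma3": 0.81, "phase-noise-120": 0.42}, 0.95) (hypothetical condition names and accuracies) would place the small-drop condition in a nearer regime than the large-drop one.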

If this is right

  • Model-human comparisons become possible at explicitly matched difficulty levels instead of arbitrary OOD definitions.
  • Different deep-learning architectures exhibit distinct alignment profiles that depend on the perceptual regime.
  • Vision-language models maintain the steadiest error alignment from near-OOD through far-OOD conditions.
  • Assessments of alignment must account for cross-condition differences in perceptual difficulty to be reliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectrum could be applied to other sensory domains such as audio or text to test whether architecture-specific alignment patterns generalize.
  • Training objectives that target alignment specifically within each regime might produce models that stay closer to human error patterns overall; a toy objective of this kind is sketched after this list.
  • Hybrid architectures that combine early-layer features from CNNs with later-layer features from ViTs could reduce the observed crossover in alignment.
  • The framework supplies a concrete way to select test stimuli that probe alignment at the exact difficulty levels where current models diverge most from humans.
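
A toy version of the regime-targeted objective from the second bullet, purely illustrative: the paper trains nothing, and both the 1 - agreement penalty and the per-regime weighting are assumptions.

    import numpy as np

    def regime_weighted_objective(task_loss, alignment_gap, regime, regime_weights):
        """Hypothetical loss: mean task loss plus a per-regime-weighted
        penalty for misalignment with human error patterns.

        task_loss:      per-example task losses (e.g. cross-entropy)
        alignment_gap:  per-example misalignment, e.g. 1 - agreement with
                        the human error pattern on matched stimuli
        regime:         per-example regime index (0 = near-OOD ... 3 = far-OOD)
        regime_weights: one penalty weight per regime
        """
        task_loss = np.asarray(task_loss)
        gap = np.asarray(alignment_gap)
        # Look up each example's penalty weight from its regime label.
        w = np.asarray(regime_weights)[np.asarray(regime)]
        return task_loss.mean() + (w * gap).mean()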

Load-bearing premise

Human accuracy on distorted stimuli supplies a valid, general, and model-independent yardstick for perceptual difficulty.
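
One way to probe this premise short of new experiments is to check that the human difficulty ordering is at least stable across independent halves of the observer pool. A minimal split-half sketch; the (n_subjects, n_conditions) layout is an assumption for illustration, not the paper's data format.

    import numpy as np
    from scipy.stats import spearmanr

    def split_half_reliability(acc, n_splits=1000, seed=0):
        """acc: (n_subjects, n_conditions) NumPy array of per-subject
        accuracies.  Repeatedly split subjects in half and rank-correlate
        the two halves' condition orderings; a mean rho near 1 says the
        human yardstick is internally consistent (a necessary, though not
        sufficient, condition for it to be a valid general measure)."""
        rng = np.random.default_rng(seed)
        n = acc.shape[0]
        rhos = []
        for _ in range(n_splits):
            perm = rng.permutation(n)
            a = acc[perm[: n // 2]].mean(axis=0)   # ordering from half A
            b = acc[perm[n // 2:]].mean(axis=0)    # ordering from half B
            rho, _ = spearmanr(a, b)
            rhos.append(rho)
        return float(np.mean(rhos))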

What would settle it

A replication in which the same images ordered by the human-accuracy spectrum produce different regime boundaries or reverse the reported CNN-ViT alignment crossover would falsify the central claim.
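
Operationally, "reverse the crossover" can be given a resampling form. A bootstrap sketch under assumed inputs: per-condition alignment scores for a CNN and a ViT family as NumPy arrays, plus regime labels from the spectrum; none of these arrays come from the paper.

    import numpy as np

    def crossover_stability(ec_cnn, ec_vit, regime, n_boot=10_000, seed=0):
        """Fraction of bootstrap resamples of conditions in which the
        reported pattern holds: CNN > ViT alignment in near-OOD (regime 0)
        and ViT > CNN in far-OOD (regime 3).  Values near 1 mean the
        crossover would be hard to reverse in a replication."""
        rng = np.random.default_rng(seed)
        near = np.flatnonzero(regime == 0)
        far = np.flatnonzero(regime == 3)
        holds = 0
        for _ in range(n_boot):
            i = rng.choice(near, size=near.size, replace=True)
            j = rng.choice(far, size=far.size, replace=True)
            holds += (ec_cnn[i].mean() > ec_vit[i].mean()
                      and ec_vit[j].mean() > ec_cnn[j].mean())
        return holds / n_boot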

Figures

Figures reproduced from arXiv: 2603.07462 by Binxia Xu, Luke Dickens, Robert M. Mok, Xiaoliang Luo.

Figure 1: A human-centered OOD spectrum. (A) Example images from two distortion types (left column: low-pass …)
Figure 2: Human-human error alignment. (A) Dot plots for Error Consistency (EC; top) and Misclassification Agreement …
Figure 3: Model-human alignment across distortion levels. Accuracy (ACC, top row), Error Consistency (EC, middle …)
Figure 4: Radar plots of mean model–human error alignment for each superfamily across distortion types, separated by …
Figure 5: Model–human error alignment rankings. (A) Model–human error alignment values normalised by …
Figure 6: Histogram of accuracy on the uncorrupted datasets before and after logit transformation.
Figure 7: Model selection using the Bayesian Information Criterion (BIC) and corrected Akaike Information Criterion …
Figure 8: Accuracy (ACC, top row), Error Consistency (EC, middle row), and Misclassification Agreement (MA, …)
Figure 9: Accuracy (ACC, top row), Error Consistency (EC, middle row), and Misclassification Agreement (MA, …)
Figure 10: The ranking for each metric.
Figure 11: Radar plots of mean model–human error alignment for each subfamily across distortion types, separated by …
Original abstract

Determining whether AI systems process information similarly to humans is central to cognitive science and trustworthy AI. While modern AI models can match human accuracy on standard tasks, such parity does not guarantee that their underlying decision-making strategies resemble those of humans. Assessing performance using error alignment metrics to compare how humans and models fail, and how this changes for distorted, or otherwise more challenging, stimuli, provides a viable pathway toward a finer characterization of model-human alignment. However, existing out-of-distribution (OOD) analyses for challenging stimuli are limited due to methodological choices: they define OOD shift relative to model training data or use arbitrary distortion-specific parameters with little correspondence to human perception, hindering principled comparisons. We propose a human-centred framework that redefines the degree of OOD as a spectrum of human perceptual difficulty. By quantifying how much a collection of stimuli deviates from an undistorted reference set based on human accuracy, we construct an OOD spectrum and identify four distinct regimes of perceptual challenge. This approach enables principled model-human comparisons at calibrated difficulty levels. We apply this framework to object recognition and reveal unique, regime-dependent model-human alignment rankings and profiles across deep learning architectures. Vision-language models are most consistently human aligned across near- and far-OOD conditions, but convolutional neural networks (CNNs) are more aligned than vision transformers (ViTs) for near-OOD and ViTs are more aligned than CNNs for far-OOD. Our work demonstrates the critical importance of accounting for cross-condition differences, such as perceptual difficulty, for a principled assessment of model-human alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a human-centred framework to define an out-of-distribution (OOD) spectrum based on human perceptual difficulty for assessing error alignment between humans and AI models in object recognition tasks. By using human accuracy on distorted stimuli to calibrate difficulty levels, it identifies four regimes and reports that vision-language models (VLMs) are most consistently aligned with humans across near- and far-OOD, with convolutional neural networks (CNNs) showing better alignment than vision transformers (ViTs) in near-OOD and the reverse in far-OOD.

Significance. This work has potential significance for cognitive science and trustworthy AI by providing a calibrated way to compare model and human error patterns at matched difficulty levels. If the spectrum construction is robust, it could highlight important architecture-specific differences in alignment that standard OOD analyses miss, encouraging more nuanced evaluations of AI systems.

major comments (2)
  1. [Abstract] The abstract reports regime-dependent rankings but provides insufficient detail on how the OOD spectrum is constructed, including the specific human accuracy thresholds for the four regimes, the size of the stimulus set, and any statistical tests used to validate the regimes.
  2. [Framework] The assumption that human accuracy defines a model-independent OOD spectrum (as per the framework description) is load-bearing for the central claim; without evidence that the perceptual difficulty ordering is consistent across CNNs, ViTs, and VLMs, the reported reversal in alignment rankings between near-OOD and far-OOD could be due to architecture-specific sensitivities to the distortions rather than true alignment differences.
minor comments (2)
  1. Add error bars and report dataset sizes in all figures and tables presenting alignment metrics to improve clarity and reproducibility.
  2. Clarify the exact definitions of the error-alignment metrics used for comparisons, i.e. Error Consistency and Misclassification Agreement; one plausible reading of both is sketched after this list.
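
For context, Error Consistency has a published form (Geirhos et al., ref [13]): trial-by-trial agreement in correctness, corrected for the agreement expected from the two marginal accuracies alone. Misclassification Agreement is sketched under one plausible reading of the name; the defining work (ref [14]) may specify it differently.

    import numpy as np

    def error_consistency(correct_a, correct_b):
        """Cohen's-kappa-style EC: (c_obs - c_exp) / (1 - c_exp), where
        c_obs is the observed correctness agreement and c_exp is the
        agreement two independent observers with these accuracies would
        show by chance."""
        correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
        c_obs = np.mean(correct_a == correct_b)
        p_a, p_b = correct_a.mean(), correct_b.mean()
        c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
        return (c_obs - c_exp) / (1 - c_exp)

    def misclassification_agreement(pred_a, pred_b, truth):
        """Assumed reading of MA: among trials that both observers get
        wrong, the fraction on which they produce the same wrong label."""
        pred_a, pred_b, truth = map(np.asarray, (pred_a, pred_b, truth))
        both_wrong = (pred_a != truth) & (pred_b != truth)
        if not both_wrong.any():
            return float("nan")
        return float(np.mean(pred_a[both_wrong] == pred_b[both_wrong]))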

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their detailed and insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The abstract reports regime-dependent rankings but provides insufficient detail on how the OOD spectrum is constructed, including the specific human accuracy thresholds for the four regimes, the size of the stimulus set, and any statistical tests used to validate the regimes.

    Authors: We agree that the abstract should provide more detail on the OOD spectrum construction. In the revised manuscript, we will expand the abstract to summarize the human accuracy thresholds defining the four regimes, the total size of the stimulus set used for calibration, and the statistical approaches (including any tests for regime separation) employed to validate the spectrum. These additions will be kept concise while directing readers to the methods section for full details. revision: yes

  2. Referee: [Framework] The assumption that human accuracy defines a model-independent OOD spectrum (as per the framework description) is load-bearing for the central claim; without evidence that the perceptual difficulty ordering is consistent across CNNs, ViTs, and VLMs, the reported reversal in alignment rankings between near-OOD and far-OOD could be due to architecture-specific sensitivities to the distortions rather than true alignment differences.

    Authors: We thank the referee for raising this foundational concern. The OOD spectrum is defined solely from human accuracy on the stimuli, making the difficulty ordering and regime boundaries model-independent by construction; all models are evaluated against the same human-calibrated difficulty levels. The reversal in alignment rankings is therefore measured under this fixed human reference. To address potential concerns about architecture-specific sensitivities, we will add a supplementary analysis in the revision showing that accuracy for CNNs, ViTs, and VLMs decreases with increasing human-defined difficulty, supporting that the regimes reflect progressive challenge across architectures. This clarification and addition will strengthen the framework's justification without altering the core results. revision: partial
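
The promised supplementary analysis reduces to a monotonicity check, sketched below under an assumed data layout (per-family mean accuracies ordered by human-defined difficulty level; neither the layout nor any numbers are the authors').

    from scipy.stats import spearmanr

    def difficulty_monotonicity(acc_by_family):
        """acc_by_family: dict family -> accuracies ordered from near-OOD
        to far-OOD.  Strongly negative rank correlations support the
        rebuttal: every architecture finds the human-defined regimes
        progressively harder, even where alignment rankings differ."""
        out = {}
        for family, accs in acc_by_family.items():
            # Correlate difficulty rank (0, 1, 2, ...) with accuracy.
            rho, p = spearmanr(range(len(accs)), accs)
            out[family] = (rho, p)
        return out

For example, difficulty_monotonicity({"CNN": [0.92, 0.74, 0.51, 0.23]}) with made-up accuracies returns rho = -1.0 for that family.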

Circularity Check

0 steps flagged

No circularity in human-centred OOD definition

full rationale

The paper defines its OOD spectrum directly from measured human accuracy drops on a fixed collection of distorted stimuli relative to an undistorted reference, producing four perceptual-difficulty regimes as an external empirical input. Model error patterns are then compared to human error patterns at these fixed levels via separate alignment metrics. No step fits a parameter to model outputs and relabels it a prediction, invokes a self-citation as the sole justification for a uniqueness claim, or reduces the reported architecture rankings to a definitional identity. The derivation therefore remains self-contained against external human data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on treating human accuracy as the ground-truth measure of perceptual difficulty; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Human accuracy on distorted images is a valid proxy for perceptual difficulty, independent of any particular model.
    Used to construct the OOD spectrum and enable model-human comparisons.

pith-pipeline@v0.9.0 · 5592 in / 1209 out tokens · 48833 ms · 2026-05-15T15:10:30.124772+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 3 internal anchors

  [1] Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1(1):417–446, 2015.
  [2] Radoslaw M. Cichy and Daniel Kaiser. Deep neural networks as scientific models. Trends in Cognitive Sciences, 23(4):305–317, 2019.
  [3] Friedemann Pulvermüller, Rosario Tomasello, Malte R. Henningsen-Schomers, and Thomas Wennekers. Biological constraints on neural network models of cognitive function. Nature Reviews Neuroscience, 22(8):488–502, 2021.
  [4] Rishi Rajalingham, Elias B. Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J. DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33):7255–7269, 2018.
  [5] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Erin Grant, Iris Groen, Jascha Achterberg, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023.
  [6] Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. Human uncertainty makes classification more robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9617–9626, 2019.
  [7] Lukas Muttenthaler, Klaus Greff, Frieda Born, Bernhard Spitzer, Simon Kornblith, Michael C. Mozer, Klaus-Robert Müller, Thomas Unterthiner, and Andrew K. Lampinen. Aligning machine and human visual representations across abstraction levels. Nature, 647(8089):349–355, 2025.
  [8] Thomas Fel, Ivan F. Rodriguez Rodriguez, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. Advances in Neural Information Processing Systems, 35:9432–9446, 2022.
  [9] Aidan Boyd, Mohamed Trabelsi, Huseyin Uzunalioglu, and Dan Kushnir. Increasing interpretability of neural networks by approximating human visual saliency. arXiv preprint arXiv:2410.16115, 2024.
  [10] Drew Linsley, Pinyuan Feng, Thibaut Boissin, Alekh Karkada Ashok, Thomas Fel, Stephanie Olaiya, and Thomas Serre. Adversarial alignment: Breaking the trade-off between the strength of an attack and its relevance to human perception. arXiv preprint arXiv:2306.03229, 2023.
  [11] Blaine Hoak, Kunyang Li, and Patrick McDaniel. Alignment and adversarial robustness: Are more human-like models more secure? arXiv preprint arXiv:2502.12377, 2025.
  [12] Ali Borji and Laurent Itti. Human vs. computer in scene and object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 113–120, 2014.
  [13] Robert Geirhos, Kristof Meding, and Felix A. Wichmann. Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc.
  [14] Binxia Xu, Antonis Bikakis, Daniel F. O. Onah, Andreas Vlachidis, and Luke Dickens. Measuring error alignment for decision-making systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27731–27739, 2025.
  [15] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2018.
  [16] Nicholas Baker, Hongjing Lu, Gennady Erlikhman, and Philip J. Kellman. Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14(12):e1006613, 2018.
  [17] Colin Conwell, Jacob S. Prince, Kendrick N. Kay, George A. Alvarez, and Talia Konkle. A large-scale examination of inductive biases shaping high-level visual representation in brains and machines. Nature Communications, 15(1):9383, 2024.
  [18] Saeed R. Kheradpisheh, Masoud Ghodrati, Mohammad Ganjtabesh, and Timothée Masquelier. Humans and deep networks largely agree on which kinds of variation make object recognition harder. Frontiers in Computational Neuroscience, 10:92, 2016.
  [19] Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pages 1–7. IEEE, 2017.
  [20] Robert Geirhos, Carlos R. M. Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  [21] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
  [22] Sebastian Farquhar and Yarin Gal. What 'out-of-distribution' is and is not. In NeurIPS ML Safety Workshop, 2022.
  [23] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  [24] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
  [25] Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Partial success in closing the gap between human and machine vision. In Advances in Neural Information Processing Systems 34, 2021.
  [26] Samuel Dodge and Lina Karam. Understanding how image quality affects deep neural networks. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6. IEEE, 2016.
  [27] Felix A. Wichmann, David H. J. Janssen, Robert Geirhos, Guillermo Aguilar, Heiko H. Schütt, Marianne Maertens, and Matthias Bethge. Methods and measurements to compare men against machines. Electronic Imaging, 29:36–45, 2017.
  [28] Girik Malik, Dakarai Crowder, and Ennio Mingolla. Extreme image transformations affect humans and machines differently. Biological Cybernetics, 117(4):331–343, 2023.
  [29] Larry V. Hedges. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2):107–128, 1981.
  [30] Vishwakumara Kayargadde and Jean-Bernard Martens. Perceptual characterization of images degraded by blur and noise: model. Journal of the Optical Society of America A, 13(6):1178–1188, 1996.
  [31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  [32] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  [34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  [35] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are convolutional neural networks or transformers more like human vision? arXiv preprint arXiv:2105.07197, 2021.
  [36] Gary Lupyan and Emily J. Ward. Language can boost otherwise unseen objects into visual awareness. Proceedings of the National Academy of Sciences, 110(35):14196–14201, 2013.
  [37] Peter D. Weller, Milena Rabovsky, and Rasha Abdel Rahman. Semantic knowledge enhances conscious awareness of visual objects. Journal of Cognitive Neuroscience, 31(8):1216–1226, 2019.
  [38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  [39] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
  [40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
  [41] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  [42] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. MaxViT: Multi-axis vision transformer. In European Conference on Computer Vision, pages 459–479. Springer, 2022.
  [43] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  [44] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.