Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür; Benno Krojer; Gaurav Kamath; Joyce Chai; Siva Reddy

arxiv: 2606.05409 · v3 · pith:WVDQEUGYnew · submitted 2026-06-03 · 💻 cs.CV · cs.CL

Pith reviewed 2026-06-28 06:34 UTC · model grok-4.3

keywords: vision-language models, novel concept learning, in-context learning, visual perturbations, human-model comparison, generalization, NVRD

The pith

Vision-language models struggle to acquire novel visual concepts in context when those concepts contradict prior knowledge, and they overgeneralize more than humans do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Novel Visual References Dataset (NVRD) with 19,176 images across 90 new visual concepts and up to 20 perturbed versions each, all built from scratch to test in-context mapping of novel visuals to language. It evaluates several VLMs against 2,400 human judgments and finds that models have trouble learning these concepts when they clash with pre-training, even as both models and humans track visual changes in similar ways. Models extend the new labels to many more perturbed images than humans accept. This setup matters because it isolates how current systems handle genuinely new visual information that conflicts with what they already know, unlike tests on familiar objects.

Core claim

The authors claim that vision-language models struggle to acquire novel concepts in-context when they contradict prior knowledge, and while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject, as shown through evaluations on the NVRD dataset.

What carries the argument

The Novel Visual References Dataset (NVRD), a set of entirely novel visual concepts constructed from scratch with increasing perturbations to measure in-context acquisition and generalization boundaries.

If this is right

Models will underperform on acquiring labels for new objects that clash with pre-trained knowledge.
Sensitivity to visual perturbations will correlate between models and humans, yet models will accept a wider range of variants.
NVRD provides a benchmark for testing visual concept learning that avoids familiar objects.
In-context adaptation in VLMs faces limits when new information directly conflicts with existing representations.
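The second implication above (correlated sensitivity, but wider acceptance by models) can be illustrated with a rank correlation plus a mean rating gap. This is a minimal sketch in plain Python, not the authors' analysis: the rating values, the 1 to 5 scale, and all function names are hypothetical.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical mean ratings at five increasing perturbation levels.
human = [4.8, 4.1, 3.0, 2.2, 1.4]
model = [4.9, 4.6, 4.0, 3.7, 3.1]

rho = spearman(human, model)                       # agreement in ordering
gap = mean(m - h for m, h in zip(model, human))    # how much higher the model rates
print(f"spearman rho = {rho:.2f}, mean gap = {gap:.2f}")
```

In this toy data both series fall monotonically, so the rank correlation is perfect while the model's ratings stay higher throughout. That combination is the shape of the pattern the paper reports: similar sensitivity, different acceptance threshold.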

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training methods may need explicit ways to override or compartmentalize conflicting prior knowledge during in-context updates.
Real-world uses like robotics or interactive systems could face repeated failures when encountering truly new objects.
The gap might narrow if models incorporate uncertainty estimates that better match human rejection thresholds.
Extending the approach to video or multimodal sequences could reveal whether the overgeneralization pattern holds over time.

Load-bearing premise

The constructed concepts in NVRD genuinely contradict models' pre-training and the human judgments form a reliable baseline for comparison.

What would settle it

A test where models acquire the novel labels in context and restrict them to exactly the same perturbed images that humans accept would falsify the overgeneralization claim.
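The falsification test described above can be sketched as a comparison of accept/reject decisions per stimulus. This is an illustrative sketch only: the stimulus IDs, the boolean accept/reject encoding, and the function names are assumptions, not the paper's protocol.

```python
def overgeneralization_rate(human_accepts, model_accepts):
    """Share of human-rejected stimuli to which the model still applies the label."""
    rejected = [s for s, ok in human_accepts.items() if not ok]
    if not rejected:
        return 0.0
    return sum(model_accepts[s] for s in rejected) / len(rejected)

def matches_human_boundary(human_accepts, model_accepts):
    """True only if the model accepts exactly the stimuli that humans accept."""
    return all(model_accepts[s] == ok for s, ok in human_accepts.items())

# Hypothetical decisions for one novel concept at increasing perturbation levels.
human_accepts = {"dax_p01": True, "dax_p05": True, "dax_p10": False,
                 "dax_p15": False, "dax_p20": False}
model_accepts = {"dax_p01": True, "dax_p05": True, "dax_p10": True,
                 "dax_p15": True, "dax_p20": False}

rate = overgeneralization_rate(human_accepts, model_accepts)
same_boundary = matches_human_boundary(human_accepts, model_accepts)
print(f"overgeneralization rate = {rate:.2f}, same boundary = {same_boundary}")
```

Under this framing, the overgeneralization claim would be contradicted by a rate near zero together with a matching boundary across concepts.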

Figures

Figures reproduced from arXiv: 2606.05409 by Ada Defne Tür, Benno Krojer, Gaurav Kamath, Joyce Chai, Siva Reddy.

**Figure 2.** Overview of the creation pipeline for NVRD. On the left-most column, we present examples of the four …

**Figure 3.** Nonce vs. vanilla label responses across models and object categories. We find: Models adopt a nonce …

**Figure 4.** Model results on both the multi-image name generation and log probability settings across object …

**Figure 5.** Human and model ratings on the subset of perturbation types that show a clear degradation at strong …

**Figure 6.** Example base images from each of the four entity categories in NVRD.

**Figure 7.** Example perturbation sequences from NVRD. Each row shows an original base image and four increas…

**Figure 8.** Example prompt compositions used to generate novel entities. Each row shows a unique design …

**Figure 9.** Example trial human participants observed during our study. Participants see the original image (left) …

**Figure 10.** Sample of objects and perturbations from NVRD across the four object categories (Known, Shape …

**Figure 11.** Nonce vs. vanilla label responses across models and perturbation levels.

**Figure 12.** Nonce vs. vanilla label responses across models and perturbation types.

**Figure 13.** Model nonce reference usage across perturbation types and levels.

**Figure 14.** Model nonce z-scored log probabilities across perturbation types and levels. [Plot: panels Known, Shape-Texture, Shape-Shape, Novel; x-axis: Perturbation Level; y-axis: Z-Scored Log-Prob; models: Idefics3 8B, Molmo2 8B, Qwen2-VL 7B.]

**Figure 15.** Model nonce reference z-scored log probabilities across object categories and perturbation levels.

F.3 Likert Rating Results: Figures 16 and 17 show model Likert-scale ratings broken down by perturbation type and object category, respectively.

**Figure 16.** Model ratings across perturbation types and levels.

**Figure 17.** Model ratings across object categories and perturbation levels.

**Figure 18.** Human–model rating comparison across object categories and perturbation types.

**Figure 19.** Human–model rating comparison across perturbation types.

**Figure 20.** Scatterplot of human vs. model mean ratings across perturbation types and levels.

**Figure 21.** Scatterplot of human vs. model mean ratings across object categories and perturbation types.

**Figure 22.** Human–model rating bar plot comparison across object categories, perturbation types, and perturbation …

**Figure 23.** Model performance and ratings as a function of visual similarity between each original and perturbed …

**Figure 24.** Qwen-2 VL 7B performance across pool composition strategies: random, color similarity, and visual …

**Figure 25.** Qwen-2 VL 7B responses on the ablated “failure case” trials, by object category.

**Figure 26.** Heatmap of Qwen-2 VL 7B responses on the ablated “failure case” trials, by object category.
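Figures 14 and 15 report z-scored log probabilities. One plausible reading, sketched here under stated assumptions, is that each model's raw log probabilities are standardized so that models with different absolute scales can share an axis. The per-model grouping, the use of the population standard deviation, and the sample values are all assumptions; the page does not state how the authors computed the scores.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series carries no relative information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical log-probs of the nonce label at perturbation levels 5, 10, 15, 20.
log_probs = {
    "model_a": [-1.0, -2.0, -3.0, -4.0],
    "model_b": [-10.0, -12.0, -14.0, -16.0],
}

z_scores = {name: z_score(values) for name, values in log_probs.items()}
for name, values in z_scores.items():
    print(name, [round(v, 3) for v in values])
```

The two hypothetical models differ in absolute scale but decline at the same relative rate, so their standardized curves coincide. Standardization of this kind shows trends across perturbation levels, not absolute confidence.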

Original abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NVRD is a new benchmark for testing VLMs on in-context learning of contradictory novel visuals, but the core claim needs a check that the concepts are actually outside pre-training.

The main takeaway is that this paper ships a dataset of 90 concepts built from scratch, with 19,176 images and up to 20 perturbed versions each, plus 2,400 human judgments. It directly compares three open-source and two closed-source VLMs against people on learning labels that are supposed to clash with prior knowledge.

The work is useful because it moves past the usual augmentations of familiar objects and tries to isolate genuine in-context acquisition. The reported pattern that models overgeneralize more than humans on the perturbed versions is the kind of concrete difference worth having data on.

The soft spot is the missing verification step. The abstract says the stimuli are entirely novel and constructed from scratch, yet there is no reported zero-shot accuracy on the base images, no embedding-space distance to training distributions, and no human-model agreement check on the unperturbed originals. Without that, the "struggle when contradicting prior knowledge" result could just be ordinary recognition failure rather than failure to override existing knowledge.

The experimental details on how the perturbations were generated and how the human trials were run are not visible in the abstract, so it is hard to judge whether the 2400 judgments form a stable baseline. If the full paper has those controls and the verification, the contribution strengthens.

This is for people who build or evaluate VLMs and want a benchmark that targets concept acquisition rather than retrieval. It is worth sending to peer review because a cleaned-up version of the dataset and the human-model comparison would be usable even if the interpretation needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Novel Visual References Dataset (NVRD) comprising 19,176 images across 90 visual concepts, each with up to 20 perturbed versions, constructed as entirely novel stimuli. It evaluates three open-source and two closed-source VLMs against 2,400 human judgments on in-context learning of these concepts, reporting that models struggle to acquire novel concepts contradicting prior knowledge and that models overgeneralize relative to humans despite correlated sensitivity to visual perturbations.

Significance. If the results hold after verification that the concepts lie outside pre-training distributions, the contribution of NVRD as a benchmark would be useful for studying differences in visual concept acquisition between VLMs and humans.

major comments (1)

[Abstract] The central claim that models 'struggle to acquire novel concepts in-context when they contradict prior knowledge' requires that the 90 base concepts are outside the models' pre-training distribution. No verification is reported (zero-shot accuracy on unperturbed base images, nearest-neighbor distances in embedding space to LAION/ImageNet classes, or human-model agreement rates on the originals). Without this check, the observed effects cannot be distinguished from ordinary recognition failure.

minor comments (2)

The abstract states 'up to 20 increasingly perturbed versions' per concept but provides no details on the perturbation generation process, the exact number of versions per concept, or how perturbation levels were calibrated.
The number of human participants and the exact protocol for collecting the 2,400 judgments (e.g., trial structure, exclusion criteria) are not summarized.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and for highlighting an important point regarding verification of concept novelty. We address the major comment below and commit to revisions that strengthen the manuscript.

Referee: [Abstract] The central claim that models 'struggle to acquire novel concepts in-context when they contradict prior knowledge' requires that the 90 base concepts are outside the models' pre-training distribution. No verification is reported (zero-shot accuracy on unperturbed base images, nearest-neighbor distances in embedding space to LAION/ImageNet classes, or human-model agreement rates on the originals). Without this check, the observed effects cannot be distinguished from ordinary recognition failure.

Authors: We agree that explicit verification is necessary to distinguish in-context acquisition difficulties from simple recognition failures on the base concepts. The manuscript describes the 90 concepts as 'entirely novel stimuli constructed from scratch' and contrasts them with prior work on augmentations of familiar concepts, but does not report the specific checks suggested (zero-shot accuracy on unperturbed images, embedding distances to LAION/ImageNet, or human-model agreement on the originals). We will add these analyses to the revised version, including zero-shot VLM performance on the base images and nearest-neighbor analyses where computationally feasible, to directly support the central claim.

Revision: yes
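The nearest-neighbor check mentioned in this exchange can be sketched as a cosine-distance lookup against known-class prototypes. This is a toy illustration under stated assumptions: the 3-dimensional vectors stand in for real image-encoder embeddings, and the class names and function names are hypothetical. It does not describe the analysis the authors would run.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_known_class(embedding, class_prototypes):
    """Return (class_name, distance) for the closest known-class prototype."""
    return min(
        ((name, cosine_distance(embedding, proto))
         for name, proto in class_prototypes.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical prototypes for familiar classes and two query images.
prototypes = {"mug": [1.0, 0.0, 0.0], "chair": [0.0, 1.0, 0.0]}
familiar_image = [0.9, 0.1, 0.0]
novel_image = [0.1, 0.1, 1.0]

familiar_name, familiar_dist = nearest_known_class(familiar_image, prototypes)
novel_name, novel_dist = nearest_known_class(novel_image, prototypes)
print(f"familiar: {familiar_name} at {familiar_dist:.3f}")
print(f"novel: nearest known class at {novel_dist:.3f}")
```

A base image that sits far from every known prototype is weak evidence that the concept lies outside the reference classes. It would complement, not replace, the zero-shot accuracy check the referee also requests.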

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and evaluation with no derivations or self-referential fits

The paper presents an empirical benchmark (NVRD) consisting of 90 novel visual concepts and perturbed images, evaluated via model inference and 2400 human judgments. No equations, parameter fits, uniqueness theorems, or derivations appear in the provided text. Claims of novelty rest on construction statements rather than any self-referential reduction (e.g., no fitted parameter is relabeled as a prediction, and no self-citation chain supports a load-bearing premise). The central findings on model overgeneralization are direct experimental outcomes, not tautological restatements of inputs. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No full text available; cannot identify free parameters, axioms, or invented entities from the abstract alone.


Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition under Occlusion , author=. International Journal of Computer Vision , volume=. 2021 , doi=

2021

[18] [18]

Cognitive Psychology , volume=

Priming contour-deleted images: Evidence for intermediate representations in visual object recognition , author=. Cognitive Psychology , volume=. 1991 , doi=

1991

[19] [19]

International Conference on Learning Representations , year=

Can We Talk Models Into Seeing the World Differently? , author=. International Conference on Learning Representations , year=

[20] [20]

2023 , doi=

Ma, Zixian and Hong, Jerry and Gul, Mustafa Omer and Gandhi, Mona and Gao, Irena and Krishna, Ranjay , booktitle=. 2023 , doi=

2023

[21] [21]

Proceedings of the Royal Society of London

Representation and recognition of the spatial organization of three-dimensional shapes , author=. Proceedings of the Royal Society of London. Series B, Biological Sciences , volume=. 1978 , doi=

1978

[22] [22]

ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models , url =

Barbu, Andrei and Mayo, David and Alverio, Julian and Luo, William and Wang, Christopher and Gutfreund, Dan and Tenenbaum, Josh and Katz, Boris , booktitle =. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models , url =

[23] [23]

International Conference on Learning Representations , year=

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations , author=. International Conference on Learning Representations , year=

[24] [24]

, booktitle=

Singh, Bharat and Davis, Larry S. , booktitle=. An Analysis of Scale Invariance in Object Detection --. 2018 , doi=

2018

[25] [25]

and Ecker, Alexander S

Gatys, Leon A. and Ecker, Alexander S. and Bethge, Matthias , booktitle=. Image Style Transfer Using Convolutional Neural Networks , year=

[26] [26]

European Conference on Computer Vision , pages=

Recognition in Terra Incognita , author=. European Conference on Computer Vision , pages=. 2018 , doi=

2018

[27] [27]

International Conference on Learning Representations , year=

Noise or Signal: The Role of Image Backgrounds in Object Recognition , author=. International Conference on Learning Representations , year=

[28] [28]

and Presnell, Lynn , year =

Tanaka, J. and Presnell, Lynn , year =. Color diagnosticity in object recognition , volume =. Percept. Psychophys. , doi =

[29] [29]

Advances in Neural Information Processing Systems , volume=

Multimodal Few-Shot Learning with Frozen Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2021 , url=

2021

[30] [30]

and Yu, Chen , title =

Smith, Linda B. and Yu, Chen , title =. Cognition , year =

[31] [31]

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ma, Ziqiao and Pan, Jiayi and Chai, Joyce. World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.31

work page doi:10.18653/v1/2023.acl-long.31 2023

[32] [32]

2025 , eprint=

Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models , author=. 2025 , eprint=

2025

[33] [33]

2024 , eprint=

Visually Grounded Speech Models have a Mutual Exclusivity Bias , author=. 2024 , eprint=

2024

[34] [34]

1989 , publisher=

Categorization and Naming in Children: Problems of Induction , author=. 1989 , publisher=

1989

[35] [35]

2025 , eprint=

Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict , author=. 2025 , eprint=

2025

[36] [36]

2023 , eprint=

Debiasing Vision-Language Models via Biased Prompts , author=. 2023 , eprint=

2023

[37] [37]

International Conference on Learning Representations , year=

Grounded Language Learning Fast and Slow , author=. International Conference on Learning Representations , year=

[38] [38]

2018 , eprint=

Assessing Shape Bias Property of Convolutional Neural Networks , author=. 2018 , eprint=

2018

[39] [39]

Proceedings of the 25th Conference on Computational Natural Language Learning , pages=

The Emergence of the Shape Bias Results from Communicative Efficiency , author=. Proceedings of the 25th Conference on Computational Natural Language Learning , pages=. 2021 , publisher=

2021

[40] [40]

ImageNet-trained

Burgert, Tom and Stoll, Oliver and Rota, Paolo and Demir, Beg\". ImageNet-trained. Advances in Neural Information Processing Systems , year=

[41] [41]

Cognitive Psychology , volume=

Children’s use of mutual exclusivity to constrain the meanings of words , author=. Cognitive Psychology , volume=. 1988 , publisher=

1988

[42] [42]

and Brendel, Wieland , booktitle=

Geirhos, Robert and Rubisch, Patricia and Michaelis, Claudio and Bethge, Matthias and Wichmann, Felix A. and Brendel, Wieland , booktitle=. ImageNet-trained. 2019 , url=

2019

[43] [43]

2024 , eprint=

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models , author=. 2024 , eprint=

2024

[44] [44]

and Jones, Susan S

Barbara Landau and Smith, Linda B. and Jones, Susan S. The importance of shape in early lexical learning. Cognitive Development. 1988. doi:10.1016/0885-2014(88)90014-7

work page doi:10.1016/0885-2014(88)90014-7 1988

[45] [45]

Language Learning and Development , volume=

Dynamic noun generalization , author=. Language Learning and Development , volume=. 2007 , publisher=

2007

[46] [46]

1990 , issn =

Constraints children place on word meanings , journal =. 1990 , issn =. doi:https://doi.org/10.1016/0364-0213(90)90026-S , url =

work page doi:10.1016/0364-0213(90)90026-s 1990

[47] [47]

Young Children Extend Novel Words at the Basic Level: Evidence for the Principle of Categorical Scope , volume =

Golinkoff, Roberta and Shuff-Bailey, Margaret and Jaakkola, Kelly and Ruan, Wenjun , year =. Young Children Extend Novel Words at the Basic Level: Evidence for the Principle of Categorical Scope , volume =. Developmental Psychology , doi =

[48] [48]

PLoS Computational Biology , volume=

Deep convolutional networks do not classify based on global object shape , author=. PLoS Computational Biology , volume=. 2018 , doi=

2018

[49] [49]

Journal of Experimental Child Psychology , volume=

Clarifying the role of shape in children's taxonomic assumption , author=. Journal of Experimental Child Psychology , volume=. 1992 , doi=

1992

[50] [50]

Is the Acquisition of Basic-Colour Terms in Young Children Constrained? , volume =

Pitchford, Nicola and Mullen, Kathy , year =. Is the Acquisition of Basic-Colour Terms in Young Children Constrained? , volume =. Perception , doi =

[51] [51]

Proceedings of the Annual Meeting of the Cognitive Science Society , volume=

Signatures of Domain-General Categorization Mechanisms in ColorWord Learning , author=. Proceedings of the Annual Meeting of the Cognitive Science Society , volume=

[52] [52]

Waxman , title =

Sandra R. Waxman , title =. Psychology of Learning and Motivation , volume =. 1998 , publisher =

1998

[53] [53]

Principles that are invoked in the acquisition of words, but not facts , volume =

Waxman, Sandra and Booth, Amy , year =. Principles that are invoked in the acquisition of words, but not facts , volume =. Cognition , doi =

[54] [54]

Papers and Reports on Child Language Development , volume=

Acquiring a Single New Word , author=. Papers and Reports on Child Language Development , volume=. 1978 , month=

1978

[55] [55]

The innate mind: Foundations and the future , editor=

Rational statistical inference and cognitive development , author=. The innate mind: Foundations and the future , editor=. 2007 , publisher=

2007

[56] [56]

Cognition , volume=

A probabilistic model of theory formation , author=. Cognition , volume=. 2010 , publisher=

2010

[57] [57]

Behavioral and Brain Sciences , volume=

Building machines that learn and think like people , author=. Behavioral and Brain Sciences , volume=. 2017 , publisher=

2017

[58] [58]

Developmental Science , volume=

Core knowledge , author=. Developmental Science , volume=. 2007 , publisher=

2007

[59] [59]

2003 , publisher=

Constructing a Language: A Usage-Based Theory of Language Acquisition , author=. 2003 , publisher=

2003

[60] [60]

Infancy , volume=

Dynamic noun generalization: Moment-to-moment interactions shape children's naming biases , author=. Infancy , volume=. 2007 , publisher=

2007

[61] [61]

Approximating

Brendel, Wieland and Bethge, Matthias , booktitle =. Approximating. 2019 , url =

2019

[62] [62]

Advances in Neural Information Processing Systems , volume =

Partial success in closing the gap between human and machine vision , author =. Advances in Neural Information Processing Systems , volume =

[63] [63]

Science , volume =

Human-level concept learning through probabilistic program induction , author =. Science , volume =. 2015 , doi =

2015

[64] [64]

2022 , eprint=

Flamingo: a Visual Language Model for Few-Shot Learning , author=. 2022 , eprint=

2022

[65] [65]

Child Development , volume =

Word learning in children: An examination of fast mapping , author =. Child Development , volume =. 1987 , doi =

1987

[66] [66]

Advances in Neural Information Processing Systems , volume =

Matching Networks for One Shot Learning , author =. Advances in Neural Information Processing Systems , volume =

[67] [67]

International Conference on Machine Learning , pages=

Learning Transferable Visual Models from Natural Language Supervision , author=. International Conference on Machine Learning , pages=. 2021 , organization=

2021

[68] [68]

Advances in Neural Information Processing Systems , volume =

Prototypical Networks for Few-shot Learning , author =. Advances in Neural Information Processing Systems , volume =

[69] [69]

2023 , eprint=

Prompting is not a substitute for probability measurements in large language models , author=. 2023 , eprint=

2023

[70] [70]

Proceedings of the 34th International Conference on Machine Learning , pages =

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , publisher =

2017

[71] [71]

Advances in Neural Information Processing Systems , volume =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =

[72] [72]

Psychological Science , volume =

Rapid word learning under uncertainty via cross-situational statistics , author =. Psychological Science , volume =. 2007 , doi =

2007

[73] [73]

Psychological Science , volume =

Object name learning provides on-the-job training for attention , author =. Psychological Science , volume =. 2002 , doi =

2002

[74] [74]

2024 , eprint=

GPT-4o System Card , author=. 2024 , eprint=

2024

[75] [75]

2020 , eprint=

The Origins and Prevalence of Texture Bias in Convolutional Neural Networks , author=. 2020 , eprint=

2020

[76] [76]

2016 , eprint=

Understanding How Image Quality Affects Deep Neural Networks , author=. 2016 , eprint=

2016

[77] [77]

2020 , eprint=

Generalisation in humans and deep neural networks , author=. 2020 , eprint=

2020

[78] [78]

Proceedings of the National Academy of Sciences , volume =

Atoms of Recognition in Human and Computer Vision , author =. Proceedings of the National Academy of Sciences , volume =. 2016 , doi =

2016

[79] [79]

Peng Wang and Shuai Bai and Sinan Tan and Shijie Wang and Zhihao Fan and Jinze Bai and Keqin Chen and Xuejing Liu and Jialin Wang and Wenbin Ge and Yang Fan and Kai Dang and Mengfei Du and Xuancheng Ren and Rui Men and Dayiheng Liu and Chang Zhou and Jingren Zhou and Junyang Lin , journal =. Qwen2-. 2024 , url =

2024

[80] [80]

2024 , eprint=

Building and better understanding vision-language models: insights and future directions , author=. 2024 , eprint=

2024