Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

Bihan Wen; Patrick Batsell; Satoshi Tsutsui

arXiv: 2512.18951 · v3 · pith:LHQKVD52new · submitted 2025-12-22 · cs.LG

Patrick Batsell, Satoshi Tsutsui, Bihan Wen

Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3

keywords: attribute discrimination, infant-scale models, vision-language models, color size texture, synthetic benchmark, CVCL, CLIP, grounding

The pith

Infant-trained vision models build strong visual size representations but perform poorly on color discrimination and text-based color grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a synthetic benchmark to test whether infant-scale vision-language models can discriminate fine-grained attributes such as color, size, and texture within the same object class, separate from basic category recognition. It evaluates models trained on limited infant-like data against web-scale and standard vision models using both pure image tests and combined text-image prompts. The results reveal a clear split: infant-scale models handle size well visually and match others on texture but fall short on color, while struggling to connect color words to images and showing only modest size grounding from text. In contrast, web-scale models excel at grounding color linguistically but show weaker visual size discrimination. This pattern matters because it clarifies what limited-data training captures about the attributes infants learn early and what remains missing.

Core claim

Infant-trained models such as CVCL and an infant DINO baseline form strong visual representations for size and discriminate texture comparably to other models, but perform poorly on visual color discrimination. In the text-vision setting with attribute-object prompts they struggle to ground color and show only modest size grounding. By contrast, web-trained vision-language models such as CLIP strongly ground color from text while exhibiting weaker visual size discrimination. These outcomes are measured on a controlled benchmark that applies synthetic rendering to vary color, size, and texture independently across 67 everyday object classes.

What carries the argument

A synthetic rendering benchmark that decouples color, size, and texture variations from object identity across 67 classes, evaluated via image-only prototype matching and text-vision grounding tests with attribute-object prompts.
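
As a rough illustration of the image-only prototype test, the sketch below follows the description in the paper excerpt quoted in the reference graph: a prototype is the re-normalized mean of unit-normalized embeddings that share a target attribute, with the query left out. The excerpt is truncated before the comparison step, so scoring by cosine similarity and picking the highest-scoring prototype is an assumption here, as are the function names and the two-dimensional toy embeddings, which stand in for real model features.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def prototype(embeddings, exclude_index=None):
    """Re-normalized mean of unit-normalized embeddings, optionally leaving one out."""
    kept = [normalize(e) for i, e in enumerate(embeddings) if i != exclude_index]
    dim = len(kept[0])
    mean = [sum(v[d] for v in kept) / len(kept) for d in range(dim)]
    return normalize(mean)


def cosine(a, b):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def prototype_predict(groups, query_label, query_index):
    """Score one held-out query against every attribute prototype.

    groups maps an attribute value (e.g. "green") to a list of embeddings of
    same-class images with that value. The query is excluded from its own
    group's prototype so it cannot match itself.
    """
    query = normalize(groups[query_label][query_index])
    scores = {}
    for label, embeddings in groups.items():
        skip = query_index if label == query_label else None
        scores[label] = cosine(query, prototype(embeddings, exclude_index=skip))
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for embeddings of green and blue balls (hypothetical values).
toy_groups = {
    "green": [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0]],
    "blue": [[0.1, 1.0], [0.0, 0.9], [0.2, 0.8]],
}
predicted, scores = prototype_predict(toy_groups, query_label="green", query_index=0)
print(predicted, scores)
```

Averaging the predicted-versus-true match over all queries in a condition would give a per-attribute accuracy; how the paper aggregates its scores is not stated in the text available here.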

If this is right

Infant-scale training produces visual features that capture size information more readily than color information.
Textual grounding of attributes depends strongly on the scale and source of training data.
Texture discrimination remains consistent across model scales and training regimes.
Synthetic attribute control isolates learning of individual properties without category confounds.
Web-scale data enables stronger language-to-visual mapping for color than limited infant-scale data does.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be applied to real-world image collections to check whether the color weakness holds outside synthetic conditions.
Training objectives focused on color statistics might close the gap for data-limited models without requiring web-scale data.
The observed split between visual size strength and color weakness may reflect differences in natural image statistics between small curated sets and internet data.
Similar controlled tests on additional attributes such as shape could map a fuller profile of what infant-scale models learn.

Load-bearing premise

The synthetic rendering procedure successfully decouples attribute values from object identity across the 67 classes so that performance differences reflect attribute discrimination rather than object recognition confounds.

What would settle it

Re-running the image-only prototype test and text-vision test on a set of real photographs with matched attribute variations would falsify the reported dissociation if infant-scale models then match or exceed web-scale models on color discrimination accuracy.

Original abstract

Infants learn not only object categories but also fine-grained visual attributes such as color, size, and texture from limited experience. Prior infant-scale vision-language models have mainly been evaluated on object recognition, leaving open whether they support within-class attribute discrimination. We introduce a controlled benchmark that varies color, size, and texture across 67 everyday object classes using synthetic rendering to decouple attribute values from object identity. We evaluate infant-trained models (CVCL and an infant-trained DINO baseline) against web-scale and ImageNet models (CLIP, SigLIP, ResNeXt) under two complementary settings: an image-only prototype test and a text-vision test with attribute-object prompts. We find a dissociation between visual and linguistic attribute information: infant-trained models form strong visual representations for size and discriminate texture comparably to other models, but perform poorly on visual color discrimination, and in the text-vision setting they struggle to ground color and show only modest size grounding. In contrast, web-trained vision-language models strongly ground color from text while exhibiting weaker visual size discrimination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new synthetic benchmark shows infant-scale models handle size and texture visually better than color, with the opposite pattern for web-scale models on text grounding.

The main point is that this work builds a controlled synthetic benchmark to test discrimination of three attributes (color, size, and texture) while holding object class fixed across 67 everyday items. It reports a clear split: infant-trained models like CVCL do well on visual size and hold their own on texture but fall short on color, and in text-vision prompts they show only modest grounding of size and struggle with color. Web-scale models like CLIP reverse that, grounding color strongly from text but discriminating size less well visually. The two test settings, image-only prototypes and attribute-object text prompts, make the comparison direct. That controlled decoupling via synthetic rendering is the actual new piece; prior infant-scale evaluations stayed mostly at category level. The setup gives a practical way to probe what these small models actually learn from limited data.

The reported patterns line up with the abstract's claims and seem internally consistent on the numbers given. The soft spot is the lack of an explicit check that the renders truly remove object-identity cues: no probe results or correlation tests are mentioned to rule out shape or lighting leaks that could drive the gaps. With only three attributes and a handful of models, the dissociation stays narrow, and the abstract leaves out error bars and exact statistical tests, so robustness is hard to judge from the summary alone.

This is useful for anyone building or evaluating small-scale vision-language systems aimed at more human-like learning. A reader focused on benchmarks or developmental constraints would find the tool and the split worth seeing. It deserves peer review because the benchmark construction is concrete and the evaluation design is straightforward, even if the current results need tighter validation on the decoupling step.

Referee Report

1 major / 2 minor

Summary. The paper introduces a controlled synthetic benchmark that varies color, size, and texture across 67 object classes to evaluate attribute discrimination in infant-scale vision-language models (CVCL and infant-trained DINO) versus web-scale models (CLIP, SigLIP, ResNeXt). It reports results from an image-only prototype test and a text-vision grounding test with attribute-object prompts, claiming a dissociation: infant models show strong visual representations for size and comparable texture discrimination but poor color discrimination and weak color grounding, while web-trained models exhibit strong text-based color grounding but weaker visual size discrimination.

Significance. If the benchmark successfully isolates the targeted attributes, the reported dissociation would offer valuable empirical evidence on how limited, infant-like training data shapes visual versus linguistic attribute representations compared to web-scale models, with potential implications for developmental modeling in vision-language systems.

major comments (1)

[Benchmark construction] Benchmark construction (synthetic rendering procedure): the central dissociation claim requires that performance differences reflect attribute discrimination rather than object-identity confounds. No validation is reported (e.g., a probe classifier recovering object class from attribute-varied renders, or checks for residual shape-attribute correlations or rendering artifacts across the 67 classes). Without this, the image-only prototype results and text-vision scores could be driven by unintended signals.

minor comments (2)

[Results] The abstract and results sections do not report exact metrics, error bars, or statistical tests used to establish the claimed differences between model families.
[Methods] Clarify the precise definition of 'prototype test' and 'grounding score' in the text-vision setting, including how prompts are constructed and how similarity is computed.
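
Pending the clarification requested in the second minor comment, one plausible reading of the text-vision test is sketched below: build an attribute-object prompt for each candidate attribute value, embed prompts and image in a shared space, and count the image as correctly grounded when the prompt carrying its true attribute value has the highest cosine similarity. The prompt template, the function names, and the toy embeddings are assumptions made for illustration; real embeddings would come from a model's text and image encoders, and the paper's actual scoring may differ.

```python
import math


def build_prompts(attribute_values, object_name):
    """Attribute-object prompts such as "red cup" (the template is an assumption)."""
    return {value: f"{value} {object_name}" for value in attribute_values}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground(image_embedding, prompt_embeddings):
    """Return the attribute value whose prompt embedding is closest to the image."""
    return max(
        prompt_embeddings,
        key=lambda value: cosine(image_embedding, prompt_embeddings[value]),
    )


def grounding_accuracy(samples, prompt_embeddings):
    """Fraction of (image_embedding, true_value) pairs grounded to the true value."""
    hits = sum(
        1 for embedding, true_value in samples
        if ground(embedding, prompt_embeddings) == true_value
    )
    return hits / len(samples)


prompts = build_prompts(["red", "blue"], "cup")

# Hypothetical text embeddings for the prompts above, keyed by attribute value.
toy_prompt_embeddings = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}

# Hypothetical image embeddings with their true attribute values.
# The last sample is a deliberate miss: a red cup that sits closer to "blue".
toy_samples = [([0.9, 0.1], "red"), ([0.2, 0.8], "blue"), ([0.4, 0.6], "red")]

accuracy = grounding_accuracy(toy_samples, toy_prompt_embeddings)
print(prompts, accuracy)
```

With two candidate values, chance is 0.5, so a score near that level would correspond to the "struggles to ground color" pattern reported for infant-trained models.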

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. We address the single major comment below and will revise the manuscript accordingly to strengthen the benchmark validation.

Referee: [Benchmark construction] Benchmark construction (synthetic rendering procedure): the central dissociation claim requires that performance differences reflect attribute discrimination rather than object-identity confounds. No validation is reported (e.g., a probe classifier recovering object class from attribute-varied renders, or checks for residual shape-attribute correlations or rendering artifacts across the 67 classes). Without this, the image-only prototype results and text-vision scores could be driven by unintended signals.

Authors: We agree that explicit validation of the synthetic benchmark is necessary to support the dissociation claims. The rendering pipeline was designed to hold object geometry fixed per class while independently varying color (via material albedo), size (via uniform scaling), and texture (via procedural material parameters) across the 67 classes, with the explicit goal of decoupling attributes from identity. However, we acknowledge that the original manuscript did not include quantitative checks such as a probe classifier for object-class recovery or explicit correlation analyses. In the revision we will add: (1) a linear probe trained on frozen visual features to recover object class from the attribute-varied renders, reporting accuracy well above chance to confirm identity preservation; (2) pairwise correlation statistics between rendered shape descriptors and each attribute dimension; and (3) qualitative inspection of a random sample of renders for visible artifacts. These additions will be placed in a new subsection of the Methods and referenced in the Results.

Revision: yes
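
The linear probe proposed in item (1) of the response could look roughly like the sketch below: a softmax-regression classifier fitted on frozen feature vectors and scored on held-out renders against the chance level of one over the number of classes. Everything concrete here is an assumption for illustration, including the function names, the learning rate and epoch count, and the two-dimensional toy features; a real check would use features extracted from the evaluated encoders over all 67 classes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(weights, biases, x):
    """Linear scores, one per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit softmax regression with full-batch gradient descent on frozen features."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits_for(weights, biases, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


def probe_accuracy(weights, biases, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = logits_for(weights, biases, x)
        correct += int(max(range(len(scores)), key=scores.__getitem__) == y)
    return correct / len(labels)


# Hypothetical frozen features for three object classes (two renders each).
train_x = [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0], [-0.1, 0.9], [-1.0, -0.9], [-0.9, -1.1]]
train_y = [0, 0, 1, 1, 2, 2]
# Held-out renders of the same classes with a different attribute setting.
test_x = [[0.8, 0.0], [0.0, 0.8], [-0.8, -0.8]]
test_y = [0, 1, 2]

weights, biases = train_linear_probe(train_x, train_y, num_classes=3)
held_out_accuracy = probe_accuracy(weights, biases, test_x, test_y)
chance = 1.0 / 3.0
print(held_out_accuracy, chance)
```

Splitting train and test by attribute setting, as the toy data does, is one way to confirm that class information survives the attribute manipulations; the authors do not say which split they intend to use.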

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent evaluations

The paper introduces a new synthetic benchmark and reports direct empirical comparisons of existing models (CVCL, DINO, CLIP, etc.) on attribute discrimination tasks. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Central claims rest on observed performance differences rather than any reduction to author-defined quantities or prior self-referential results. The decoupling assumption is an empirical premise open to external validation, not a definitional or fitted construct.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that synthetic images isolate attributes cleanly and that the chosen infant-scale models are representative; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Synthetic rendering decouples attribute values from object identity across the 67 classes.
Invoked to justify that performance differences measure attribute discrimination rather than object confounds.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1]

Benchmarking Attribute Discrimination in Infant-Scale Vision-Language Models

INTRODUCTION Infants demonstrate remarkable efficiency in learning to recognize not only object categories but also fine-grained visual attributes such as color, size, and texture within their first two years of life [1, 2, 3]. This developmental ability has inspired research in computer vision that seeks to model learning under similarly constraine...

[2]

Blue ball

BENCHMARK ATTRIBUTE DISCRIMINATION 2.1. Benchmark Design We construct a controlled benchmark that systematically varies three visual attributes—color, size, and texture—across multiple everyday object classes. These attributes were chosen because they are among the earliest perceptual features that infants reliably recognize and use to organize visual c...

[3]

Same-class, different color (SCDC)

[4]

Same-class, different size (SCDS)

[5]

red cup,

Same-class, different texture (SCDT) In order to evaluate attribute discrimination, each condition is tested under two complementary modes: Prototype tests (image-only): A prototype embedding is computed as the mean of unit-normalized images sharing a target attribute (e.g., a green ball), excluding the query. The mean is re-normalized, and the query is co...

[6]

Raw Classification Accuracy We first evaluate overall classification performance without isolating specific attributes

RESULTS 3.1. Raw Classification Accuracy We first evaluate overall classification performance without isolating specific attributes. This serves two purposes: (i) to verify whether CVCL, trained on naturalistic infant-scale data, can generalize to synthetic images in our benchmark, and (ii) to establish a baseline before moving to fine-grained attribute-l...

[7]

4: Per-class classification accuracy in text–vision mode

RELATION TO PRIOR WORK Our study connects two strands of prior work: developmental psychology on infant perception and computational models Fig. 4: Per-class classification accuracy in text–vision mode. CLIP achieves high performance across categories, whereas CVCL remains near chance. Fig. 5: Attribute discrimination in prototype (image-only) mode. CVCL ...

[8]

Applying this benchmark to CVCL and CLIP revealed distinct strengths and limitations

CONCLUSION We introduced a controlled benchmark for evaluating attribute discrimination in vision–language models, focusing on color, size, and texture within object categories. Applying this benchmark to CVCL and CLIP revealed distinct strengths and limitations. CVCL, despite being trained on a small infant-scale dataset, encoded robust size represen...

[9]

Color vision and hue categorization in young human infants

Marc H Bornstein, William Kessen, and Sally Weiskopf, “Color vision and hue categorization in young human infants,” Journal of Experimental Psychology: Human Perception and Performance, vol. 2, no. 1, pp. 115, 1976

[10]

Do infants show knowledge of the familiar size of everyday objects?

Özlem Sensoy, Jody C Culham, and Gudrun Schwarzer, “Do infants show knowledge of the familiar size of everyday objects?,” Journal of Experimental Child Psychology, vol. 195, pp. 104848, 2020

[11]

Development of contrast sensitivity in the human infant

Anthony M Norcia, Christopher W Tyler, and Russell D Hamer, “Development of contrast sensitivity in the human infant,” Vision Research, vol. 30, no. 10, pp. 1475–1486, 1990

[12]

Toddler-inspired visual object learning

Sven Bambach, David J. Crandall, Linda B. Smith, and Chen Yu, “Toddler-inspired visual object learning,” in Advances in Neural Information Processing Systems, 2018

[13]

Discovering hidden visual concepts beyond linguistic input in infant learning

Xueyi Ke, Satoshi Tsutsui, Yayun Zhang, and Bihan Wen, “Discovering hidden visual concepts beyond linguistic input in infant learning,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 4343–4352

[14]

Models trained on infant views are more predictive of infant visual cortex

Cliona O’Doherty, Aine T Dineen, Anna Truzzi, Graham King, Enna-Louise D’Arcy, Chiara Caldinelli, Tamrin Holloway, Eleanor Molloy, and Rhodri Cusack, “Models trained on infant views are more predictive of infant visual cortex,”

[15]

Curriculum learning with infant egocentric videos

Saber Sheybani, Himanshu Hansaria, Justin Wood, Linda Smith, and Zoran Tiganj, “Curriculum learning with infant egocentric videos,” Advances in Neural Information Processing Systems, 2023

[16]

Grounded language acquisition through the eyes and ears of a single child

Wai Keen Vong, Wentao Wang, A Emin Orhan, and Brenden M Lake, “Grounded language acquisition through the eyes and ears of a single child,” Science, vol. 383, no. 6682, pp. 504–511, 2024

[17]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021

[18]

Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective

Jessica Sullivan, Michelle Mei, Andrew Perfors, Erica Wojcik, and Michael C Frank, “Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective,” Open Mind, vol. 5, pp. 20–29, 2021

[19]

Infants’ use of featural information in the segregation of stationary objects

Amy Needham, “Infants’ use of featural information in the segregation of stationary objects,” Infant Behavior and Development, vol. 21, no. 1, pp. 47–76, 1998

[20]

Conceptual distinctiveness supports detailed visual long-term memory for real-world objects

Talia Konkle, Timothy F Brady, George A Alvarez, and Aude Oliva, “Conceptual distinctiveness supports detailed visual long-term memory for real-world objects,” Journal of Experimental Psychology: General, vol. 139, no. 3, pp. 558, 2010

[21]

Learning to detect unseen object classes by between-class attribute transfer

Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR, 2009

[22]

The caltech-ucsd birds-200-2011 dataset

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011

[23]

Object individuation: Infants’ use of shape, size, pattern, and color

Teresa Wilcox, “Object individuation: Infants’ use of shape, size, pattern, and color,” Cognition, vol. 72, no. 2, pp. 125–166, 1999

[24]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al., “Omnigen2: Exploration to advanced multimodal generation,” arXiv preprint arXiv:2506.18871, 2025

[25]

Visual segmentation of oriented textures by infants

Janette Atkinson and Oliver Braddick, “Visual segmentation of oriented textures by infants,” Behavioural Brain Research, vol. 49, no. 1, pp. 123–131, 1992

[26]

Prototypical networks for few-shot learning

Jake Snell, Kevin Swersky, and Richard Zemel, “Prototypical networks for few-shot learning,” Advances in Neural Information Processing Systems, 2017

[27]

Development of perceptual organization in infancy

Paul C. Quinn and Ramesh S. Bhatt, “Development of perceptual organization in infancy,” in The Oxford Handbook of Perceptual Organization, Johan Wagemans, Ed., pp. 685–706. Oxford University Press, 2015

[28]

Words as invitations to form categories: Evidence from 12- to 13-month-old infants

Sandra R Waxman and Dana B Markow, “Words as invitations to form categories: Evidence from 12- to 13-month-old infants,” Cognitive Psychology, vol. 29, no. 3, pp. 257–302, 1995

[29]

Infants rapidly learn word-referent mappings via cross-situational statistics

Linda Smith and Chen Yu, “Infants rapidly learn word-referent mappings via cross-situational statistics,” Cognition, vol. 106, no. 3, pp. 1558–1568, 2008

[30]

Aggregated residual transformations for deep neural networks

Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017, pp. 1492–1500

[31]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016
