arxiv: 2606.05497 · v1 · pith:7DFX62YXnew · submitted 2026-06-03 · 💻 cs.LG

LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")

Pith reviewed 2026-06-28 06:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords vision-language models · cognitive development · benchmark · children · alignment · error distributions · reasoning tasks · multimodal models

The pith

Current vision-language models align only partially with children's cognitive abilities on reasoning and spatial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LEVANTE-bench, which applies six tasks from the LEVANTE network to vision-language models and compares their outputs to performance data from 1547 children aged 5-12 across three countries. It measures alignment at the levels of overall accuracy, tasks and items, and trial-by-trial error distributions. More capable models align better with children at task and item levels, yet matches to children's error patterns vary widely by task and sometimes favor smaller models for younger age groups. VLMs show particular difficulty on matrix reasoning and mental rotation, supporting the view that alignment with developing human cognition remains incomplete.

Core claim

Applying the six LEVANTE tasks to VLMs and contrasting results against children's data across ages and countries reveals heterogeneous alignment: task- and item-level matches improve with model scale, error distribution matches fluctuate by task and can favor smaller models for younger children, and even top VLMs underperform on matrix reasoning and mental rotation.

What carries the argument

LEVANTE-bench, a multi-scale evaluation that scores VLMs on accuracy, task/item alignment, and error distribution match against children's trial-level responses from the six LEVANTE tasks.
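
As a rough illustration of the error-distribution level, the sketch below computes a smoothed KL divergence between the share of children choosing each response option on one item and a model's response distribution over the same options. The direction of the divergence, the smoothing constant `eps`, and all counts are assumptions made for illustration; this page states only that a KL divergence between model and human response distributions is reported, with lower values indicating closer alignment.

```python
import math

def normalize(counts, eps=1e-6):
    """Turn raw option counts into probabilities, with additive smoothing
    so that options nobody chose do not produce log(0)."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions over the same options."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical response counts for one four-option item (option 0 is correct).
human_counts = [60, 25, 10, 5]     # children spread their errors over distractors
model_a_counts = [55, 30, 10, 5]   # less accurate, but errs like the children
model_b_counts = [98, 0, 1, 1]     # more accurate, but with a different error profile

human = normalize(human_counts)
kl_a = kl_divergence(human, normalize(model_a_counts))
kl_b = kl_divergence(human, normalize(model_b_counts))
print(f"model A: {kl_a:.4f} nats, model B: {kl_b:.4f} nats")
```

In this toy case the more accurate model B is further from the children's distribution than model A, which is the kind of dissociation between accuracy and error-pattern match that the multi-scale comparison is meant to expose.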

Load-bearing premise

The six LEVANTE tasks and their data validly measure children's cognitive development in a form that permits direct comparison to VLM outputs.

What would settle it

A result showing that the largest VLMs produce error distributions statistically identical to those of children on every task and age band would undermine the partial-alignment conclusion.

Figures

Figures reproduced from arXiv: 2606.05497 by Alvin Wei Ming Tan, David Cardinal, Laura Bravo-Sanchez, Michael C. Frank, Sunny Yu, Tania Lorido-Botran.

Figure 1: (A) Example items from the six tasks of LEVANTE-bench as presented to human partici…
Figure 2: VLM accuracies plotted by $\log_{10}$ parameters (estimated for commercial models). Error bars indicate bootstrapped 95% confidence intervals. Colors denote model families. Dotted lines indicate chance levels, which vary between tasks due to varying numbers of response options. Dashed lines show best fitting logistic regressions.
Figure 3: Correlation between model and human task accuracies plotted by (A) log…
Figure 4: Correlation between model accuracy and item easiness estimated from human performance.
Figure 5: Tasks showed markedly different patterns of trial-level alignment.
Figure 6: Comparison of model accuracies on English versus German/Spanish versions of our tasks.
Figure 7: $D_{\mathrm{KL}}$ between model and human response distributions plotted by $\log_{10}$ number of parameters for all tasks. Lower values indicate greater model–human alignment.
Figure 8: $D_{\mathrm{KL}}$ between model and human response distributions plotted by human ability bins for all tasks. Lower values indicate greater model–human alignment.
Figure 9: Accuracy deviation from overall task accuracy. Black dots and error bars indicate all-model…
Abstract

Given the inherently multimodal nature of human experience, vision-language models (VLMs) hold substantial promise for modeling human cognition as it grows and develops with experience. Realizing their potential requires tools for comparing VLMs with human cognitive development across tasks, ages, and populations. We present LEVANTE-bench, a benchmark based on tasks and data from the Learning Variability Network (LEVANTE), which distributes open-source tasks and data measuring children's cognition across languages and cultures. In LEVANTE-bench, we systematically assess VLMs on six tasks, comparing their alignment with children aged 5-12 ($N$ = 1547) across three countries. We compare models at multiple scales, assessing their overall accuracy, their task- and item-level alignment with children, and how well they match children's trial-level error distributions. Alignment was heterogeneous across scales: at the level of tasks and items, more capable models aligned better with humans. However, match to human error distributions varied widely across tasks, and for several tasks, smaller models matched younger children's errors better. In addition, even the best-performing VLMs struggled on matrix reasoning and mental rotation tasks. Thus, current VLM architectures align only partially with the cognitive abilities of children.
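
To make the item-level comparison concrete, here is a minimal sketch that correlates a model's per-item accuracy with human item easiness. It assumes easiness is approximated by the proportion of children answering each item correctly; the paper's own easiness estimate may differ (for example, it could be model-based), and all numbers below are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five items, ordered from easiest to hardest for children.
human_easiness = [0.95, 0.80, 0.60, 0.40, 0.20]  # proportion of children correct
model_accuracy = [1.00, 0.90, 0.50, 0.50, 0.10]  # proportion of model samples correct

r = pearson(human_easiness, model_accuracy)
print(f"item-level alignment r = {r:.3f}")
```

A high correlation here means the model finds the same items hard that children do, which is a separate question from whether its overall accuracy matches theirs.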

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LEVANTE-bench, a benchmark based on six tasks and data from the LEVANTE network measuring children's cognition (N=1547, ages 5-12 across three countries). It evaluates VLMs of varying scales on overall accuracy, task- and item-level alignment with children, and match to children's trial-level error distributions. Results indicate heterogeneous alignment: larger models align better at task/item levels, error-distribution matches vary (with smaller models sometimes closer to younger children on some tasks), and even top VLMs struggle on matrix reasoning and mental rotation. The central claim is that current VLM architectures align only partially with children's cognitive abilities.

Significance. If the task equivalence holds, this provides a multi-country, multi-age empirical benchmark for VLM cognitive modeling using real developmental data, identifying specific gaps (e.g., abstract reasoning) that could guide VLM improvements toward more human-like capabilities. The open-source basis and scale comparisons are strengths for reproducibility in the field.

major comments (2)
  1. [Methods / Experimental Setup] The methods description provides no details on VLM prompting strategies, visual input handling, response scoring, or controls for interface differences (e.g., tokenization and sampling vs. children's developmental experience). This is load-bearing for the partial-alignment claim, as the abstract's heterogeneous results (better task/item match for larger models, variable error matches) cannot be interpreted without evidence that the six LEVANTE tasks isolate equivalent constructs across humans and models.
  2. [Results] The results on heterogeneous alignment (task/item vs. error-distribution matches) report no statistical tests, error bars, or scale controls, making it impossible to assess whether observed patterns (e.g., smaller models matching younger children on some tasks) are reliable or confounded by model size and prompting. This directly affects the central claim in the abstract.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the six specific LEVANTE tasks to allow readers to immediately contextualize the matrix reasoning and mental rotation struggles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [Methods / Experimental Setup] The methods description provides no details on VLM prompting strategies, visual input handling, response scoring, or controls for interface differences (e.g., tokenization and sampling vs. children's developmental experience). This is load-bearing for the partial-alignment claim, as the abstract's heterogeneous results (better task/item match for larger models, variable error matches) cannot be interpreted without evidence that the six LEVANTE tasks isolate equivalent constructs across humans and models.

    Authors: We agree that the Methods section requires substantially more detail to support interpretation of the alignment results. In the revised manuscript we will add a dedicated subsection describing: (i) the exact prompting templates and few-shot examples used for each of the six tasks, (ii) image preprocessing, resolution, and encoding procedures, (iii) the deterministic response-scoring rules applied to model outputs, and (iv) any explicit attempts to equate interface conditions (e.g., presentation order, feedback absence). We will also reference the LEVANTE network's published validation studies that establish construct equivalence between the tasks as administered to children and the cognitive constructs they target. revision: yes

  2. Referee: [Results] The results on heterogeneous alignment (task/item vs. error-distribution matches) report no statistical tests, error bars, or scale controls, making it impossible to assess whether observed patterns (e.g., smaller models matching younger children on some tasks) are reliable or confounded by model size and prompting. This directly affects the central claim in the abstract.

    Authors: We accept that the absence of statistical quantification weakens the evidential basis for the reported patterns. In the revision we will add: (a) correlation coefficients and associated p-values (or permutation tests) for all task- and item-level alignment metrics, (b) bootstrap or analytic error bars on the alignment figures, and (c) supplementary analyses that regress alignment scores on model scale while controlling for prompting variations. These additions will allow readers to evaluate the reliability of the heterogeneous alignment findings, including the observation that smaller models sometimes better matched younger children's error distributions. revision: yes
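
The two kinds of quantification promised in this response can be sketched as follows: a percentile bootstrap interval for an accuracy, and a permutation test for a correlation. This is a generic illustration under assumed settings (resample counts, seeds, and data are hypothetical), not the authors' analysis code.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 trial outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y,
    obtained by shuffling y to break any pairing with x."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical data: 100 trials at 70% accuracy, and eight items whose
# model accuracy tracks human easiness exactly.
trials = [1] * 70 + [0] * 30
ci_low, ci_high = bootstrap_ci(trials)

easiness = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
model_acc = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
p_value = permutation_p_value(easiness, model_acc)
print(f"95% CI: [{ci_low:.2f}, {ci_high:.2f}], permutation p = {p_value:.4f}")
```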

Circularity Check

0 steps flagged

No circularity; purely empirical benchmark against external dataset

full rationale

The paper reports direct empirical comparisons of VLM outputs to children's trial-level responses on six LEVANTE tasks (N=1547 across countries), computing accuracy, item-level alignment, and error-distribution matches without any derivations, equations, fitted parameters, or predictions that reduce to the inputs by construction. No self-citations are load-bearing for the central claims, and the LEVANTE data source is described as open-source and external. The analysis contains no self-definitional steps, uniqueness theorems, or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark study with no free parameters, mathematical axioms, or invented entities; relies on standard assumptions that the chosen cognitive tasks measure relevant developmental constructs.

pith-pipeline@v0.9.1-grok · 5785 in / 1006 out tokens · 40030 ms · 2026-06-28T06:36:37.623497+00:00 · methodology


Appendix A: Prompt sensitivity analysis

We conducted a systematic exploration of prompt design across all six tasks and mu...

  1. CoT hurts small models. For sub-2B models, chain-of-thought instructions reduced both accuracy and parse rate on vocabulary (−23.5 pp), sentence understanding (−10.1 pp), and matrix reasoning (−7.6 pp). Extended reasoning consumed the token budget without producing a parseable answer.
  2. Prompt gains are model-specific. The expert system prompt that boosted InternVL 8B on matrix reasoning by +12.7 pp decreased InternVL 2B accuracy by −20.3 pp on the same task. Similarly, the describe-first strategy that helped Qwen 2B on sentence understanding (+13.1 pp) was not effective for the 0.8B variant.
  3. Mental rotation resists prompting. Across Qwen 0.8B, InternVL 2B, InternVL 8B, and three spatial fine-tuned models (SpaceThinker, SpaceOm, SpatialThinker), no prompt strategy reliably exceeded chance after controlling for position bias via answer-permutation debiasing. The apparent best result (62.7% via self-consistency on Qwen2.5-VL-3B) was not significant under a bias-aware null model (p≈0.183).
  4. Spatial fine-tuning does not help. Three models fine-tuned for spatial reasoning (SpaceThinker, SpaceOm, SpatialThinker-Oxford) all scored 59.0% with the baseline elimination prompt, identical to the position-biased ceiling, and dropped to 38.6–48.2% with their recommended paper prompts, which elicit longer reasoning chains.
  5. Stacking phases has diminishing or negative returns. Full-stack combinations often underperformed the best single phase. In sentence understanding (0.8B), combining all phases yielded 35.4% vs. 43.4% for the structural stack alone. In vocabulary, the full stack was the exception that improved over individual phases (+12.4 pp), driven by synergistic interac...
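
The answer-permutation debiasing mentioned above can be sketched as below. The exact procedure used in the paper is not given in this extract, so the cyclic-rotation scheme and the `choose_position` callable (a hypothetical stand-in for a model call that returns the index of the chosen slot) are assumptions for illustration.

```python
from collections import Counter

def debiased_votes(options, choose_position):
    """Present the options once per cyclic rotation and count votes by
    option identity rather than by screen position."""
    votes = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        position = choose_position(rotated)  # index into the rotated list
        votes[rotated[position]] += 1
    return votes

def always_first(rotated_options):
    """A purely position-biased responder: always picks the first slot."""
    return 0

def knows_answer(rotated_options):
    """A responder that tracks the correct option ("B") wherever it appears."""
    return rotated_options.index("B")

options = ["A", "B", "C", "D"]
biased = debiased_votes(options, always_first)
informed = debiased_votes(options, knows_answer)
print(dict(biased))    # position bias spreads evenly over the options
print(dict(informed))  # real knowledge concentrates on one option
```

Because each option occupies each slot exactly once, a responder driven only by position ends up with a flat vote profile, while a responder that actually identifies the answer concentrates its votes. This is one way an apparent above-chance score can be separated from position bias.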