pith. sign in

arxiv: 2605.19130 · v1 · pith:3VQFEKT7new · submitted 2026-05-18 · 💻 cs.LG · cs.AI· cs.CL· cs.CV

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.CV
keywords egocentric videovision-language modelslanguage groundinginfant developmentmultimodal learningbenchmarkweak alignmentcross-modal learning
0
0 comments X

The pith

Vision-language models succeed only with tightly aligned curated data and cannot exploit the weakly aligned egocentric videos that let humans learn language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why current vision-language models fall short of children's ability to ground language from limited and messy real-world input. It trains models on egocentric video datasets that vary in how closely visual and linguistic signals match, ranging from curated web collections to naturalistic infant and adult head-cam recordings. A central evaluation tool is Machine-DevBench, which creates test items directly from each model's training vocabulary across frequency bands to measure lexical and grammatical competence without the usual train-test gaps. Results demonstrate that standard VLM training depends on strong semantic alignment and ignores the sparse, misaligned signals that dominate everyday egocentric streams. The work ends by launching the EgoBabyVLM Challenge to encourage models that can learn grounded language the way infants do.

Core claim

Training VLMs on naturalistic egocentric videos with weak visual-linguistic alignment and evaluating them across multimodal grounding plus the Machine-DevBench suite shows that present paradigms require tight semantic matches in training data and fail to use the dominant weakly aligned signals present in the input regime where humans succeed.

What carries the argument

Machine-DevBench, a corpus-grounded test set that automatically generates lexical and grammatical items from the model's own training vocabulary across logarithmic frequency bins.

If this is right

  • VLMs will continue to underperform on real-world egocentric grounding tasks unless training methods change to handle weak alignment.
  • Evaluation benchmarks for multimodal models must incorporate naturalistic infant-style video streams to track genuine progress.
  • The EgoBabyVLM Challenge provides a concrete target for developing models that learn language grounding from sparse, misaligned input.
  • Current scaling of curated datasets will not close the gap with human-like robustness on wearable or embodied data streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embodied agents and wearable devices may require entirely new training objectives rather than simply more data.
  • Insights from this benchmark could inform robotics work that relies on first-person video for language understanding.
  • The results raise the possibility that architectural changes, not just data changes, are needed to capture weak cross-modal signals.

Load-bearing premise

Automatically generating Machine-DevBench items from the model's training vocabulary across logarithmic frequency bins fully eliminates train/eval mismatch and supplies a statistically powerful measure of lexical and grammatical competence.

What would settle it

A VLM trained only on naturalistic egocentric video that reaches or exceeds the language-grounding and Machine-DevBench scores of models trained on curated, tightly aligned data.

read the original abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the EgoBabyVLM benchmark and challenge for evaluating vision-language models on naturalistic egocentric video data from infant and adult head-cams. It trains VLMs on corpora with varying degrees of visual-linguistic semantic alignment (including weakly-aligned egocentric streams) and evaluates them on multimodal grounding tasks plus unimodal vision/language benchmarks. Central to the evaluation is the automatically generated Machine-DevBench, which samples from the model's training vocabulary across logarithmic frequency bins. The main claim is that current VLM paradigms depend on tightly curated aligned data and cannot exploit the sparse, weakly-aligned signals that dominate naturalistic egocentric input.

Significance. If the central empirical comparison holds after controlling for confounds, the work would usefully document a gap between standard VLM pretraining and the data regime in which human infants succeed, while supplying a new corpus-grounded benchmark and open challenge to encourage progress on grounded language learning from wearable video.

major comments (2)
  1. [§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.
  2. [§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.
minor comments (2)
  1. [Abstract / §3] The abstract states results are 'clear' but the methods section omits model architectures, optimizer settings, and statistical test details; adding these would improve reproducibility.
  2. [Figure 2] Figure captions for the benchmark overview could more explicitly label the frequency-bin sampling procedure and any filtering steps applied to generated items.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical controls and methodological transparency.

read point-by-point responses
  1. Referee: [§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.

    Authors: We agree that data scale is a potential confound that must be isolated from alignment strength. The original experiments compared training regimes that naturally differ in both alignment and volume, but did not include explicit token-matched subsampling. In the revised manuscript we have added these controls in §4: we subsample the larger web-scale corpora to match the exact token count of the egocentric sets and re-train the models under otherwise identical conditions. The revised results show that the performance gap attributable to weaker alignment persists even at matched scale, supporting the central claim while acknowledging the original limitation. revision: yes

  2. Referee: [§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.

    Authors: We appreciate the request for greater methodological detail. The original §5.1 described the overall procedure but omitted the precise implementation steps. The revised version now includes: (i) vocabulary extraction performed by running the model’s tokenizer over the entire training corpus and retaining the top 50k tokens; (ii) grammatical template validation via manual inspection of 500 generated sentences by two independent annotators (inter-annotator agreement 93 %); and (iii) a per-bin power analysis confirming at least 250 items per logarithmic frequency bin, yielding 95 % confidence intervals narrower than ±4 % accuracy. We also report that generation quality metrics remained stable across the model scales and domains tested. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper describes an empirical comparison of VLM training regimes on datasets with varying semantic alignment, followed by evaluation on a newly constructed benchmark (Machine-DevBench) generated from training vocabulary. No mathematical derivations, predictions, or first-principles results are claimed that reduce to fitted parameters or self-referential definitions. The central claim rests on experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The study is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work is an empirical benchmarking study rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5870 in / 1143 out tokens · 41633 ms · 2026-05-20T12:08:51.301002+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022. URL https://dl.acm.org/doi/10.5555/3600270.3601993

  2. [2]

    L ong T ail-swap: benchmarking language models' abilities on rare words

    Algayres, R., Saint-James, C.- \'E ., Luthra, M., Shen, J., Benchekroun, Y., Lin, D., Moritz, R., Pino, J., and Dupoux, E. L ong T ail-swap: benchmarking language models' abilities on rare words. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 11231--112...

  3. [3]

    Whisperx: Time-accurate speech transcription of long-form audio

    Bain, M., Huh, J., Han, T., and Zisserman, A. Whisperx: Time-accurate speech transcription of long-form audio. In Interspeech 2023, pp.\ 4489--4493, 2023. URL https://www.isca-archive.org/interspeech_2023/bain23_interspeech

  4. [4]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs . FLUX.2: Frontier Visual Intelligence . Blog post, 2025. URL https://bfl.ai/blog/flux-2

  5. [5]

    H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H

    Bolya, D., Huang, P.-Y., Sun, P., Cho, J. H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H. A., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.-W., Dollar, P., and Feichtenhofer, C. Perception encoder: The best visual embeddings are not at the output of the network. In The Thirty-ninth Annual Conference on Neural Informati...

  6. [6]

    Coco-stuff: Thing and stuff classes in context

    Caesar, H., Uijlings, J., and Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 1209--1218, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Caesar_COCO-Stuff_Thing_and_CVPR_2018_paper.pdf

  7. [7]

    R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D

    Chang, E., Huang, Z., Liao, Y., Bhavsar, S. R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D. P., Li, E., SCHEFFER, N., Kirmani, A., Damavandi, B., Wanga, R., Kumar, A., Patel, R., Moon, S., and Dong, X. L. Wear VQA : A visual question answering benchmark for wearables in egocentric authentic real-wor...

  8. [8]

    Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings

    Charlot, T., Kunze, T., Poli, M., Cristia, A., Dupoux, E., and Lavechin, M. Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings. arXiv preprint, 2025. URL https://arxiv.org/abs/2509.15001

  9. [9]

    Babyvision: Visual reasoning beyond language

    Chen, L., Xie, W., Liang, Y., He, H., Zhao, H., Yang, Z., Huang, Z., Wu, H., Lu, H., Bao, Y., et al. Babyvision: Visual reasoning beyond language. arXiv preprint, 2026 a . URL https://arxiv.org/abs/2601.06521

  10. [10]

    Egoplan-bench: Benchmarking multimodal large language models for human-level planning

    Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., and Liu, X. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. Int. J. Comput. Vision, 134 0 (3), February 2026 b . ISSN 0920-5691. URL https://doi.org/10.1007/s11263-025-02676-0

  11. [11]

    Egothink: Evaluating first-person perspective thinking capability of vision-language models

    Cheng, S., Guo, Z., Wu, J., Fang, K., Li, P., Liu, H., and Liu, Y. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 14291--14302, June 2024. URL https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_E...

  12. [12]

    H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H

    Cho, J. H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H. A., Sun, P., Huang, P.-Y., Bolya, D., Ravi, N., Jain, S., Stark, T., Moon, S., Damavandi, B., Lee, V., Westbury, A., Khan, S., Kraehenbuehl, P., Dollar, P., Torresani, L., Grauman, K., and Feichtenhofer, C. Per...

  13. [13]

    Imagenet: A large-scale hierarchical image database

    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. URL https://ieeexplore.ieee.org/document/5206848

  14. [14]

    BERT : Pre-training of deep bidirectional transformers for language understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

  15. [15]

    Sonar: Sentence-level multimodal and language-agnostic representations

    Duquenne, P.-A., Schwenk, H., and Sagot, B. Sonar: Sentence-level multimodal and language-agnostic representations. arXiv preprint, 2023. URL https://arxiv.org/abs/2308.11466

  16. [16]

    Depth map prediction from a single image using a multi-scale deep network

    Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp.\ 2366–2374, Cambridge, MA, USA, 2014. MIT Press. URL https://dl.acm.org/doi/10.5555/2969033.2969091

  17. [17]

    M., Jayaraman, S., and Smith, L

    Fausey, C. M., Jayaraman, S., and Smith, L. B. From faces to hands: Changing visual input in the first two years. Cognition, 152: 0 101--107, 2016. URL https://pubmed.ncbi.nlm.nih.gov/27043744/

  18. [18]

    Frank, M. C. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27 0 (11): 0 990--992, 2023. ISSN 1364-6613. URL https://www.sciencedirect.com/science/article/pii/S1364661323002036

  19. [19]

    C., Goodman, N

    Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. Using speakers' referential intentions to model early cross-situational word learning. Psychological science, 20 0 (5): 0 578--585, 2009

  20. [20]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team , Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A.,...

  21. [21]

    Development differentially sculpts receptive fields across early and high-level human visual cortex

    Gomez, J., Natu, V., Jeska, B., Barnett, M., and Grill-Spector, K. Development differentially sculpts receptive fields across early and high-level human visual cortex. Nature Communications, 9 0 (1): 0 788, 2018. ISSN 2041-1723. URL https://www.nature.com/articles/s41467-018-03166-3

  22. [22]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., et al. The llama 3 herd of models. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.21783

  23. [23]

    Temporal alignment networks for long-term video

    Han, T., Xie, W., and Zisserman, A. Temporal alignment networks for long-term video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2896--2906, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Temporal_Alignment_Networks_for_Long-Term_Video_CVPR_2022_paper.pdf

  24. [24]

    CLIPS core: A reference-free evaluation metric for image captioning

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPS core: A reference-free evaluation metric for image captioning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 7514--7528, Online and Punta Cana, Dominican Republic, 2021. Asso...

  25. [25]

    Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

    Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 24905--24916, 2025. URL https://openaccess.thecvf.com...

  26. [26]

    and Fei-Fei, L

    Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf

  27. [27]

    and Räsänen, O

    Khorrami, K. and Räsänen, O. A model of early word acquisition based on realistic-scale audiovisual naming events. Speech Communication, 167: 0 103169, 2025. ISSN 0167-6393. URL https://www.sciencedirect.com/science/article/pii/S0167639324001407

  28. [28]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), Proceedings of the 3rd International Conference on Learning Representations ( ICLR ) , San Diego, CA, USA, 2015. URL http://arxiv.org/abs/1412.6980

  29. [29]

    Lavechin, M., Sy, Y., Titeux, H., Bland \'o n, M. A. C., R \"a s \"a nen, O., Bredin, H., Dupoux, E., and Cristia, A. BabySLM : Language-acquisition-friendly benchmark of self-supervised spoken language models. In Proceedings of Interspeech 2023, pp.\ 4588--4592, 2023. URL https://www.isca-archive.org/interspeech_2023/lavechin23_interspeech.pdf

  30. [30]

    Gradient-based learning applied to document recognition

    Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998. URL https://ieeexplore.ieee.org/document/726791

  31. [31]

    Stacked cross attention for image-text matching

    Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/papers/Kuang-Huei_Lee_Stacked_Cross_Attention_ECCV_2018_paper.pdf

  32. [32]

    Multi-granularity correspondence learning from long-term noisy videos

    Lin, Y., Zhang, J., Huang, Z., Liu, J., zujie wen, and Peng, X. Multi-granularity correspondence learning from long-term noisy videos. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=9Cu8MRmhq2

  33. [33]

    Evaluating text-to-visual generation with image-to-text generation

    Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. Evaluating text-to-visual generation with image-to-text generation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX, pp.\ 366–384, Berlin, Heidelberg, 2024 b . Springer-Verlag. URL https://doi.org/...

  34. [34]

    Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw

  35. [35]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., and Lin, D. Mmbench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI, pp.\ 216–233, Berlin, Heidelberg, 2024. Springer-Verlag. I...

  36. [36]

    L., Sparks, R

    Long, B. L., Sparks, R. Z., Xiang, V., Stojanov, S., ZiYin, Keene, G., Tan, A. W. M., Feng, S. Y., Nag, A., Zhuang, C., Marchman, V. A., Yamins, D. L., and Frank, M. The babyview dataset: High-resolution egocentric videos of infants and young children s everyday experiences. In 8th Annual Conference on Cognitive Computational Neuroscience, 2025. URL https...

  37. [37]

    and Hutter, F

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations ( ICLR ) , New Orleans, LA, USA, 2019. OpenReview.net. URL https://openreview.net/forum?id=Bkg6RiCqY7

  38. [38]

    Openeqa: Embodied question answering in the era of foundation models

    Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, A., and Rajeswaran, A. Openeqa: Embodied question answering in the era of foundation mo...

  39. [39]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Mangalam, K., Akshulakov, R., and Malik, J. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=JVlWseddak

  40. [40]

    End-to-end learning of visual representations from uncurated instructional videos

    Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 9876--9886, 2020. URL https://ieeexplore.ieee.org/document/9157128

  41. [41]

    Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint, 2018. URL https://arxiv.org/abs/1807.03748

  42. [42]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINO v2: Learning robust visu...

  43. [43]

    Teaching clip to count to ten

    Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., and Dekel, T. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 3170--3180, October 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Paiss_Teaching_CLIP_to_Count_to_Ten_ICCV_2023_paper.pdf

  44. [44]

    fastabx: A library for efficient computation of abx discriminability

    Poli, M., Chemla, E., and Dupoux, E. fastabx: A library for efficient computation of abx discriminability. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.02692

  45. [45]

    W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of M...

  46. [46]

    Vision transformers for dense prediction

    Ranftl, R., Bochkovskiy, A., and Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 12179--12188, October 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Ranftl_Vision_Transformers_for_Dense_Prediction_ICCV_2021_paper.pdf

  47. [47]

    A., and Dorman, M

    Sharma, A., Nash, A. A., and Dorman, M. Cortical development, plasticity and re-organization in children with cochlear implants. Journal of Communication Disorders, 42 0 (4): 0 272--279, 2009. ISSN 0021-9924. URL https://www.sciencedirect.com/science/article/pii/S0021992409000306

  48. [48]

    Indoor segmentation and support inference from rgbd images

    Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C. (eds.), Computer Vision -- ECCV 2012, pp.\ 746--760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33715-4. URL https://link.springer.com/chapter/10.100...

  49. [49]

    and Gasser, M

    Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11 0 (1-2): 0 13--29, 2005

  50. [50]

    Sullivan, J., Mei, M., Perfors, A., Wojcik, E., and Frank, M. C. Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5: 0 20--29, 05 2021. ISSN 2470-2986. URL https://doi.org/10.1162/opmi_a_00039

  51. [51]

    Tan, A. W. M., Yu, C., Long, B. L., Ma, W. A., Murray, T., Silverman, R. D., Yeatman, J. D., and Frank, M. Devbench: A multimodal developmental benchmark for language learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=zogaeVpbaE

  52. [52]

    Tan, A. W. M., Yang, J., Sepuri, T., Aw, K. L., Sparks, R. Z., Yin, Z., Marchman, V. A., Frank, M. C., and Long, B. Assessing the alignment between infants' visual and linguistic experience using multimodal language models. arXiv preprint, 2025. URL https://arxiv.org/abs/2511.18824

  53. [53]

    D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K

    Veerabadran, V., Xiao, F., Kamra, N., Matias, P., Chen, J., Drooff, C., Roads, B. D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K. Benchmarking egocentric multimodal goal inference for assistive wearable agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,...

  54. [54]

    Position: will we run out of data? limits of llm scaling based on human-generated data

    Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. Position: will we run out of data? limits of llm scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024. URL https://dl.acm.org/doi/10.5555/3692070.3694094

  55. [55]

    Vong, W. K. and Lake, B. M. On the robustness of modeling grounded word learning through a child's egocentric input. arXiv preprint, 2025. URL https://arxiv.org/abs/2507.14749

  56. [56]

    K., Wang, W., Orhan, A

    Vong, W. K., Wang, W., Orhan, A. E., and Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science, 383 0 (6682): 0 504--511, 2024. URL https://www.science.org/doi/abs/10.1126/science.adi1374

  57. [57]

    Babyvlm: Data-efficient pretraining of vlms inspired by infant learning

    Wang, S., Chandra, A., Liu, A., Saligrama, V., and Gong, B. Babyvlm: Data-efficient pretraining of vlms inspired by infant learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 1380--1390, October 2025 a . URL https://openaccess.thecvf.com/content/ICCV2025/papers/Wang_BabyVLM_Data-Efficient_Pretraining_of_VLMs_I...

  58. [58]

    Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models

    Wang, S., Wang, W., Wang, Z., Whitton, M., Wakeham, M., Chandra, A., Huang, J., Zhu, P., Chen, H., Li, D., et al. Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models. arXiv preprint, 2025 b . URL https://arxiv.org/abs/2512.10932

  59. [59]

    Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BL i MP : The benchmark of linguistic minimal pairs for E nglish. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020. URL https://aclanthology.org/2020.tacl-1.25/

  60. [60]

    Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora

    Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., and Cotterell, R. Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., W...

  61. [61]

    Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating

    Wiles, O., Zhang, C., Albuquerque, I., Kajic, I., Wang, S., Bugliarello, E., Onoe, Y., Papalampidi, P., Ktena, I., Knutsen, C., Rashtchian, C., Nawalgaria, A., Pont-Tuset, J., and Nematzadeh, A. Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating. In The Thirteenth International Conference on Learning Representations, 202...

  62. [62]

    V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding

    Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 6787--6...

  63. [63]

    Altogether: Image captioning via re-aligning alt-text

    Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., and Feichtenhofer, C. Altogether: Image captioning via re-aligning alt-text. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

  64. [64]

    Demystifying CLIP data

    Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. Demystifying CLIP data. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=5BCFlnfE1g

  65. [65]

    Temp CLR : Temporal alignment representation with contrastive learning

    Yang, Y., Ma, J., Huang, S., Chen, L., Lin, X., Han, G., and Chang, S.-F. Temp CLR : Temporal alignment representation with contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CIFOsnhZvON

  66. [66]

    N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K

    Yiu, E., Qraitem, M., Majhi, A. N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K. Ki VA : Kid-inspired visual analogies for testing large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=vNATZfmY6R

  67. [67]

    and Ballard, D

    Yu, C. and Ballard, D. H. A unified model of early word learning: Integrating statistical and social cues. Neurocomput., 70 0 (13–15): 0 2149–2165, August 2007. ISSN 0925-2312. URL https://doi.org/10.1016/j.neucom.2006.01.034

  68. [68]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9556--9567, 2024. URL https://openaccess.thecvf.com/content/...

  69. [69]

    Lit: Zero-shot transfer with locked-image text tuning

    Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Zhai_LiT_Zero-Shot_Transfer_With_Locked-Imag...

  70. [70]

    Visual grounding helps learn word meanings in low-data regimes

    Zhuang, C., Fedorenko, E., and Andreas, J. Visual grounding helps learn word meanings in low-data regimes. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1311--1329, Mexico City, Mexic...