EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Alvin W. M. Tan; Angel Villar Corrales; Bal\'azs K\'egl; Charles-\'Eric Saint-James; Dongyan Lin; Emmanuel Dupoux; Jiayi Shen; Juan Pino; Mahi Luthra; Manel Khentout

arxiv: 2605.19130 · v1 · pith:3VQFEKT7new · submitted 2026-05-18 · 💻 cs.LG · cs.AI· cs.CL· cs.CV

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin , Phillip Rust , Angel Villar Corrales , Alvin W. M. Tan , Mahi Luthra , Charles-\'Eric Saint-James , Rashel Moritz , Sheila Krogh-Jespersen

show 14 more authors

Vanessa Stark Surya Parimi Jiayi Shen Youssef Benchekroun Yosuke Higuchi Martin Gleize Tom Fizycki Nicolas Hamilakis Manel Khentout Sho Tsuji Bal\'azs K\'egl Juan Pino Michael C. Frank Emmanuel Dupoux

This is my paper

Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.CV

keywords egocentric videovision-language modelslanguage groundinginfant developmentmultimodal learningbenchmarkweak alignmentcross-modal learning

0 comments

The pith

Vision-language models succeed only with tightly aligned curated data and cannot exploit the weakly aligned egocentric videos that let humans learn language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why current vision-language models fall short of children's ability to ground language from limited and messy real-world input. It trains models on egocentric video datasets that vary in how closely visual and linguistic signals match, ranging from curated web collections to naturalistic infant and adult head-cam recordings. A central evaluation tool is Machine-DevBench, which creates test items directly from each model's training vocabulary across frequency bands to measure lexical and grammatical competence without the usual train-test gaps. Results demonstrate that standard VLM training depends on strong semantic alignment and ignores the sparse, misaligned signals that dominate everyday egocentric streams. The work ends by launching the EgoBabyVLM Challenge to encourage models that can learn grounded language the way infants do.

Core claim

Training VLMs on naturalistic egocentric videos with weak visual-linguistic alignment and evaluating them across multimodal grounding plus the Machine-DevBench suite shows that present paradigms require tight semantic matches in training data and fail to use the dominant weakly aligned signals present in the input regime where humans succeed.

What carries the argument

Machine-DevBench, a corpus-grounded test set that automatically generates lexical and grammatical items from the model's own training vocabulary across logarithmic frequency bins.

If this is right

VLMs will continue to underperform on real-world egocentric grounding tasks unless training methods change to handle weak alignment.
Evaluation benchmarks for multimodal models must incorporate naturalistic infant-style video streams to track genuine progress.
The EgoBabyVLM Challenge provides a concrete target for developing models that learn language grounding from sparse, misaligned input.
Current scaling of curated datasets will not close the gap with human-like robustness on wearable or embodied data streams.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embodied agents and wearable devices may require entirely new training objectives rather than simply more data.
Insights from this benchmark could inform robotics work that relies on first-person video for language understanding.
The results raise the possibility that architectural changes, not just data changes, are needed to capture weak cross-modal signals.

Load-bearing premise

Automatically generating Machine-DevBench items from the model's training vocabulary across logarithmic frequency bins fully eliminates train/eval mismatch and supplies a statistically powerful measure of lexical and grammatical competence.

What would settle it

A VLM trained only on naturalistic egocentric video that reaches or exceeds the language-grounding and Machine-DevBench scores of models trained on curated, tightly aligned data.

read the original abstract

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The new Machine-DevBench benchmark is the clearest addition here, but the central claim about alignment versus weak signals looks open to a data-scale explanation.

read the letter

The paper's useful move is building Machine-DevBench directly from the model's own training vocabulary, split by frequency, so the test items actually match what the model saw. That avoids the usual problem where developmental benchmarks test words the model never encountered. They pair it with the EgoBabyVLM Challenge and run comparisons across training sets that differ in how tightly vision and language line up, including real infant head-cam footage. That setup directly targets the gap between curated web data and the noisy, sparse streams that embodied systems or children actually get. The results line up with the idea that current VLMs lean hard on strong alignments and do not pick up much from weaker ones. That observation is worth having on record. The main soft spot is scale. Infant and adult egocentric corpora are typically much smaller than the web-scale sets used for standard VLM pretraining. Without explicit token-matched or hour-matched ablations, any performance gap could come from simply having less data rather than from an inability to use weak cross-modal correlations. The abstract does not detail statistical controls or hyperparameter matching, so it is difficult to separate those factors. This work is aimed at people building models for robotics or developmental learning who need evaluation tools that reflect real sensory streams. A reader already thinking about how to move VLMs past curated data would find the benchmark and challenge concrete enough to try. The paper deserves a serious referee because the benchmark is a new, reproducible artifact and the question matters, even though the experiments would benefit from tighter controls on data volume.

Referee Report

2 major / 2 minor

Summary. The paper introduces the EgoBabyVLM benchmark and challenge for evaluating vision-language models on naturalistic egocentric video data from infant and adult head-cams. It trains VLMs on corpora with varying degrees of visual-linguistic semantic alignment (including weakly-aligned egocentric streams) and evaluates them on multimodal grounding tasks plus unimodal vision/language benchmarks. Central to the evaluation is the automatically generated Machine-DevBench, which samples from the model's training vocabulary across logarithmic frequency bins. The main claim is that current VLM paradigms depend on tightly curated aligned data and cannot exploit the sparse, weakly-aligned signals that dominate naturalistic egocentric input.

Significance. If the central empirical comparison holds after controlling for confounds, the work would usefully document a gap between standard VLM pretraining and the data regime in which human infants succeed, while supplying a new corpus-grounded benchmark and open challenge to encourage progress on grounded language learning from wearable video.

major comments (2)

[§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.
[§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.

minor comments (2)

[Abstract / §3] The abstract states results are 'clear' but the methods section omits model architectures, optimizer settings, and statistical test details; adding these would improve reproducibility.
[Figure 2] Figure captions for the benchmark overview could more explicitly label the frequency-bin sampling procedure and any filtering steps applied to generated items.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical controls and methodological transparency.

read point-by-point responses

Referee: [§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.

Authors: We agree that data scale is a potential confound that must be isolated from alignment strength. The original experiments compared training regimes that naturally differ in both alignment and volume, but did not include explicit token-matched subsampling. In the revised manuscript we have added these controls in §4: we subsample the larger web-scale corpora to match the exact token count of the egocentric sets and re-train the models under otherwise identical conditions. The revised results show that the performance gap attributable to weaker alignment persists even at matched scale, supporting the central claim while acknowledging the original limitation. revision: yes
Referee: [§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.

Authors: We appreciate the request for greater methodological detail. The original §5.1 described the overall procedure but omitted the precise implementation steps. The revised version now includes: (i) vocabulary extraction performed by running the model’s tokenizer over the entire training corpus and retaining the top 50k tokens; (ii) grammatical template validation via manual inspection of 500 generated sentences by two independent annotators (inter-annotator agreement 93 %); and (iii) a per-bin power analysis confirming at least 250 items per logarithmic frequency bin, yielding 95 % confidence intervals narrower than ±4 % accuracy. We also report that generation quality metrics remained stable across the model scales and domains tested. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper describes an empirical comparison of VLM training regimes on datasets with varying semantic alignment, followed by evaluation on a newly constructed benchmark (Machine-DevBench) generated from training vocabulary. No mathematical derivations, predictions, or first-principles results are claimed that reduce to fitted parameters or self-referential definitions. The central claim rests on experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The study is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work is an empirical benchmarking study rather than a theoretical derivation.

pith-pipeline@v0.9.0 · 5870 in / 1143 out tokens · 41633 ms · 2026-05-20T12:08:51.301002+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Machine-DevBench... automatically generated from the model's training vocabulary across logarithmic frequency bins

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 4 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022. URL https://dl.acm.org/doi/10.5555/3600270.3601993

work page doi:10.5555/3600270.3601993 2022
[2]

L ong T ail-swap: benchmarking language models' abilities on rare words

Algayres, R., Saint-James, C.- \'E ., Luthra, M., Shen, J., Benchekroun, Y., Lin, D., Moritz, R., Pino, J., and Dupoux, E. L ong T ail-swap: benchmarking language models' abilities on rare words. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 11231--112...

work page 2025
[3]

Whisperx: Time-accurate speech transcription of long-form audio

Bain, M., Huh, J., Han, T., and Zisserman, A. Whisperx: Time-accurate speech transcription of long-form audio. In Interspeech 2023, pp.\ 4489--4493, 2023. URL https://www.isca-archive.org/interspeech_2023/bain23_interspeech

work page 2023
[4]

FLUX.2: Frontier Visual Intelligence

Black Forest Labs . FLUX.2: Frontier Visual Intelligence . Blog post, 2025. URL https://bfl.ai/blog/flux-2

work page 2025
[5]

H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H

Bolya, D., Huang, P.-Y., Sun, P., Cho, J. H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H. A., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.-W., Dollar, P., and Feichtenhofer, C. Perception encoder: The best visual embeddings are not at the output of the network. In The Thirty-ninth Annual Conference on Neural Informati...

work page 2025
[6]

Coco-stuff: Thing and stuff classes in context

Caesar, H., Uijlings, J., and Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 1209--1218, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Caesar_COCO-Stuff_Thing_and_CVPR_2018_paper.pdf

work page 2018
[7]

R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D

Chang, E., Huang, Z., Liao, Y., Bhavsar, S. R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D. P., Li, E., SCHEFFER, N., Kirmani, A., Damavandi, B., Wanga, R., Kumar, A., Patel, R., Moon, S., and Dong, X. L. Wear VQA : A visual question answering benchmark for wearables in egocentric authentic real-wor...

work page 2026
[8]

Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings

Charlot, T., Kunze, T., Poli, M., Cristia, A., Dupoux, E., and Lavechin, M. Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings. arXiv preprint, 2025. URL https://arxiv.org/abs/2509.15001

work page arXiv 2025
[9]

Babyvision: Visual reasoning beyond language

Chen, L., Xie, W., Liang, Y., He, H., Zhao, H., Yang, Z., Huang, Z., Wu, H., Lu, H., Bao, Y., et al. Babyvision: Visual reasoning beyond language. arXiv preprint, 2026 a . URL https://arxiv.org/abs/2601.06521

work page arXiv 2026
[10]

Egoplan-bench: Benchmarking multimodal large language models for human-level planning

Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., and Liu, X. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. Int. J. Comput. Vision, 134 0 (3), February 2026 b . ISSN 0920-5691. URL https://doi.org/10.1007/s11263-025-02676-0

work page doi:10.1007/s11263-025-02676-0 2026
[11]

Egothink: Evaluating first-person perspective thinking capability of vision-language models

Cheng, S., Guo, Z., Wu, J., Fang, K., Li, P., Liu, H., and Liu, Y. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 14291--14302, June 2024. URL https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_E...

work page 2024
[12]

H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H

Cho, J. H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H. A., Sun, P., Huang, P.-Y., Bolya, D., Ravi, N., Jain, S., Stark, T., Moon, S., Damavandi, B., Lee, V., Westbury, A., Khan, S., Kraehenbuehl, P., Dollar, P., Torresani, L., Grauman, K., and Feichtenhofer, C. Per...

work page 2025
[13]

Imagenet: A large-scale hierarchical image database

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. URL https://ieeexplore.ieee.org/document/5206848

work page arXiv 2009
[14]

BERT : Pre-training of deep bidirectional transformers for language understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

work page 2019
[15]

Sonar: Sentence-level multimodal and language-agnostic representations

Duquenne, P.-A., Schwenk, H., and Sagot, B. Sonar: Sentence-level multimodal and language-agnostic representations. arXiv preprint, 2023. URL https://arxiv.org/abs/2308.11466

work page arXiv 2023
[16]

Depth map prediction from a single image using a multi-scale deep network

Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp.\ 2366–2374, Cambridge, MA, USA, 2014. MIT Press. URL https://dl.acm.org/doi/10.5555/2969033.2969091

work page doi:10.5555/2969033.2969091 2014
[17]

M., Jayaraman, S., and Smith, L

Fausey, C. M., Jayaraman, S., and Smith, L. B. From faces to hands: Changing visual input in the first two years. Cognition, 152: 0 101--107, 2016. URL https://pubmed.ncbi.nlm.nih.gov/27043744/

work page arXiv 2016
[18]

Frank, M. C. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27 0 (11): 0 990--992, 2023. ISSN 1364-6613. URL https://www.sciencedirect.com/science/article/pii/S1364661323002036

work page 2023
[19]

C., Goodman, N

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. Using speakers' referential intentions to model early cross-situational word learning. Psychological science, 20 0 (5): 0 578--585, 2009

work page 2009
[20]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team , Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A.,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Development differentially sculpts receptive fields across early and high-level human visual cortex

Gomez, J., Natu, V., Jeska, B., Barnett, M., and Grill-Spector, K. Development differentially sculpts receptive fields across early and high-level human visual cortex. Nature Communications, 9 0 (1): 0 788, 2018. ISSN 2041-1723. URL https://www.nature.com/articles/s41467-018-03166-3

work page 2018
[22]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., et al. The llama 3 herd of models. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Temporal alignment networks for long-term video

Han, T., Xie, W., and Zisserman, A. Temporal alignment networks for long-term video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2896--2906, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Temporal_Alignment_Networks_for_Long-Term_Video_CVPR_2022_paper.pdf

work page 2022
[24]

CLIPS core: A reference-free evaluation metric for image captioning

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPS core: A reference-free evaluation metric for image captioning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 7514--7528, Online and Punta Cana, Dominican Republic, 2021. Asso...

work page 2021
[25]

Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 24905--24916, 2025. URL https://openaccess.thecvf.com...

work page 2025
[26]

and Fei-Fei, L

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf

work page 2015
[27]

and Räsänen, O

Khorrami, K. and Räsänen, O. A model of early word acquisition based on realistic-scale audiovisual naming events. Speech Communication, 167: 0 103169, 2025. ISSN 0167-6393. URL https://www.sciencedirect.com/science/article/pii/S0167639324001407

work page 2025
[28]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), Proceedings of the 3rd International Conference on Learning Representations ( ICLR ) , San Diego, CA, USA, 2015. URL http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015
[29]

Lavechin, M., Sy, Y., Titeux, H., Bland \'o n, M. A. C., R \"a s \"a nen, O., Bredin, H., Dupoux, E., and Cristia, A. BabySLM : Language-acquisition-friendly benchmark of self-supervised spoken language models. In Proceedings of Interspeech 2023, pp.\ 4588--4592, 2023. URL https://www.isca-archive.org/interspeech_2023/lavechin23_interspeech.pdf

work page 2023
[30]

Gradient-based learning applied to document recognition

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998. URL https://ieeexplore.ieee.org/document/726791

work page 1998
[31]

Stacked cross attention for image-text matching

Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/papers/Kuang-Huei_Lee_Stacked_Cross_Attention_ECCV_2018_paper.pdf

work page 2018
[32]

Multi-granularity correspondence learning from long-term noisy videos

Lin, Y., Zhang, J., Huang, Z., Liu, J., zujie wen, and Peng, X. Multi-granularity correspondence learning from long-term noisy videos. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=9Cu8MRmhq2

work page 2024
[33]

Evaluating text-to-visual generation with image-to-text generation

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. Evaluating text-to-visual generation with image-to-text generation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX, pp.\ 366–384, Berlin, Heidelberg, 2024 b . Springer-Verlag. URL https://doi.org/...

work page doi:10.1007/978-3-031-72673-6_20 2024
[34]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw

work page 2023
[35]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., and Lin, D. Mmbench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI, pp.\ 216–233, Berlin, Heidelberg, 2024. Springer-Verlag. I...

work page doi:10.1007/978-3-031-72658-3_13 2024
[36]

L., Sparks, R

Long, B. L., Sparks, R. Z., Xiang, V., Stojanov, S., ZiYin, Keene, G., Tan, A. W. M., Feng, S. Y., Nag, A., Zhuang, C., Marchman, V. A., Yamins, D. L., and Frank, M. The babyview dataset: High-resolution egocentric videos of infants and young children s everyday experiences. In 8th Annual Conference on Cognitive Computational Neuroscience, 2025. URL https...

work page 2025
[37]

and Hutter, F

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations ( ICLR ) , New Orleans, LA, USA, 2019. OpenReview.net. URL https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019
[38]

Openeqa: Embodied question answering in the era of foundation models

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, A., and Rajeswaran, A. Openeqa: Embodied question answering in the era of foundation mo...

work page 2024
[39]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Mangalam, K., Akshulakov, R., and Malik, J. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=JVlWseddak

work page 2023
[40]

End-to-end learning of visual representations from uncurated instructional videos

Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 9876--9886, 2020. URL https://ieeexplore.ieee.org/document/9157128

work page arXiv 2020
[41]

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint, 2018. URL https://arxiv.org/abs/1807.03748

work page internal anchor Pith review Pith/arXiv arXiv 2018
[42]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINO v2: Learning robust visu...

work page 2024
[43]

Teaching clip to count to ten

Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., and Dekel, T. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 3170--3180, October 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Paiss_Teaching_CLIP_to_Count_to_Ten_ICCV_2023_paper.pdf

work page 2023
[44]

fastabx: A library for efficient computation of abx discriminability

Poli, M., Chemla, E., and Dupoux, E. fastabx: A library for efficient computation of abx discriminability. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.02692

work page arXiv 2025
[45]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of M...

work page 2021
[46]

Vision transformers for dense prediction

Ranftl, R., Bochkovskiy, A., and Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 12179--12188, October 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Ranftl_Vision_Transformers_for_Dense_Prediction_ICCV_2021_paper.pdf

work page 2021
[47]

A., and Dorman, M

Sharma, A., Nash, A. A., and Dorman, M. Cortical development, plasticity and re-organization in children with cochlear implants. Journal of Communication Disorders, 42 0 (4): 0 272--279, 2009. ISSN 0021-9924. URL https://www.sciencedirect.com/science/article/pii/S0021992409000306

work page 2009
[48]

Indoor segmentation and support inference from rgbd images

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C. (eds.), Computer Vision -- ECCV 2012, pp.\ 746--760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33715-4. URL https://link.springer.com/chapter/10.100...

work page doi:10.1007/978-3-642-33715-4_54 2012
[49]

and Gasser, M

Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11 0 (1-2): 0 13--29, 2005

work page 2005
[50]

Sullivan, J., Mei, M., Perfors, A., Wojcik, E., and Frank, M. C. Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5: 0 20--29, 05 2021. ISSN 2470-2986. URL https://doi.org/10.1162/opmi_a_00039

work page doi:10.1162/opmi_a_00039 2021
[51]

Tan, A. W. M., Yu, C., Long, B. L., Ma, W. A., Murray, T., Silverman, R. D., Yeatman, J. D., and Frank, M. Devbench: A multimodal developmental benchmark for language learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=zogaeVpbaE

work page 2024
[52]

Tan, A. W. M., Yang, J., Sepuri, T., Aw, K. L., Sparks, R. Z., Yin, Z., Marchman, V. A., Frank, M. C., and Long, B. Assessing the alignment between infants' visual and linguistic experience using multimodal language models. arXiv preprint, 2025. URL https://arxiv.org/abs/2511.18824

work page arXiv 2025
[53]

D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K

Veerabadran, V., Xiao, F., Kamra, N., Matias, P., Chen, J., Drooff, C., Roads, B. D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K. Benchmarking egocentric multimodal goal inference for assistive wearable agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,...

work page 2026
[54]

Position: will we run out of data? limits of llm scaling based on human-generated data

Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. Position: will we run out of data? limits of llm scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024. URL https://dl.acm.org/doi/10.5555/3692070.3694094

work page doi:10.5555/3692070.3694094 2024
[55]

Vong, W. K. and Lake, B. M. On the robustness of modeling grounded word learning through a child's egocentric input. arXiv preprint, 2025. URL https://arxiv.org/abs/2507.14749

work page arXiv 2025
[56]

K., Wang, W., Orhan, A

Vong, W. K., Wang, W., Orhan, A. E., and Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science, 383 0 (6682): 0 504--511, 2024. URL https://www.science.org/doi/abs/10.1126/science.adi1374

work page doi:10.1126/science.adi1374 2024
[57]

Babyvlm: Data-efficient pretraining of vlms inspired by infant learning

Wang, S., Chandra, A., Liu, A., Saligrama, V., and Gong, B. Babyvlm: Data-efficient pretraining of vlms inspired by infant learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 1380--1390, October 2025 a . URL https://openaccess.thecvf.com/content/ICCV2025/papers/Wang_BabyVLM_Data-Efficient_Pretraining_of_VLMs_I...

work page 2025
[58]

Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models

Wang, S., Wang, W., Wang, Z., Whitton, M., Wakeham, M., Chandra, A., Huang, J., Zhu, P., Chen, H., Li, D., et al. Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models. arXiv preprint, 2025 b . URL https://arxiv.org/abs/2512.10932

work page arXiv 2025
[59]

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BL i MP : The benchmark of linguistic minimal pairs for E nglish. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020. URL https://aclanthology.org/2020.tacl-1.25/

work page 2020
[60]

Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora

Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., and Cotterell, R. Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., W...

work page 2023
[61]

Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating

Wiles, O., Zhang, C., Albuquerque, I., Kajic, I., Wang, S., Bugliarello, E., Onoe, Y., Papalampidi, P., Ktena, I., Knutsen, C., Rashtchian, C., Nawalgaria, A., Pont-Tuset, J., and Nematzadeh, A. Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating. In The Thirteenth International Conference on Learning Representations, 202...

work page 2025
[62]

V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding

Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 6787--6...

work page 2021
[63]

Altogether: Image captioning via re-aligning alt-text

Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., and Feichtenhofer, C. Altogether: Image captioning via re-aligning alt-text. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

work page 2024
[64]

Demystifying CLIP data

Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. Demystifying CLIP data. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=5BCFlnfE1g

work page 2024
[65]

Temp CLR : Temporal alignment representation with contrastive learning

Yang, Y., Ma, J., Huang, S., Chen, L., Lin, X., Han, G., and Chang, S.-F. Temp CLR : Temporal alignment representation with contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CIFOsnhZvON

work page 2023
[66]

N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K

Yiu, E., Qraitem, M., Majhi, A. N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K. Ki VA : Kid-inspired visual analogies for testing large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=vNATZfmY6R

work page 2025
[67]

and Ballard, D

Yu, C. and Ballard, D. H. A unified model of early word learning: Integrating statistical and social cues. Neurocomput., 70 0 (13–15): 0 2149–2165, August 2007. ISSN 0925-2312. URL https://doi.org/10.1016/j.neucom.2006.01.034

work page doi:10.1016/j.neucom.2006.01.034 2007
[68]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9556--9567, 2024. URL https://openaccess.thecvf.com/content/...

work page 2024
[69]

Lit: Zero-shot transfer with locked-image text tuning

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Zhai_LiT_Zero-Shot_Transfer_With_Locked-Imag...

work page 2022
[70]

Visual grounding helps learn word meanings in low-data regimes

Zhuang, C., Fedorenko, E., and Andreas, J. Visual grounding helps learn word meanings in low-data regimes. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1311--1329, Mexico City, Mexic...

work page 2024

[1] [1]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022. URL https://dl.acm.org/doi/10.5555/3600270.3601993

work page doi:10.5555/3600270.3601993 2022

[2] [2]

L ong T ail-swap: benchmarking language models' abilities on rare words

Algayres, R., Saint-James, C.- \'E ., Luthra, M., Shen, J., Benchekroun, Y., Lin, D., Moritz, R., Pino, J., and Dupoux, E. L ong T ail-swap: benchmarking language models' abilities on rare words. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 11231--112...

work page 2025

[3] [3]

Whisperx: Time-accurate speech transcription of long-form audio

Bain, M., Huh, J., Han, T., and Zisserman, A. Whisperx: Time-accurate speech transcription of long-form audio. In Interspeech 2023, pp.\ 4489--4493, 2023. URL https://www.isca-archive.org/interspeech_2023/bain23_interspeech

work page 2023

[4] [4]

FLUX.2: Frontier Visual Intelligence

Black Forest Labs . FLUX.2: Frontier Visual Intelligence . Blog post, 2025. URL https://bfl.ai/blog/flux-2

work page 2025

[5] [5]

H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H

Bolya, D., Huang, P.-Y., Sun, P., Cho, J. H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H. A., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.-W., Dollar, P., and Feichtenhofer, C. Perception encoder: The best visual embeddings are not at the output of the network. In The Thirty-ninth Annual Conference on Neural Informati...

work page 2025

[6] [6]

Coco-stuff: Thing and stuff classes in context

Caesar, H., Uijlings, J., and Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 1209--1218, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Caesar_COCO-Stuff_Thing_and_CVPR_2018_paper.pdf

work page 2018

[7] [7]

R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D

Chang, E., Huang, Z., Liao, Y., Bhavsar, S. R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D. P., Li, E., SCHEFFER, N., Kirmani, A., Damavandi, B., Wanga, R., Kumar, A., Patel, R., Moon, S., and Dong, X. L. Wear VQA : A visual question answering benchmark for wearables in egocentric authentic real-wor...

work page 2026

[8] [8]

Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings

Charlot, T., Kunze, T., Poli, M., Cristia, A., Dupoux, E., and Lavechin, M. Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings. arXiv preprint, 2025. URL https://arxiv.org/abs/2509.15001

work page arXiv 2025

[9] [9]

Babyvision: Visual reasoning beyond language

Chen, L., Xie, W., Liang, Y., He, H., Zhao, H., Yang, Z., Huang, Z., Wu, H., Lu, H., Bao, Y., et al. Babyvision: Visual reasoning beyond language. arXiv preprint, 2026 a . URL https://arxiv.org/abs/2601.06521

work page arXiv 2026

[10] [10]

Egoplan-bench: Benchmarking multimodal large language models for human-level planning

Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., and Liu, X. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. Int. J. Comput. Vision, 134 0 (3), February 2026 b . ISSN 0920-5691. URL https://doi.org/10.1007/s11263-025-02676-0

work page doi:10.1007/s11263-025-02676-0 2026

[11] [11]

Egothink: Evaluating first-person perspective thinking capability of vision-language models

Cheng, S., Guo, Z., Wu, J., Fang, K., Li, P., Liu, H., and Liu, Y. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 14291--14302, June 2024. URL https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_E...

work page 2024

[12] [12]

H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H

Cho, J. H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H. A., Sun, P., Huang, P.-Y., Bolya, D., Ravi, N., Jain, S., Stark, T., Moon, S., Damavandi, B., Lee, V., Westbury, A., Khan, S., Kraehenbuehl, P., Dollar, P., Torresani, L., Grauman, K., and Feichtenhofer, C. Per...

work page 2025

[13] [13]

Imagenet: A large-scale hierarchical image database

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. URL https://ieeexplore.ieee.org/document/5206848

work page arXiv 2009

[14] [14]

BERT : Pre-training of deep bidirectional transformers for language understanding

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...

work page 2019

[15] [15]

Sonar: Sentence-level multimodal and language-agnostic representations

Duquenne, P.-A., Schwenk, H., and Sagot, B. Sonar: Sentence-level multimodal and language-agnostic representations. arXiv preprint, 2023. URL https://arxiv.org/abs/2308.11466

work page arXiv 2023

[16] [16]

Depth map prediction from a single image using a multi-scale deep network

Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp.\ 2366–2374, Cambridge, MA, USA, 2014. MIT Press. URL https://dl.acm.org/doi/10.5555/2969033.2969091

work page doi:10.5555/2969033.2969091 2014

[17] [17]

M., Jayaraman, S., and Smith, L

Fausey, C. M., Jayaraman, S., and Smith, L. B. From faces to hands: Changing visual input in the first two years. Cognition, 152: 0 101--107, 2016. URL https://pubmed.ncbi.nlm.nih.gov/27043744/

work page arXiv 2016

[18] [18]

Frank, M. C. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27 0 (11): 0 990--992, 2023. ISSN 1364-6613. URL https://www.sciencedirect.com/science/article/pii/S1364661323002036

work page 2023

[19] [19]

C., Goodman, N

Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. Using speakers' referential intentions to model early cross-situational word learning. Psychological science, 20 0 (5): 0 578--585, 2009

work page 2009

[20] [20]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team , Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A.,...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Development differentially sculpts receptive fields across early and high-level human visual cortex

Gomez, J., Natu, V., Jeska, B., Barnett, M., and Grill-Spector, K. Development differentially sculpts receptive fields across early and high-level human visual cortex. Nature Communications, 9 0 (1): 0 788, 2018. ISSN 2041-1723. URL https://www.nature.com/articles/s41467-018-03166-3

work page 2018

[22] [22]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., et al. The llama 3 herd of models. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Temporal alignment networks for long-term video

Han, T., Xie, W., and Zisserman, A. Temporal alignment networks for long-term video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2896--2906, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Temporal_Alignment_Networks_for_Long-Term_Video_CVPR_2022_paper.pdf

work page 2022

[24] [24]

CLIPS core: A reference-free evaluation metric for image captioning

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPS core: A reference-free evaluation metric for image captioning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 7514--7528, Online and Punta Cana, Dominican Republic, 2021. Asso...

work page 2021

[25] [25]

Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 24905--24916, 2025. URL https://openaccess.thecvf.com...

work page 2025

[26] [26]

and Fei-Fei, L

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf

work page 2015

[27] [27]

and Räsänen, O

Khorrami, K. and Räsänen, O. A model of early word acquisition based on realistic-scale audiovisual naming events. Speech Communication, 167: 0 103169, 2025. ISSN 0167-6393. URL https://www.sciencedirect.com/science/article/pii/S0167639324001407

work page 2025

[28] [28]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), Proceedings of the 3rd International Conference on Learning Representations ( ICLR ) , San Diego, CA, USA, 2015. URL http://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2015

[29] [29]

Lavechin, M., Sy, Y., Titeux, H., Bland \'o n, M. A. C., R \"a s \"a nen, O., Bredin, H., Dupoux, E., and Cristia, A. BabySLM : Language-acquisition-friendly benchmark of self-supervised spoken language models. In Proceedings of Interspeech 2023, pp.\ 4588--4592, 2023. URL https://www.isca-archive.org/interspeech_2023/lavechin23_interspeech.pdf

work page 2023

[30] [30]

Gradient-based learning applied to document recognition

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998. URL https://ieeexplore.ieee.org/document/726791

work page 1998

[31] [31]

Stacked cross attention for image-text matching

Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/papers/Kuang-Huei_Lee_Stacked_Cross_Attention_ECCV_2018_paper.pdf

work page 2018

[32] [32]

Multi-granularity correspondence learning from long-term noisy videos

Lin, Y., Zhang, J., Huang, Z., Liu, J., zujie wen, and Peng, X. Multi-granularity correspondence learning from long-term noisy videos. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=9Cu8MRmhq2

work page 2024

[33] [33]

Evaluating text-to-visual generation with image-to-text generation

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. Evaluating text-to-visual generation with image-to-text generation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX, pp.\ 366–384, Berlin, Heidelberg, 2024 b . Springer-Verlag. URL https://doi.org/...

work page doi:10.1007/978-3-031-72673-6_20 2024

[34] [34]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw

work page 2023

[35] [35]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., and Lin, D. Mmbench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI, pp.\ 216–233, Berlin, Heidelberg, 2024. Springer-Verlag. I...

work page doi:10.1007/978-3-031-72658-3_13 2024

[36] [36]

L., Sparks, R

Long, B. L., Sparks, R. Z., Xiang, V., Stojanov, S., ZiYin, Keene, G., Tan, A. W. M., Feng, S. Y., Nag, A., Zhuang, C., Marchman, V. A., Yamins, D. L., and Frank, M. The babyview dataset: High-resolution egocentric videos of infants and young children s everyday experiences. In 8th Annual Conference on Cognitive Computational Neuroscience, 2025. URL https...

work page 2025

[37] [37]

and Hutter, F

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations ( ICLR ) , New Orleans, LA, USA, 2019. OpenReview.net. URL https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019

[38] [38]

Openeqa: Embodied question answering in the era of foundation models

Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, A., and Rajeswaran, A. Openeqa: Embodied question answering in the era of foundation mo...

work page 2024

[39] [39]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Mangalam, K., Akshulakov, R., and Malik, J. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=JVlWseddak

work page 2023

[40] [40]

End-to-end learning of visual representations from uncurated instructional videos

Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 9876--9886, 2020. URL https://ieeexplore.ieee.org/document/9157128

work page arXiv 2020

[41] [41]

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint, 2018. URL https://arxiv.org/abs/1807.03748

work page internal anchor Pith review Pith/arXiv arXiv 2018

[42] [42]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINO v2: Learning robust visu...

work page 2024

[43] [43]

Teaching clip to count to ten

Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., and Dekel, T. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 3170--3180, October 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Paiss_Teaching_CLIP_to_Count_to_Ten_ICCV_2023_paper.pdf

work page 2023

[44] [44]

fastabx: A library for efficient computation of abx discriminability

Poli, M., Chemla, E., and Dupoux, E. fastabx: A library for efficient computation of abx discriminability. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.02692

work page arXiv 2025

[45] [45]

W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of M...

work page 2021

[46] [46]

Vision transformers for dense prediction

Ranftl, R., Bochkovskiy, A., and Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 12179--12188, October 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Ranftl_Vision_Transformers_for_Dense_Prediction_ICCV_2021_paper.pdf

work page 2021

[47] [47]

A., and Dorman, M

Sharma, A., Nash, A. A., and Dorman, M. Cortical development, plasticity and re-organization in children with cochlear implants. Journal of Communication Disorders, 42 0 (4): 0 272--279, 2009. ISSN 0021-9924. URL https://www.sciencedirect.com/science/article/pii/S0021992409000306

work page 2009

[48] [48]

Indoor segmentation and support inference from rgbd images

Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C. (eds.), Computer Vision -- ECCV 2012, pp.\ 746--760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33715-4. URL https://link.springer.com/chapter/10.100...

work page doi:10.1007/978-3-642-33715-4_54 2012

[49] [49]

and Gasser, M

Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11 0 (1-2): 0 13--29, 2005

work page 2005

[50] [50]

Sullivan, J., Mei, M., Perfors, A., Wojcik, E., and Frank, M. C. Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5: 0 20--29, 05 2021. ISSN 2470-2986. URL https://doi.org/10.1162/opmi_a_00039

work page doi:10.1162/opmi_a_00039 2021

[51] [51]

Tan, A. W. M., Yu, C., Long, B. L., Ma, W. A., Murray, T., Silverman, R. D., Yeatman, J. D., and Frank, M. Devbench: A multimodal developmental benchmark for language learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=zogaeVpbaE

work page 2024

[52] [52]

Tan, A. W. M., Yang, J., Sepuri, T., Aw, K. L., Sparks, R. Z., Yin, Z., Marchman, V. A., Frank, M. C., and Long, B. Assessing the alignment between infants' visual and linguistic experience using multimodal language models. arXiv preprint, 2025. URL https://arxiv.org/abs/2511.18824

work page arXiv 2025

[53] [53]

D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K

Veerabadran, V., Xiao, F., Kamra, N., Matias, P., Chen, J., Drooff, C., Roads, B. D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K. Benchmarking egocentric multimodal goal inference for assistive wearable agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,...

work page 2026

[54] [54]

Position: will we run out of data? limits of llm scaling based on human-generated data

Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. Position: will we run out of data? limits of llm scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024. URL https://dl.acm.org/doi/10.5555/3692070.3694094

work page doi:10.5555/3692070.3694094 2024

[55] [55]

Vong, W. K. and Lake, B. M. On the robustness of modeling grounded word learning through a child's egocentric input. arXiv preprint, 2025. URL https://arxiv.org/abs/2507.14749

work page arXiv 2025

[56] [56]

K., Wang, W., Orhan, A

Vong, W. K., Wang, W., Orhan, A. E., and Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science, 383 0 (6682): 0 504--511, 2024. URL https://www.science.org/doi/abs/10.1126/science.adi1374

work page doi:10.1126/science.adi1374 2024

[57] [57]

Babyvlm: Data-efficient pretraining of vlms inspired by infant learning

Wang, S., Chandra, A., Liu, A., Saligrama, V., and Gong, B. Babyvlm: Data-efficient pretraining of vlms inspired by infant learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 1380--1390, October 2025 a . URL https://openaccess.thecvf.com/content/ICCV2025/papers/Wang_BabyVLM_Data-Efficient_Pretraining_of_VLMs_I...

work page 2025

[58] [58]

Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models

Wang, S., Wang, W., Wang, Z., Whitton, M., Wakeham, M., Chandra, A., Huang, J., Zhu, P., Chen, H., Li, D., et al. Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models. arXiv preprint, 2025 b . URL https://arxiv.org/abs/2512.10932

work page arXiv 2025

[59] [59]

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BL i MP : The benchmark of linguistic minimal pairs for E nglish. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020. URL https://aclanthology.org/2020.tacl-1.25/

work page 2020

[60] [60]

Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora

Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., and Cotterell, R. Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., W...

work page 2023

[61] [61]

Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating

Wiles, O., Zhang, C., Albuquerque, I., Kajic, I., Wang, S., Bugliarello, E., Onoe, Y., Papalampidi, P., Ktena, I., Knutsen, C., Rashtchian, C., Nawalgaria, A., Pont-Tuset, J., and Nematzadeh, A. Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating. In The Thirteenth International Conference on Learning Representations, 202...

work page 2025

[62] [62]

V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding

Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 6787--6...

work page 2021

[63] [63]

Altogether: Image captioning via re-aligning alt-text

Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., and Feichtenhofer, C. Altogether: Image captioning via re-aligning alt-text. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ ...

work page 2024

[64] [64]

Demystifying CLIP data

Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. Demystifying CLIP data. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=5BCFlnfE1g

work page 2024

[65] [65]

Temp CLR : Temporal alignment representation with contrastive learning

Yang, Y., Ma, J., Huang, S., Chen, L., Lin, X., Han, G., and Chang, S.-F. Temp CLR : Temporal alignment representation with contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CIFOsnhZvON

work page 2023

[66] [66]

N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K

Yiu, E., Qraitem, M., Majhi, A. N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K. Ki VA : Kid-inspired visual analogies for testing large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=vNATZfmY6R

work page 2025

[67] [67]

and Ballard, D

Yu, C. and Ballard, D. H. A unified model of early word learning: Integrating statistical and social cues. Neurocomput., 70 0 (13–15): 0 2149–2165, August 2007. ISSN 0925-2312. URL https://doi.org/10.1016/j.neucom.2006.01.034

work page doi:10.1016/j.neucom.2006.01.034 2007

[68] [68]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9556--9567, 2024. URL https://openaccess.thecvf.com/content/...

work page 2024

[69] [69]

Lit: Zero-shot transfer with locked-image text tuning

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Zhai_LiT_Zero-Shot_Transfer_With_Locked-Imag...

work page 2022

[70] [70]

Visual grounding helps learn word meanings in low-data regimes

Zhuang, C., Fedorenko, E., and Andreas, J. Visual grounding helps learn word meanings in low-data regimes. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1311--1329, Mexico City, Mexic...

work page 2024