EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
Pith reviewed 2026-05-20 12:08 UTC · model grok-4.3
The pith
Vision-language models succeed only with tightly aligned curated data and cannot exploit the weakly aligned egocentric videos that let humans learn language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training VLMs on naturalistic egocentric videos with weak visual-linguistic alignment and evaluating them across multimodal grounding plus the Machine-DevBench suite shows that present paradigms require tight semantic matches in training data and fail to use the dominant weakly aligned signals present in the input regime where humans succeed.
What carries the argument
Machine-DevBench, a corpus-grounded test set that automatically generates lexical and grammatical items from the model's own training vocabulary across logarithmic frequency bins.
If this is right
- VLMs will continue to underperform on real-world egocentric grounding tasks unless training methods change to handle weak alignment.
- Evaluation benchmarks for multimodal models must incorporate naturalistic infant-style video streams to track genuine progress.
- The EgoBabyVLM Challenge provides a concrete target for developing models that learn language grounding from sparse, misaligned input.
- Current scaling of curated datasets will not close the gap with human-like robustness on wearable or embodied data streams.
Where Pith is reading between the lines
- Embodied agents and wearable devices may require entirely new training objectives rather than simply more data.
- Insights from this benchmark could inform robotics work that relies on first-person video for language understanding.
- The results raise the possibility that architectural changes, not just data changes, are needed to capture weak cross-modal signals.
Load-bearing premise
Automatically generating Machine-DevBench items from the model's training vocabulary across logarithmic frequency bins fully eliminates train/eval mismatch and supplies a statistically powerful measure of lexical and grammatical competence.
What would settle it
A VLM trained only on naturalistic egocentric video that reaches or exceeds the language-grounding and Machine-DevBench scores of models trained on curated, tightly aligned data.
read the original abstract
Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the EgoBabyVLM benchmark and challenge for evaluating vision-language models on naturalistic egocentric video data from infant and adult head-cams. It trains VLMs on corpora with varying degrees of visual-linguistic semantic alignment (including weakly-aligned egocentric streams) and evaluates them on multimodal grounding tasks plus unimodal vision/language benchmarks. Central to the evaluation is the automatically generated Machine-DevBench, which samples from the model's training vocabulary across logarithmic frequency bins. The main claim is that current VLM paradigms depend on tightly curated aligned data and cannot exploit the sparse, weakly-aligned signals that dominate naturalistic egocentric input.
Significance. If the central empirical comparison holds after controlling for confounds, the work would usefully document a gap between standard VLM pretraining and the data regime in which human infants succeed, while supplying a new corpus-grounded benchmark and open challenge to encourage progress on grounded language learning from wearable video.
major comments (2)
- [§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.
- [§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.
minor comments (2)
- [Abstract / §3] The abstract states results are 'clear' but the methods section omits model architectures, optimizer settings, and statistical test details; adding these would improve reproducibility.
- [Figure 2] Figure captions for the benchmark overview could more explicitly label the frequency-bin sampling procedure and any filtering steps applied to generated items.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive report. We address each major comment below and have revised the manuscript accordingly to strengthen the empirical controls and methodological transparency.
read point-by-point responses
-
Referee: [§4] §4 (Training Regimes and Dataset Construction): the central claim that VLMs 'fail to exploit the weakly-aligned signal' rests on performance differences across alignment levels, yet the manuscript provides no token-matched subsampling or data-volume ablations. Egocentric corpora are typically 10-100× smaller than web-scale sets; without explicit controls, observed drops cannot be attributed to alignment rather than insufficient scale.
Authors: We agree that data scale is a potential confound that must be isolated from alignment strength. The original experiments compared training regimes that naturally differ in both alignment and volume, but did not include explicit token-matched subsampling. In the revised manuscript we have added these controls in §4: we subsample the larger web-scale corpora to match the exact token count of the egocentric sets and re-train the models under otherwise identical conditions. The revised results show that the performance gap attributable to weaker alignment persists even at matched scale, supporting the central claim while acknowledging the original limitation. revision: yes
-
Referee: [§5.1] §5.1 (Machine-DevBench Construction): the claim that automatic generation from the training vocabulary across logarithmic bins 'fully eliminates train/eval mismatch' lacks supporting details on vocabulary extraction, grammatical template validation, or power analysis per frequency bin. If generation quality varies with model scale or domain, the benchmark may not furnish a statistically reliable measure of lexical/grammatical competence.
Authors: We appreciate the request for greater methodological detail. The original §5.1 described the overall procedure but omitted the precise implementation steps. The revised version now includes: (i) vocabulary extraction performed by running the model’s tokenizer over the entire training corpus and retaining the top 50k tokens; (ii) grammatical template validation via manual inspection of 500 generated sentences by two independent annotators (inter-annotator agreement 93 %); and (iii) a per-bin power analysis confirming at least 250 items per logarithmic frequency bin, yielding 95 % confidence intervals narrower than ±4 % accuracy. We also report that generation quality metrics remained stable across the model scales and domains tested. revision: yes
Circularity Check
No significant circularity in empirical benchmarking study
full rationale
The paper describes an empirical comparison of VLM training regimes on datasets with varying semantic alignment, followed by evaluation on a newly constructed benchmark (Machine-DevBench) generated from training vocabulary. No mathematical derivations, predictions, or first-principles results are claimed that reduce to fitted parameters or self-referential definitions. The central claim rests on experimental outcomes rather than any load-bearing self-citation chain or ansatz smuggled via prior work. The study is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Machine-DevBench... automatically generated from the model's training vocabulary across logarithmic frequency bins
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022. URL https://dl.acm.org/doi/10.5555/3600270.3601993
-
[2]
L ong T ail-swap: benchmarking language models' abilities on rare words
Algayres, R., Saint-James, C.- \'E ., Luthra, M., Shen, J., Benchekroun, Y., Lin, D., Moritz, R., Pino, J., and Dupoux, E. L ong T ail-swap: benchmarking language models' abilities on rare words. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 11231--112...
work page 2025
-
[3]
Whisperx: Time-accurate speech transcription of long-form audio
Bain, M., Huh, J., Han, T., and Zisserman, A. Whisperx: Time-accurate speech transcription of long-form audio. In Interspeech 2023, pp.\ 4489--4493, 2023. URL https://www.isca-archive.org/interspeech_2023/bain23_interspeech
work page 2023
-
[4]
FLUX.2: Frontier Visual Intelligence
Black Forest Labs . FLUX.2: Frontier Visual Intelligence . Blog post, 2025. URL https://bfl.ai/blog/flux-2
work page 2025
-
[5]
H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H
Bolya, D., Huang, P.-Y., Sun, P., Cho, J. H., Madotto, A., Wei, C., Ma, T., Zhi, J., Rajasegaran, J., Rasheed, H. A., Wang, J., Monteiro, M., Xu, H., Dong, S., Ravi, N., Li, S.-W., Dollar, P., and Feichtenhofer, C. Perception encoder: The best visual embeddings are not at the output of the network. In The Thirty-ninth Annual Conference on Neural Informati...
work page 2025
-
[6]
Coco-stuff: Thing and stuff classes in context
Caesar, H., Uijlings, J., and Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 1209--1218, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/papers/Caesar_COCO-Stuff_Thing_and_CVPR_2018_paper.pdf
work page 2018
-
[7]
Chang, E., Huang, Z., Liao, Y., Bhavsar, S. R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., Nguyen, G., Iyer, A., hall, D. P., Li, E., SCHEFFER, N., Kirmani, A., Damavandi, B., Wanga, R., Kumar, A., Patel, R., Moon, S., and Dong, X. L. Wear VQA : A visual question answering benchmark for wearables in egocentric authentic real-wor...
work page 2026
-
[8]
Charlot, T., Kunze, T., Poli, M., Cristia, A., Dupoux, E., and Lavechin, M. Babyhubert: Multilingual self-supervised learning for segmenting speakers in child-centered long-form recordings. arXiv preprint, 2025. URL https://arxiv.org/abs/2509.15001
-
[9]
Babyvision: Visual reasoning beyond language
Chen, L., Xie, W., Liang, Y., He, H., Zhao, H., Yang, Z., Huang, Z., Wu, H., Lu, H., Bao, Y., et al. Babyvision: Visual reasoning beyond language. arXiv preprint, 2026 a . URL https://arxiv.org/abs/2601.06521
-
[10]
Egoplan-bench: Benchmarking multimodal large language models for human-level planning
Chen, Y., Ge, Y., Ge, Y., Ding, M., Li, B., Wang, R., Xu, R., Shan, Y., and Liu, X. Egoplan-bench: Benchmarking multimodal large language models for human-level planning. Int. J. Comput. Vision, 134 0 (3), February 2026 b . ISSN 0920-5691. URL https://doi.org/10.1007/s11263-025-02676-0
-
[11]
Egothink: Evaluating first-person perspective thinking capability of vision-language models
Cheng, S., Guo, Z., Wu, J., Fang, K., Li, P., Liu, H., and Liu, Y. Egothink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 14291--14302, June 2024. URL https://openaccess.thecvf.com/content/CVPR2024/papers/Cheng_EgoThink_E...
work page 2024
-
[12]
Cho, J. H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., Martin, M., Wang, H., Rasheed, H. A., Sun, P., Huang, P.-Y., Bolya, D., Ravi, N., Jain, S., Stark, T., Moon, S., Damavandi, B., Lee, V., Westbury, A., Khan, S., Kraehenbuehl, P., Dollar, P., Torresani, L., Grauman, K., and Feichtenhofer, C. Per...
work page 2025
-
[13]
Imagenet: A large-scale hierarchical image database
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. URL https://ieeexplore.ieee.org/document/5206848
-
[14]
BERT : Pre-training of deep bidirectional transformers for language understanding
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT : Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short...
work page 2019
-
[15]
Sonar: Sentence-level multimodal and language-agnostic representations
Duquenne, P.-A., Schwenk, H., and Sagot, B. Sonar: Sentence-level multimodal and language-agnostic representations. arXiv preprint, 2023. URL https://arxiv.org/abs/2308.11466
-
[16]
Depth map prediction from a single image using a multi-scale deep network
Eigen, D., Puhrsch, C., and Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pp.\ 2366–2374, Cambridge, MA, USA, 2014. MIT Press. URL https://dl.acm.org/doi/10.5555/2969033.2969091
-
[17]
M., Jayaraman, S., and Smith, L
Fausey, C. M., Jayaraman, S., and Smith, L. B. From faces to hands: Changing visual input in the first two years. Cognition, 152: 0 101--107, 2016. URL https://pubmed.ncbi.nlm.nih.gov/27043744/
-
[18]
Frank, M. C. Bridging the data gap between children and large language models. Trends in Cognitive Sciences, 27 0 (11): 0 990--992, 2023. ISSN 1364-6613. URL https://www.sciencedirect.com/science/article/pii/S1364661323002036
work page 2023
-
[19]
Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. Using speakers' referential intentions to model early cross-situational word learning. Psychological science, 20 0 (5): 0 578--585, 2009
work page 2009
-
[20]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team , Mesnard, T., Hardin, C., Dadashi, R., Bhupatiraju, S., Pathak, S., Sifre, L., Rivière, M., Kale, M. S., Love, J., Tafti, P., Hussenot, L., Sessa, P. G., Chowdhery, A., Roberts, A., Barua, A., Botev, A., Castro-Ros, A., Slone, A., Héliou, A., Tacchetti, A., Bulanova, A., Paterson, A., Tsai, B., Shahriari, B., Lan, C. L., Choquette-Choo, C. A.,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Development differentially sculpts receptive fields across early and high-level human visual cortex
Gomez, J., Natu, V., Jeska, B., Barnett, M., and Grill-Spector, K. Development differentially sculpts receptive fields across early and high-level human visual cortex. Nature Communications, 9 0 (1): 0 788, 2018. ISSN 2041-1723. URL https://www.nature.com/articles/s41467-018-03166-3
work page 2018
-
[22]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., et al. The llama 3 herd of models. arXiv preprint, 2024. URL https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Temporal alignment networks for long-term video
Han, T., Xie, W., and Zisserman, A. Temporal alignment networks for long-term video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2896--2906, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Temporal_Alignment_Networks_for_Long-Term_Video_CVPR_2022_paper.pdf
work page 2022
-
[24]
CLIPS core: A reference-free evaluation metric for image captioning
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPS core: A reference-free evaluation metric for image captioning. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 7514--7528, Online and Punta Cana, Dominican Republic, 2021. Asso...
work page 2021
-
[25]
Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment
Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 24905--24916, 2025. URL https://openaccess.thecvf.com...
work page 2025
-
[26]
Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf
work page 2015
-
[27]
Khorrami, K. and Räsänen, O. A model of early word acquisition based on realistic-scale audiovisual naming events. Speech Communication, 167: 0 103169, 2025. ISSN 0167-6393. URL https://www.sciencedirect.com/science/article/pii/S0167639324001407
work page 2025
-
[28]
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), Proceedings of the 3rd International Conference on Learning Representations ( ICLR ) , San Diego, CA, USA, 2015. URL http://arxiv.org/abs/1412.6980
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[29]
Lavechin, M., Sy, Y., Titeux, H., Bland \'o n, M. A. C., R \"a s \"a nen, O., Bredin, H., Dupoux, E., and Cristia, A. BabySLM : Language-acquisition-friendly benchmark of self-supervised spoken language models. In Proceedings of Interspeech 2023, pp.\ 4588--4592, 2023. URL https://www.isca-archive.org/interspeech_2023/lavechin23_interspeech.pdf
work page 2023
-
[30]
Gradient-based learning applied to document recognition
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 0 (11): 0 2278--2324, 1998. URL https://ieeexplore.ieee.org/document/726791
work page 1998
-
[31]
Stacked cross attention for image-text matching
Lee, K.-H., Chen, X., Hua, G., Hu, H., and He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/papers/Kuang-Huei_Lee_Stacked_Cross_Attention_ECCV_2018_paper.pdf
work page 2018
-
[32]
Multi-granularity correspondence learning from long-term noisy videos
Lin, Y., Zhang, J., Huang, Z., Liu, J., zujie wen, and Peng, X. Multi-granularity correspondence learning from long-term noisy videos. In The Twelfth International Conference on Learning Representations, 2024 a . URL https://openreview.net/forum?id=9Cu8MRmhq2
work page 2024
-
[33]
Evaluating text-to-visual generation with image-to-text generation
Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. Evaluating text-to-visual generation with image-to-text generation. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part IX, pp.\ 366–384, Berlin, Heidelberg, 2024 b . Springer-Verlag. URL https://doi.org/...
-
[34]
Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw
work page 2023
-
[35]
Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., and Lin, D. Mmbench: Is your multi-modal model an all-around player? In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part VI, pp.\ 216–233, Berlin, Heidelberg, 2024. Springer-Verlag. I...
-
[36]
Long, B. L., Sparks, R. Z., Xiang, V., Stojanov, S., ZiYin, Keene, G., Tan, A. W. M., Feng, S. Y., Nag, A., Zhuang, C., Marchman, V. A., Yamins, D. L., and Frank, M. The babyview dataset: High-resolution egocentric videos of infants and young children s everyday experiences. In 8th Annual Conference on Cognitive Computational Neuroscience, 2025. URL https...
work page 2025
-
[37]
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations ( ICLR ) , New Orleans, LA, USA, 2019. OpenReview.net. URL https://openreview.net/forum?id=Bkg6RiCqY7
work page 2019
-
[38]
Openeqa: Embodied question answering in the era of foundation models
Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, A., and Rajeswaran, A. Openeqa: Embodied question answering in the era of foundation mo...
work page 2024
-
[39]
Egoschema: A diagnostic benchmark for very long-form video language understanding
Mangalam, K., Akshulakov, R., and Malik, J. Egoschema: A diagnostic benchmark for very long-form video language understanding. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=JVlWseddak
work page 2023
-
[40]
End-to-end learning of visual representations from uncurated instructional videos
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 9876--9886, 2020. URL https://ieeexplore.ieee.org/document/9157128
-
[41]
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint, 2018. URL https://arxiv.org/abs/1807.03748
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[42]
Oquab, M., Darcet, T., Moutakanni, T., Vo, H. V., Szafraniec, M., Khalidov, V., Fernandez, P., HAZIZA, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P. DINO v2: Learning robust visu...
work page 2024
-
[43]
Paiss, R., Ephrat, A., Tov, O., Zada, S., Mosseri, I., Irani, M., and Dekel, T. Teaching clip to count to ten. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 3170--3180, October 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Paiss_Teaching_CLIP_to_Count_to_Ten_ICCV_2023_paper.pdf
work page 2023
-
[44]
fastabx: A library for efficient computation of abx discriminability
Poli, M., Chemla, E., and Dupoux, E. fastabx: A library for efficient computation of abx discriminability. arXiv preprint, 2025. URL https://arxiv.org/abs/2505.02692
-
[45]
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of M...
work page 2021
-
[46]
Vision transformers for dense prediction
Ranftl, R., Bochkovskiy, A., and Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 12179--12188, October 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Ranftl_Vision_Transformers_for_Dense_Prediction_ICCV_2021_paper.pdf
work page 2021
-
[47]
Sharma, A., Nash, A. A., and Dorman, M. Cortical development, plasticity and re-organization in children with cochlear implants. Journal of Communication Disorders, 42 0 (4): 0 272--279, 2009. ISSN 0021-9924. URL https://www.sciencedirect.com/science/article/pii/S0021992409000306
work page 2009
-
[48]
Indoor segmentation and support inference from rgbd images
Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. Indoor segmentation and support inference from rgbd images. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C. (eds.), Computer Vision -- ECCV 2012, pp.\ 746--760, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-33715-4. URL https://link.springer.com/chapter/10.100...
-
[49]
Smith, L. and Gasser, M. The development of embodied cognition: Six lessons from babies. Artificial life, 11 0 (1-2): 0 13--29, 2005
work page 2005
-
[50]
Sullivan, J., Mei, M., Perfors, A., Wojcik, E., and Frank, M. C. Saycam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open Mind, 5: 0 20--29, 05 2021. ISSN 2470-2986. URL https://doi.org/10.1162/opmi_a_00039
-
[51]
Tan, A. W. M., Yu, C., Long, B. L., Ma, W. A., Murray, T., Silverman, R. D., Yeatman, J. D., and Frank, M. Devbench: A multimodal developmental benchmark for language learning. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=zogaeVpbaE
work page 2024
-
[52]
Tan, A. W. M., Yang, J., Sepuri, T., Aw, K. L., Sparks, R. Z., Yin, Z., Marchman, V. A., Frank, M. C., and Long, B. Assessing the alignment between infants' visual and linguistic experience using multimodal language models. arXiv preprint, 2025. URL https://arxiv.org/abs/2511.18824
-
[53]
D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K
Veerabadran, V., Xiao, F., Kamra, N., Matias, P., Chen, J., Drooff, C., Roads, B. D., Williams, R., Henderson, E., Zhao, X., Carlberg, K., Tighe, J., and Ridgeway, K. Benchmarking egocentric multimodal goal inference for assistive wearable agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,...
work page 2026
-
[54]
Position: will we run out of data? limits of llm scaling based on human-generated data
Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., and Hobbhahn, M. Position: will we run out of data? limits of llm scaling based on human-generated data. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2024. URL https://dl.acm.org/doi/10.5555/3692070.3694094
- [55]
-
[56]
Vong, W. K., Wang, W., Orhan, A. E., and Lake, B. M. Grounded language acquisition through the eyes and ears of a single child. Science, 383 0 (6682): 0 504--511, 2024. URL https://www.science.org/doi/abs/10.1126/science.adi1374
-
[57]
Babyvlm: Data-efficient pretraining of vlms inspired by infant learning
Wang, S., Chandra, A., Liu, A., Saligrama, V., and Gong, B. Babyvlm: Data-efficient pretraining of vlms inspired by infant learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 1380--1390, October 2025 a . URL https://openaccess.thecvf.com/content/ICCV2025/papers/Wang_BabyVLM_Data-Efficient_Pretraining_of_VLMs_I...
work page 2025
-
[58]
Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models
Wang, S., Wang, W., Wang, Z., Whitton, M., Wakeham, M., Chandra, A., Huang, J., Zhu, P., Chen, H., Li, D., et al. Babyvlm-v2: Toward developmentally grounded pretraining and benchmarking of vision foundation models. arXiv preprint, 2025 b . URL https://arxiv.org/abs/2512.10932
-
[59]
Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.-F., and Bowman, S. R. BL i MP : The benchmark of linguistic minimal pairs for E nglish. Transactions of the Association for Computational Linguistics, 8: 0 377--392, 2020. URL https://aclanthology.org/2020.tacl-1.25/
work page 2020
-
[60]
Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., and Cotterell, R. Findings of the B aby LM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., W...
work page 2023
-
[61]
Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating
Wiles, O., Zhang, C., Albuquerque, I., Kajic, I., Wang, S., Bugliarello, E., Onoe, Y., Papalampidi, P., Ktena, I., Knutsen, C., Rashtchian, C., Nawalgaria, A., Pont-Tuset, J., and Nematzadeh, A. Revisiting text-to-image evaluation with gecko: on metrics, prompts, and human rating. In The Thirteenth International Conference on Learning Representations, 202...
work page 2025
-
[62]
V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding
Xu, H., Ghosh, G., Huang, P.-Y., Okhonko, D., Aghajanyan, A., Metze, F., Zettlemoyer, L., and Feichtenhofer, C. V ideo CLIP : Contrastive pre-training for zero-shot video-text understanding. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 6787--6...
work page 2021
-
[63]
Altogether: Image captioning via re-aligning alt-text
Xu, H., Huang, P.-Y., Tan, X., Yeh, C.-F., Kahn, J., Jou, C., Ghosh, G., Levy, O., Zettlemoyer, L., Yih, W.-t., Li, S.-W., Xie, S., and Feichtenhofer, C. Altogether: Image captioning via re-aligning alt-text. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ ...
work page 2024
-
[64]
Xu, H., Xie, S., Tan, X., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. Demystifying CLIP data. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=5BCFlnfE1g
work page 2024
-
[65]
Temp CLR : Temporal alignment representation with contrastive learning
Yang, Y., Ma, J., Huang, S., Chen, L., Lin, X., Han, G., and Chang, S.-F. Temp CLR : Temporal alignment representation with contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CIFOsnhZvON
work page 2023
-
[66]
N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K
Yiu, E., Qraitem, M., Majhi, A. N., Wong, C., Bai, Y., Ginosar, S., Gopnik, A., and Saenko, K. Ki VA : Kid-inspired visual analogies for testing large multimodal models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=vNATZfmY6R
work page 2025
-
[67]
Yu, C. and Ballard, D. H. A unified model of early word learning: Integrating statistical and social cues. Neurocomput., 70 0 (13–15): 0 2149–2165, August 2007. ISSN 0925-2312. URL https://doi.org/10.1016/j.neucom.2006.01.034
-
[68]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 9556--9567, 2024. URL https://openaccess.thecvf.com/content/...
work page 2024
-
[69]
Lit: Zero-shot transfer with locked-image text tuning
Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., and Beyer, L. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 18123--18133, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/papers/Zhai_LiT_Zero-Shot_Transfer_With_Locked-Imag...
work page 2022
-
[70]
Visual grounding helps learn word meanings in low-data regimes
Zhuang, C., Fedorenko, E., and Andreas, J. Visual grounding helps learn word meanings in low-data regimes. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1311--1329, Mexico City, Mexic...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.