BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

Alejandrina Cristia; Emmanuel Dupoux; Marvin Lavechin; Maxime Poli; Tarek Kunze; Théo Charlot

arxiv: 2509.15001 · v3 · pith:FWTTS7TEnew · submitted 2025-09-18 · eess.AS · cs.LG · cs.SD

Théo Charlot, Tarek Kunze, Maxime Poli, Alejandrina Cristia, Emmanuel Dupoux, Marvin Lavechin

keywords: child-centered, recordings, speech, across, adult, babyhubert, clean, hubert

Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings from 40+ languages. Evaluated on voice type classification, the task of identifying who produces speech and when in child-centered recordings (key child, other children, male, and female adults), BabyHuBERT-VTC achieves F1-scores from 55.0% to 76.1% across six corpora, consistently outperforming W2V2-LL4300 and HuBERT (pretrained on English daylongs and clean adult speech, respectively). Notable gains include 14.0 and 18.3 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and models to support researchers working with child-centered recordings across diverse linguistic contexts.
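
The abstract reports per-corpus F1-scores for voice type classification, where several speaker types can be active at once. As a rough illustration only, the sketch below computes a per-class F1 over fixed-length frames with multi-label annotations. The label names (`KCHI`, `OCH`, `MAL`, `FEM`), the frame-level framing, and the function `per_class_f1` are assumptions made for this example; the abstract does not specify the authors' exact evaluation protocol.

```python
from typing import Dict, Sequence, Set

# Hypothetical label names for the four voice types named in the abstract:
# key child, other children, male adults, female adults.
LABELS = ("KCHI", "OCH", "MAL", "FEM")


def per_class_f1(
    reference: Sequence[Set[str]],
    hypothesis: Sequence[Set[str]],
    labels: Sequence[str] = LABELS,
) -> Dict[str, float]:
    """Frame-level F1 per class for multi-label annotations.

    Each element of `reference` / `hypothesis` is the set of labels
    active in one frame (an empty set means no speech).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    scores: Dict[str, float] = {}
    for label in labels:
        tp = fp = fn = 0
        for ref, hyp in zip(reference, hypothesis):
            in_ref, in_hyp = label in ref, label in hyp
            if in_ref and in_hyp:
                tp += 1
            elif in_hyp:
                fp += 1
            elif in_ref:
                fn += 1
        denom = 2 * tp + fp + fn
        # Convention chosen here: a class absent from both sides scores 0.0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: four frames, made up for illustration.
reference = [{"KCHI"}, {"KCHI", "FEM"}, set(), {"MAL"}]
hypothesis = [{"KCHI"}, {"FEM"}, {"FEM"}, set()]
scores = per_class_f1(reference, hypothesis)
```

Under this framing, an "absolute F1 point" gain such as the 14.0 and 18.3 figures quoted above is the plain difference between two F1-scores expressed in percent, not a relative improvement.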

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
cs.LG 2026-05 unverdicted novelty 6.0

Current VLMs depend on tightly aligned curated data and cannot exploit the weakly-aligned egocentric video signals that dominate naturalistic infant input.

Context-aware child-directed speech detection from long-form recordings
eess.AS 2026-05 unverdicted novelty 5.0

Context from neighboring speech raises average F1 by 13.8 points for child-directed speech classification; in-domain pre-training on child recordings outperforms adult-speech models, and the pipeline still beats a rul...