BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings
read the original abstract
Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings from 40+ languages. Evaluated on voice type classification, the task of identifying who produces speech and when in child-centered recordings (key child, other children, male, and female adults), BabyHuBERT-VTC achieves F1-scores from 55.0% to 76.1% across six corpora, consistently outperforming W2V2-LL4300 and HuBERT (pretrained on English daylongs and clean adult speech, respectively). Notable gains include 14.0 and 18.3 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and models to support researchers working with child-centered recordings across diverse linguistic contexts.
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
-
EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
Current VLMs depend on tightly aligned curated data and cannot exploit the weakly-aligned egocentric video signals that dominate naturalistic infant input.
-
Context-aware child-directed speech detection from long-form recordings
Context from neighboring speech raises average F1 by 13.8 points for child-directed speech classification; in-domain pre-training on child recordings outperforms adult-speech models, and the pipeline still beats a rul...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.