
arxiv: 2509.15001 · v3 · pith:FWTTS7TEnew · submitted 2025-09-18 · 📡 eess.AS · cs.LG · cs.SD

BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

classification 📡 eess.AS · cs.LG · cs.SD
keywords child-centered · recordings · speech · across · adult · babyhubert · clean · hubert

Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings from 40+ languages. Evaluated on voice type classification, the task of identifying who produces speech and when in child-centered recordings (key child, other children, male, and female adults), BabyHuBERT-VTC achieves F1-scores from 55.0% to 76.1% across six corpora, consistently outperforming W2V2-LL4300 and HuBERT (pretrained on English daylongs and clean adult speech, respectively). Notable gains include 14.0 and 18.3 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and models to support researchers working with child-centered recordings across diverse linguistic contexts.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Current VLMs depend on tightly aligned curated data and cannot exploit the weakly-aligned egocentric video signals that dominate naturalistic infant input.

  2. Context-aware child-directed speech detection from long-form recordings

    eess.AS · 2026-05 · unverdicted · novelty 5.0

    Context from neighboring speech raises average F1 by 13.8 points for child-directed speech classification; in-domain pre-training on child recordings outperforms adult-speech models, and the pipeline still beats a rul...