CS-YODAS: A Mined Dataset of In-the-Wild Code-Switched Speech

Alexander Polok; Anuj Diwan; Brian Yan; David R. Mortensen; Injy Hamed; Iris Emerman; Thomas Hain; Matthew Wiesner; Olga Iakovenko; Peter Viechnicki; Qingzheng Wang

arxiv: 2606.11514 · v1 · pith:4CQAKHTSnew · submitted 2026-06-09 · 💻 cs.SD

Brian Yan, Qingzheng Wang, Matthew Wiesner, Anuj Diwan, Olga Iakovenko, Alexander Polok, Injy Hamed, Shuichiro Shimizu, Iris Emerman, Thomas Hain, David R. Mortensen, Peter Viechnicki, Shinji Watanabe

classification 💻 cs.SD

keywords speech, code-switched, cs-yodas, dataset, code-switching, in-the-wild, languages, mined

We present CS-YODAS, a Creative Commons-licensed dataset of in-the-wild code-switched speech mined from multilingual YouTube data. Code-switching (CS), or the alternation between languages within an utterance or conversation, is common in multilingual settings but remains underrepresented in existing CS speech resources, which are typically small, domain-specific, or artificially constructed. Building on the YODAS corpus, we develop a scalable, human-in-the-loop pipeline for identifying and validating naturally occurring code-switching. The resulting dataset, which totals 313 hours and spans 7 matrix languages, provides diverse, real-world examples of spontaneous code-switched speech. We further analyze the distribution and characteristics of code-switching in the wild, examining language-pair frequencies and switching patterns, and report baseline results for spoken language identification. We hope that CS-YODAS will encourage broader and more comprehensive research on code-switched speech. Dataset link: https://huggingface.co/datasets/byan/cs-yodas.
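As a rough illustration of the kind of analysis the abstract mentions (language-pair frequencies and switching patterns), the sketch below counts switch points and language pairs over token-level language tags. Everything in it is an assumption made for illustration: the tag sequences are toy data, not drawn from CS-YODAS; the function names `switch_points` and `language_pair` are hypothetical; and treating the majority language as the matrix language is a simplification, not necessarily the definition used in the paper.

```python
from collections import Counter


def switch_points(tags):
    """Count positions where the language tag differs from the previous token's tag."""
    return sum(1 for prev, cur in zip(tags, tags[1:]) if prev != cur)


def language_pair(tags):
    """Return (matrix, embedded) for an utterance, or None if it is monolingual.

    Assumption: the most frequent tag is taken as the matrix language and the
    second most frequent as the embedded language.
    """
    counts = Counter(tags)
    if len(counts) < 2:
        return None
    (matrix, _), (embedded, _) = counts.most_common(2)
    return matrix, embedded


# Toy utterances (hypothetical, one language tag per token).
utterances = [
    ["en", "en", "es", "es", "en"],
    ["hi", "hi", "hi", "en", "hi"],
    ["en", "en", "en"],
    ["en", "es", "en", "en"],
]

# Frequency of each (matrix, embedded) pair over the code-switched utterances.
pair_counts = Counter(
    pair for pair in map(language_pair, utterances) if pair is not None
)

# Number of switch points per utterance.
switches = [switch_points(tags) for tags in utterances]

print(pair_counts)
print(switches)
```

A real analysis would replace the toy tags with language labels derived from the dataset's transcripts or annotations, whose actual schema should be checked on the dataset page.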
