ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark

Chien Dinh Huynh; Dung D. Le; Duy Mai Hoang; Giang-Son Nguyen; Inigo Jauregi Unanue; Massimo Piccardi; Nhu Vo; Tung X. Nguyen; Wray Buntine

arxiv: 2602.12911 · v2 · pith:FWVG5GCFnew · submitted 2026-02-13 · 💻 cs.CL

classification 💻 cs.CL

keywords medical, vietnamese, code-switching, dataset, english, speech, benchmark, systems

Code-switching (CS), in which Vietnamese speech incorporates English words such as drug or procedure names, is common in Vietnamese medical communication. It poses a challenge for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and compare fine-tuning strategies for improving medical term recognition in order to identify the most effective approach on this dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. Combining the two approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.
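To make the distinction between overall and code-switched accuracy concrete, the sketch below scores a single transcript two ways: word error rate (WER) over the whole utterance, and recall of the English terms found in a lexicon. This is an illustration only, not the authors' evaluation protocol: the function names, the whitespace tokenization, the tiny lexicon, and the example sentences are all assumptions made for this sketch.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Overall WER: edit distance divided by the reference length."""
    ref = ref_text.lower().split()
    hyp = hyp_text.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cs_term_recall(ref_text, hyp_text, lexicon):
    """Fraction of lexicon terms in the reference that the hypothesis also contains.

    Returns None when the reference has no lexicon term.
    """
    targets = [w for w in ref_text.lower().split() if w in lexicon]
    if not targets:
        return None
    remaining = Counter(hyp_text.lower().split())
    hits = 0
    for word in targets:
        if remaining[word] > 0:
            remaining[word] -= 1  # each hypothesis word can match only once
            hits += 1
    return hits / len(targets)


# Hypothetical example: the recognizer spells out "MRI" phonetically.
lexicon = {"mri", "insulin"}  # assumed stand-in for a bilingual medical lexicon
reference = "bệnh nhân cần chụp mri và dùng insulin"
hypothesis = "bệnh nhân cần chụp em rai và dùng insulin"

print(wer(reference, hypothesis))                      # 0.25
print(cs_term_recall(reference, hypothesis, lexicon))  # 0.5
```

In this toy case the overall WER is modest (2 errors over 8 reference words), yet half of the English medical terms are lost, which is the kind of gap a code-switching benchmark is meant to expose.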

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation
cs.CL · 2026-06 · unverdicted · novelty 6.0

PiDA generates phonetically similar corruptions for fine-tuning NMT on FLEURS Vietnamese-English, improving translation of ASR errors by up to +2.04 BLEU while slightly boosting clean performance.