arxiv: 2606.19381 · v1 · pith:HNSHKTK4new · submitted 2026-06-14 · 💻 cs.SD · cs.AI

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Pith reviewed 2026-06-27 04:16 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords code-switching ASR · synthetic speech · preference learning · code mixing index · text-to-speech augmentation · Whisper fine-tuning · Mandarin-English corpus

The pith

A code-mixing guided preference-learning framework using the Code Mixing Index steers TTS output to produce synthetic speech that improves code-switching ASR fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the scarcity of high-quality code-switching text-speech pairs that limits training of ASR systems. Current TTS methods for augmentation prioritize audio fidelity but leave language boundaries uncontrolled, which reduces their value for ASR. The authors introduce a preference-learning loop driven by the Code Mixing Index to rank and select synthetic utterances that better match desired mixing patterns. When this data is used to fine-tune Whisper Large, Mixed Error Rate falls from 12.1% to 8.9% on DevMAN and from 17.8% to 14.2% on DevSGE, the held-out sets of the SEAME Mandarin-English corpus. A reader would care because the method supplies a concrete, repeatable way to turn ordinary TTS into more useful training material for conversational multilingual speech.

Core claim

The authors propose a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.
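The headline numbers can be restated as absolute and relative reductions. The short sketch below only re-does that arithmetic from the four MER values quoted above; the dictionary layout and function name are illustrative choices, not anything taken from the paper.

```python
# MER values (%) quoted in the core claim: (before, after) per evaluation set.
reported = {"DevMAN": (12.1, 8.9), "DevSGE": (17.8, 14.2)}


def mer_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


summary = {name: mer_reduction(b, a) for name, (b, a) in reported.items()}
for name, (abs_pts, rel_pct) in summary.items():
    print(f"{name}: -{abs_pts:.1f} points absolute, -{rel_pct:.1f}% relative")
```

On these figures the relative reduction is roughly a quarter on DevMAN and a fifth on DevSGE.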

What carries the argument

The code-mixing guided preference-learning framework, which applies the Code Mixing Index to rank TTS outputs and retain samples whose language-boundary statistics better match target code-switching patterns.
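For readers unfamiliar with the metric, here is a minimal sketch of the utterance-level Code Mixing Index in the form given by Gambäck and Das, together with a ΔCMI helper. The paper applies an acoustic, frame-level variant, so this is an approximation under stated assumptions: the function names, the `"other"` label for language-independent units, and the use of an absolute difference for ΔCMI are all hypothetical choices made for illustration.

```python
from collections import Counter


def cmi(lang_tags: list[str], neutral: str = "other") -> float:
    """Utterance-level Code Mixing Index; 0.0 means monolingual.

    lang_tags holds one language label per token (or per frame).
    Labels equal to `neutral` are treated as language-independent.
    """
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    n_lang = sum(counts.values())  # units that carry a language label
    if n_lang == 0:
        return 0.0
    dominant = max(counts.values())  # size of the majority language
    return 100.0 * (1.0 - dominant / n_lang)


def delta_cmi(reference_tags: list[str], synthetic_tags: list[str]) -> float:
    """Assumed form of ΔCMI: absolute gap between reference and synthetic CMI."""
    return abs(cmi(reference_tags) - cmi(synthetic_tags))


reference = ["zh", "zh", "en", "en"]   # balanced mixing
synthetic = ["zh", "zh", "zh", "en"]   # drifts toward Mandarin
print(cmi(reference), cmi(synthetic), delta_cmi(reference, synthetic))
```

A smaller ΔCMI means the synthetic utterance mixes languages in a proportion closer to the reference.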

If this is right

  • Synthetic speech for ASR augmentation can be optimized for language-boundary consistency in addition to acoustic fidelity.
  • Preference learning provides a practical mechanism to enforce code-mixing statistics without retraining the underlying TTS model from scratch.
  • Whisper Large fine-tuned on the resulting data achieves lower mixed error rates on conversational Mandarin-English test sets.
  • The same preference loop can be applied to other code-switching language pairs where real paired data remain scarce.
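One way to picture the preference step is as pair construction over several TTS candidates for the same text. The sketch below is a simplified stand-in, not the authors' procedure: it ranks by ΔCMI alone, whereas the paper reportedly combines several critic signals, and the `Candidate` class, `build_preference_pair`, and the `min_gap` threshold are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    audio_id: str     # hypothetical identifier of one synthesized sample
    delta_cmi: float  # gap to the reference CMI, computed elsewhere


def build_preference_pair(
    candidates: list[Candidate], min_gap: float = 5.0
) -> Optional[tuple[Candidate, Candidate]]:
    """Return (chosen, rejected), or None when candidates are too similar."""
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.delta_cmi)
    chosen, rejected = ranked[0], ranked[-1]  # smallest vs. largest gap
    if rejected.delta_cmi - chosen.delta_cmi < min_gap:
        return None  # assumed rule: skip pairs with no clear preference
    return chosen, rejected


pool = [Candidate("a", 12.0), Candidate("b", 2.5), Candidate("c", 30.0)]
pair = build_preference_pair(pool)
print(pair)
```

Pairs built this way can feed a preference-optimization objective without retraining the TTS model from scratch, which is the property the bullet list above relies on.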

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may lower the cost of building ASR systems for additional low-resource code-switching pairs by substituting guided synthetic data for new recordings.
  • Directly embedding CMI into the TTS training objective rather than using it only for post-generation selection could produce even stronger gains.
  • The approach might transfer to other sequence tasks, such as machine translation of code-switched text, where boundary consistency also matters.

Load-bearing premise

The Code Mixing Index can be used inside a preference-learning loop to produce synthetic speech whose language-boundary properties measurably improve ASR fine-tuning utility.

What would settle it

Fine-tuning Whisper Large on CMI-guided synthetic data and observing no reduction (or an increase) in MER on both DevMAN and DevSGE would falsify the central claim.
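The test described above depends on how MER is scored. A common convention for Mandarin-English corpora, assumed here rather than confirmed by the source, counts Mandarin characters and English words as the units of one edit-distance computation. The helper names and the tokenization regex are illustrative.

```python
import re


def mixed_tokens(text: str) -> list[str]:
    """Split into single Mandarin characters and lower-cased English words."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9']+", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token lists, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mer(refs: list[str], hyps: list[str]) -> float:
    """Mixed Error Rate in percent over paired reference/hypothesis strings."""
    total = sum(len(mixed_tokens(r)) for r in refs)
    if total == 0:
        raise ValueError("references contain no scorable tokens")
    errors = sum(edit_distance(mixed_tokens(r), mixed_tokens(h))
                 for r, h in zip(refs, hyps))
    return 100.0 * errors / total


def claim_falsified(before: dict[str, float], after: dict[str, float]) -> bool:
    """True when MER fails to drop on every evaluation set."""
    return all(after[name] >= before[name] for name in before)


print(mer(["我想去 shopping mall"], ["我想去 shop mall"]))
```

Applied to the reported figures, `claim_falsified({"DevMAN": 12.1, "DevSGE": 17.8}, {"DevMAN": 8.9, "DevSGE": 14.2})` evaluates to `False`.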

Figures

Figures reproduced from arXiv: 2606.19381 by Eng Siong Chng, Haoyang Li, Hardik B. Sailor, Hexin Liu, Jeremy H. M. Wong, Leibny Paola Garcia-Perera, Shreyas Gopal, Yizhou Peng, Yue Heng Yeo.

Figure 1: Overview of the proposed DPO-based TTS alignment framework.
Adjacent paper text: “… utterances, and prefer candidates with smaller ΔCMI. This method integrates an acoustic, frame-level code-mixing metric into preference learning for code-switched TTS, directly guiding generation toward more realistic language-boundary synthetic speech. 3.2.3. Preference Selection. To combine heterogeneous critic signals, all metrics are first co…”
Original abstract

Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that existing CS TTS methods optimize for reconstruction fidelity but not language-boundary consistency, and proposes a code-mixing guided preference-learning framework that uses the Code Mixing Index (CMI) to steer synthetic speech generation. On the SEAME Mandarin-English corpus, fine-tuning Whisper Large with the resulting data is reported to reduce Mixed Error Rate from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets.

Significance. If the reported gains can be causally attributed to the CMI-guided preference loop rather than generic synthetic augmentation, the work would offer a concrete, metric-driven way to improve the utility of TTS data for code-switching ASR, addressing a known data-scarcity bottleneck. The approach combines an existing metric (CMI) with preference learning in a downstream-task-oriented manner.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline MER reductions (12.1%→8.9% on DevMAN, 17.8%→14.2% on DevSGE) are presented without any contrast to a plain TTS baseline or a non-CMI preference-learning baseline. Because the central claim is that CMI guidance specifically improves ASR utility, the absence of these controls makes it impossible to isolate the contribution of the proposed loop from the generic effect of adding more synthetic CS-like audio.
  2. [Method] Method section: the precise formulation of how CMI is turned into a preference reward or loss (e.g., the exact preference model, sampling strategy, or weighting) is not specified, preventing assessment of whether the framework is reproducible or whether the reported gains could be obtained with simpler CMI filtering rather than a full preference-learning loop.
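The "simpler CMI filtering" control that the second comment alludes to could look like the sketch below: synthetic samples are discarded after generation when their ΔCMI is too large, and the TTS model itself is never updated. The field names, the threshold value, and the sample records are hypothetical; nothing here is taken from the paper.

```python
def filter_by_delta_cmi(samples: list[dict], max_delta: float = 10.0) -> list[dict]:
    """Keep only synthetic samples whose ΔCMI does not exceed max_delta."""
    return [s for s in samples if s["delta_cmi"] <= max_delta]


# Toy records for illustration only; not data from the paper.
samples = [
    {"id": "a", "delta_cmi": 4.0},
    {"id": "b", "delta_cmi": 15.0},
    {"id": "c", "delta_cmi": 10.0},
]
kept = filter_by_delta_cmi(samples)
print([s["id"] for s in kept])
```

Comparing ASR fine-tuned on `kept` against ASR fine-tuned on preference-tuned TTS output would show whether the full loop adds anything beyond post hoc selection.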

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments identify important gaps in experimental controls and methodological detail. We address each point below and will revise the manuscript to incorporate the requested baselines and clarifications.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline MER reductions (12.1%→8.9% on DevMAN, 17.8%→14.2% on DevSGE) are presented without any contrast to a plain TTS baseline or a non-CMI preference-learning baseline. Because the central claim is that CMI guidance specifically improves ASR utility, the absence of these controls makes it impossible to isolate the contribution of the proposed loop from the generic effect of adding more synthetic CS-like audio.

    Authors: We agree that the current experiments do not isolate the contribution of CMI guidance from generic synthetic data augmentation. In the revised version we will add two controls: (1) a plain TTS baseline that generates code-switched speech without any preference learning, and (2) a non-CMI preference-learning baseline that uses a different reward signal. These additions will allow direct attribution of gains to the CMI-guided loop.
    revision: yes

  2. Referee: [Method] Method section: the precise formulation of how CMI is turned into a preference reward or loss (e.g., the exact preference model, sampling strategy, or weighting) is not specified, preventing assessment of whether the framework is reproducible or whether the reported gains could be obtained with simpler CMI filtering rather than a full preference-learning loop.

    Authors: We acknowledge that the current method description lacks the precise mathematical formulation and implementation details needed for reproducibility. In the revision we will expand the Method section to specify the preference model (including how CMI is converted to a scalar reward), the sampling strategy for preference pairs, and all weighting hyperparameters. We will also discuss why a full preference-learning loop is used rather than simple CMI filtering.
    revision: yes
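Since the figure caption describes the framework as DPO-based, the standard Direct Preference Optimization loss for one preference pair is a useful reference point. The sketch below is that standard loss, not the authors' exact objective, which the report notes is unspecified. The argument names and the default `beta` are assumptions; in this formulation the CMI-derived score would only decide which sample counts as chosen and which as rejected.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_* are sequence log-probabilities under the model being tuned;
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Identical policy and reference give log(2); favouring the chosen sample lowers it.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls as the tuned model raises the likelihood of the chosen sample relative to the rejected one, measured against the reference model.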

Circularity Check

0 steps flagged

No significant circularity; derivation applies external CMI metric to TTS generation then measures downstream ASR utility

Full rationale

The paper's core chain is: (1) apply existing Code Mixing Index (CMI) to rank or preference-tune TTS outputs for language-boundary properties, (2) use the resulting synthetic CS speech to fine-tune Whisper, (3) report MER on held-out DevMAN/DevSGE sets. CMI is an independently defined metric from prior literature, not redefined inside the paper; the ASR numbers are external evaluation metrics. No equation reduces a claimed prediction to a fitted parameter by construction, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior work by the same authors. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CMI optimization during synthesis improves ASR utility; no free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: CMI scores correlate with downstream ASR utility when used to guide preference learning.
    This assumption is required for the framework to deliver the claimed benefit.

pith-pipeline@v0.9.1-grok · 5715 in / 1093 out tokens · 42714 ms · 2026-06-27T04:16:36.985804+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 7 linked inside Pith

  1. [1]

    Introduction. Code-Switching (CS), the alternation of multiple languages within a single utterance, is a phenomenon in multilingual communities [1, 2, 3]. Despite substantial advances in Automatic Speech Recognition (ASR), recognising conversational CS speech remains challenging due to language alternation, cross-lingual phonetic interference, and infor...

  2. [2]

    DPO is a methodology derived from preference learning that reformulates alignment as supervised learning over preferred and dis-preferred sample pairs, enabling stable optimisation

    Preference Learning. Preference learning has recently emerged as a scalable preference alignment paradigm for generative models, offering an alternative to explicit reward modeling and reinforcement learning [24, 25, 26, 27]. DPO is a methodology derived from preference learning that reformulates alignment as supervised learning over preferred and di...

  3. [3]

    Proposed Methodology. 3.1. Acoustic-level CMI Speech. In this paper, our proposed framework follows the language alignment strategy proposed in [29], generating pseudo frame-level language labels directly from the decoder cross-attention without requiring forced alignment or manual annotations. Specifically, averaged cross-attention from the last decoder l...

  4. [4]

    Experiment Setup. The experiments in this paper are conducted on the SEAME corpus, a benchmark for conversational Mandarin-English code-switching speech recognition [31]. SEAME contains approximately 192 hours of spontaneous conversational speech recorded from over 150 bilingual speakers in Singapore and Malaysia, where Mandarin and English are frequentl...

  5. [5]

    Results. Table 1 shows how progressively adding critic signals to DPO better aligns CS TTS with our objective of generating better quality synthetic speech that effectively improves downstream CS-ASR. The table shows the results after DPO fine-tuning CosyVoice TTS on SEAME training set and reproducing Dev- Table 1: Cosyvoice TTS Performance Comparison Acro...

  6. [6]

    Our approach explicitly aligns synthetic speech with intelligibility, perceptual quality, and realistic language-mixing structure

    Conclusion. In this work, we presented a CS metric guided DPO framework for improving CS TTS. Our approach explicitly aligns synthetic speech with intelligibility, perceptual quality, and realistic language-mixing structure. By integrating ΔCMI for measuring synthesised CS complexity, we construct contrastive preference pairs through normalized sco...

  7. [7]

    Generative AI Use Disclosure. Generative AI tools were used for limited assistance with manuscript editing and presentation (e.g., grammatical validation, removal of redundant sentences and phrases, preparing LaTeX equations and LaTeX formatting suggestions). The literature review and all scientific contributions, including but not limited to problem...

  8. [8]

    A survey of code-switching: Linguistic and social perspectives for language technologies,

    A. S. Doğruöz, S. Sitaram, B. E. Bullock, and A. J. Toribio, “A survey of code-switching: Linguistic and social perspectives for language technologies,” in Proc. ACL, 2021, pp. 1654–1666

  9. [9]

    End-to-end language diarization for bilingual code-switching speech,

    H. Liu, L. P. Garcia, X. Zhang, J. Dauwels, A. W. H. Khong, S. Khudanpur, and S. J. Styles, “End-to-end language diarization for bilingual code-switching speech,” in Proc. Interspeech, 2021, pp. 1489–1493

  10. [10]

    Reducing language confusion for code-switching speech recognition with token-level language diarization,

    H. Liu, H. Xu, L. P. Garcia, A. W. H. Khong, Y. He, and S. Khudanpur, “Reducing language confusion for code-switching speech recognition with token-level language diarization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2023, pp. 1–5

  11. [11]

    Decm: Evaluating bilingual asr performance on a code-switched speech dataset,

    E. Y. Ugan, N.-Q. Pham, and A. Waibel, “Decm: Evaluating bilingual asr performance on a code-switched speech dataset,” in Proc. LREC-COLING, 2024

  12. [12]

    Enhancing code-switching speech recognition with interactive language biases,

    H. Liu, L. P. Garcia, X. Zhang, A. W. Khong, and S. Khudanpur, “Enhancing code-switching speech recognition with interactive language biases,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. IEEE, 2024, pp. 10886–10890

  13. [13]

    Seame: a mandarin-english code-switching speech corpus in south-east asia,

    D.-C. Lyu, T.-P. Tan, E. S. Chng, and H. Li, “Seame: a mandarin-english code-switching speech corpus in south-east asia,” in Proc. Interspeech, 2010, pp. 1986–1989

  14. [14]

    Conformer: Convolution-augmented transformer for speech recognition,

    A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040

  15. [15]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023

  16. [16]

    Adapting whisper for parameter-efficient code-switching speech recognition via soft prompt tuning,

    H. Yang, Y. Peng, H. Huang, and S. Li, “Adapting whisper for parameter-efficient code-switching speech recognition via soft prompt tuning,” in Proc. Interspeech, 2025

  17. [17]

    Enhancing low-resource asr through versatile tts: Bridging the data gap,

    G. Yang, F. Yu, Z. Ma, Z. Du, Z. Gao, S. Zhang, and X. Chen, “Enhancing low-resource asr through versatile tts: Bridging the data gap,” arXiv preprint arXiv:2410.16726, 2024

  18. [18]

    Text-to-speech data augmentation for low resource speech recognition,

    R. Zevallos, “Text-to-speech data augmentation for low resource speech recognition,” arXiv preprint arXiv:2204.00291, 2022

  19. [19]

    Data augmentation for asr using tts via a discrete representation,

    S. Ueno, K. Kawakami, H. Inaguma, and S. Nakamura, “Data augmentation for asr using tts via a discrete representation,” in Proc. IEEE Autom. Speech Recognit. Understand. Workshop, 2021

  20. [20]

    Asr model adaptation for rare words using synthetic data generated by multiple text-to-speech systems,

    K. C. Yuen, L. Haoyang, and C. E. Siong, “Asr model adaptation for rare words using synthetic data generated by multiple text-to-speech systems,” in Proc. APSIPA ASC, 2023, pp. 1771–1778

  21. [21]

    Asr data augmentation in low-resource settings using cross-lingual multi-speaker tts and cross-lingual voice conversion,

    E. Casanova, C. Shulby, A. Korolev, A. C. Junior, A. da Silva Soares, S. Aluísio, and M. A. Ponti, “Asr data augmentation in low-resource settings using cross-lingual multi-speaker tts and cross-lingual voice conversion,” in Proc. Interspeech, 2023, pp. 1244–1248

  22. [22]

    Neural codec language models are zero-shot text to speech synthesizers,

    C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

  23. [23]

    Stepaudio 2.5 technical report,

    B. Lin, B. Zhao, B. Wu, C. Yan, C. Wu, C. Yi, C. Yao, D. Liu, F. Tian, F. Tian et al., “Stepaudio 2.5 technical report,” arXiv preprint arXiv:2605.23463, 2026

  24. [24]

    Improving code-switching speech recognition with tts data augmentation,

    Y. H. Yeo, Y. Hu, S. Gopal, Y. Peng, H. Liu, and E. S. Chng, “Improving code-switching speech recognition with tts data augmentation,” in Proc. APSIPA ASC, 2025

  25. [25]

    Can we train asr systems on code-switch without real code-switch data? case study for singapore’s languages,

    T. Nguyen and H.-D. Tran, “Can we train asr systems on code-switch without real code-switch data? case study for singapore’s languages,” arXiv preprint arXiv:2506.14177, 2025

  26. [26]

    Cs-fleurs: A massively multilingual and code-switched speech dataset,

    B. Yan, I. Hamed, S. Shimizu, V. Lodagala, W. Chen, O. Iakovenko et al., “Cs-fleurs: A massively multilingual and code-switched speech dataset,” in Proc. Interspeech, 2025, pp. 743–747

  27. [27]

    On measuring the complexity of code-mixing,

    B. Gambäck and A. Das, “On measuring the complexity of code-mixing,” in Proc. ICON, Goa, India, 2014, pp. 1–7

  28. [28]

    Challenges and limitations with the metrics measuring the complexity of code-mixed text,

    V. Srivastava, M. Singh, M. Shrivastava, and D. M. Sharma, “Challenges and limitations with the metrics measuring the complexity of code-mixed text,” in Proc. Workshop Comput. Approaches Linguist. Code-Switching, 2021

  29. [29]

    Automatic detection of code-switching style from acoustics,

    S. K. Rallabandi, S. Sitaram, and A. W. Black, “Automatic detection of code-switching style from acoustics,” in Proc. Interspeech, 2018

  30. [30]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” arXiv preprint arXiv:2305.18290, 2023

  31. [31]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, et al., “Training language models to follow instructions with human feedback,” in Proc. NeurIPS, 2022

  32. [32]

    Gentse: Enhancing target speaker extraction via a coarse-to-fine generative language model,

    H. Li, X. Zhuang, A. Adnan, Y. Ni, W. Rao, S. Gopal, and E. S. Chng, “Gentse: Enhancing target speaker extraction via a coarse-to-fine generative language model,” arXiv preprint arXiv:2512.20978, 2025

  33. [33]

    Aligning generative speech enhancement with human preferences via direct preference optimization,

    H. Li, N. Hou, Y. Hu, J. Yao, S. M. Siniscalchi, and E. S. Chng, “Aligning generative speech enhancement with human preferences via direct preference optimization,” arXiv preprint arXiv:2507.09929, 2025

  34. [34]

    Mind-paced speaking: A dual-brain approach to real-time reasoning in spoken language models,

    D. Wu, H. Zhang, J. Chen, H. Liu, E. S. Chng, F. Tian, X. Yang, X. Zhang, D. Jiang, G. Yu et al., “Mind-paced speaking: A dual-brain approach to real-time reasoning in spoken language models,” arXiv preprint arXiv:2510.09592, 2025

  35. [35]

    Preference alignment improves language model-based tts,

    J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Preference alignment improves language model-based tts,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2025, pp. 1–5

  36. [36]

    Aligning speech to languages to enhance code-switching speech recognition,

    H. Liu, X. Zhang, H. Zhang, L. P. Garcia-Perera, A. W. H. Khong, E. S. Chng, and S. Watanabe, “Aligning speech to languages to enhance code-switching speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 4712–4725, 2025

  37. [37]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “Utmos: Utokyo-sarulab system for voicemos challenge 2022,” in Proc. Interspeech, 2022

  38. [38]

    Code-switching speech recognition under the lens: Model- and data-centric perspectives,

    H. Liu, H. Zhang, Q. Zhang, X. Zhang, D. Shi, E. S. Chng, and H. Li, “Code-switching speech recognition under the lens: Model- and data-centric perspectives,” IEEE Trans. Audio, Speech, Lang. Process., vol. 34, pp. 1853–1865, 2026

  39. [39]

    Cosyvoice 2: Scalable streaming speech synthesis with large language models,

    Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

  40. [40]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019

  41. [41]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015

  42. [42]

    ESPnet: End-to-End Speech Processing Toolkit,

    S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno et al., “ESPnet: End-to-End Speech Processing Toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211