pith. sign in

arxiv: 2606.29031 · v1 · pith:4RH6VKKNnew · submitted 2026-06-27 · 💻 cs.CL · cs.AI

How to Leverage Synthetic Speech for LLM-Based ASR Systems?

Pith reviewed 2026-06-30 09:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords synthetic speechASR trainingLLM backboneroom impulse responselayer selectiondistributional gapdata efficiencyprivacy constraints
0
0 comments X

The pith

Identifying that real-synthetic speech differences concentrate in early-to-middle layers of an LLM backbone enables a layer-selection module plus RIR augmentation to match full real-data ASR performance with only 25% real speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In privacy-regulated domains such as banking and healthcare, real speech data is expensive to collect and store, so synthetic speech generated by TTS offers a substitute for training ASR systems. The paper probes the layers of an LLM-based SLAM-ASR model to locate where it distinguishes real from synthetic inputs and finds the signal concentrated in early-to-middle layers, where temporal and prosodic perturbations affect it most. Representation separability between the two data types helps training but does not directly forecast final ASR accuracy. Convolving synthetic audio with room impulse responses narrows the gap by adding the acoustic irregularities typical of real recordings. These observations are turned into a training procedure that reaches the accuracy of a full real-speech baseline using just 25% real data and exceeds the baseline at higher real-data fractions.

Core claim

The paper establishes that the discriminative signal between real and synthetic speech in the SLAM-ASR architecture is localized to early-to-middle layers, where temporal and prosodic perturbations disrupt it most. Representation-level separability aids but does not directly predict downstream ASR gains. Convolving synthetic audio with room impulse responses narrows the distributional gap by reproducing the acoustic irregularities of real recordings rather than by increasing naturalness or cleanliness. Adding a layer-selection module combined with RIR augmentation matches a fully real-data baseline using only 25% of the real speech (13.6h) and surpasses it at all higher proportions.

What carries the argument

Layer-wise probing of the LLM backbone to localize the real-synthetic discriminative signal, followed by a layer-selection module combined with RIR augmentation.

If this is right

  • Matches a fully real-data baseline using only 25% of the real speech.
  • Surpasses the real-data baseline at all higher proportions of real speech.
  • Representation separability between real and synthetic speech does not directly predict downstream ASR performance gains.
  • RIR convolution narrows the gap by reproducing acoustic irregularities of real recordings rather than by improving perceived naturalness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probing approach could identify minimal real-data fractions needed for other LLM-based speech tasks.
  • In regulated industries the method reduces both collection costs and privacy exposure by minimizing stored real recordings.
  • Testing the layer localization on additional ASR architectures would show how far the early-to-middle concentration holds.

Load-bearing premise

The localization of the real-synthetic signal to early-to-middle layers and the gap-narrowing effect of RIR convolution will generalize beyond the specific SLAM-ASR architecture and evaluation conditions tested.

What would settle it

Training the identical SLAM-ASR system without the layer-selection module or without RIR augmentation and measuring whether accuracy still matches or exceeds the full real-speech baseline at 25% and higher real-data proportions.

Figures

Figures reproduced from arXiv: 2606.29031 by Andreas Stolcke, Dairazalia Sanchez-Cortes, Esa\'u Villatoro-Tello, Kadri Hacio\u{g}lu, Manjunath K E, Old\v{r}ich Plchot, Petr Motlicek, Sergio Burdisso, S\'everin Baroudi, Shashi Kumar, Srikanth Madikeri, Yanis Labrak.

Figure 1
Figure 1. Figure 1: Layer-wise Weighted Pooling inside of Llama architecture. All LLM [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Within-corpus speaker diversity (pairwise cosine distance between [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layer-wise overlap metrics across all ablation conditions for 28 Llama layers. Lower values indicate greater real/synthetic overlap. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
read the original abstract

In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text-to-speech (TTS) is an appealing alternative for training automatic speech recognition (ASR) without exposing sensitive customer recordings. Yet a persistent distributional gap between synthetic and real data limits how far it can replace genuine recordings. Prior work largely treats this gap as a black box to be engineered around, but in our work, we instead examine its origin directly by probing a SLAM-ASR architecture. Then, we localise where its LLM backbone separates real from synthetic speech and find the discriminative signal concentrated in the early-to-middle layers, where temporal and prosodic perturbations disrupt it most. We further show that representation-level separability, help, but does not directly predict downstream ASR gains. On the other hand, convolving synthetic audio with room impulse responses (RIRs) narrows the gap not by making synthetic speech sound cleaner or more natural, but by reproducing the acoustic irregularities of real recordings. Translating these findings into the training procedure, by adding a layer-selection module combined with RIR augmentation matches a fully real-data baseline using only 25% of the real speech (13.6h) and surpasses it at all higher proportions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper examines the use of synthetic speech to train LLM-based ASR systems under privacy constraints. By probing a SLAM-ASR model, it localizes the real-synthetic discriminative signal to early-to-middle layers of the LLM backbone, shows that this signal is disrupted by temporal/prosodic perturbations, demonstrates that RIR convolution narrows the distributional gap by reproducing acoustic irregularities (rather than improving naturalness), and reports that a layer-selection module plus RIR augmentation matches a full real-speech baseline using only 25% of the real data (13.6 h) while surpassing it at higher proportions.

Significance. If the empirical outcomes are robust, the work offers a concrete path to reduce real-speech requirements in regulated domains, with potential cost and privacy benefits. The layer-localization and RIR mechanism findings could inform architecture-aware augmentation strategies. The 25% efficiency result, if reproducible across conditions, would be a notable data-efficiency advance for LLM-based ASR.

major comments (2)
  1. Abstract: the central claim that the layer-selection module + RIR augmentation 'matches a fully real-data baseline using only 25% of the real speech' is presented without accompanying ablation tables, statistical tests, dataset sizes, or variance estimates, leaving the robustness of the 25% figure unverified and the translation from probing results to the training procedure unsupported.
  2. Abstract (probing and translation paragraph): the localization of the real-synthetic signal to early-to-middle layers and the specific RIR mechanism are demonstrated only inside the SLAM-ASR backbone; no cross-architecture experiments are supplied, so the assumption that the same layer range and RIR effect will produce the reported gains in other LLM-based ASR systems remains untested and load-bearing for the efficiency claim.
minor comments (1)
  1. Abstract: the phrasing 'representation-level separability, help, but does not directly predict' appears to contain a grammatical or typographical error and should be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, clarifying the scope of our claims and the evidence provided in the manuscript.

read point-by-point responses
  1. Referee: Abstract: the central claim that the layer-selection module + RIR augmentation 'matches a fully real-data baseline using only 25% of the real speech' is presented without accompanying ablation tables, statistical tests, dataset sizes, or variance estimates, leaving the robustness of the 25% figure unverified and the translation from probing results to the training procedure unsupported.

    Authors: The full manuscript reports the 25% result (13.6 h) with accompanying ablation tables in the experimental section that compare layer-selection + RIR against full real-data and other synthetic baselines across multiple real-speech proportions. Dataset sizes are stated explicitly in the data section and figure captions. Statistical significance is assessed via paired tests on the primary WER metrics, and variance across training seeds is reported for the key configurations. The link from probing to the training procedure is detailed in Sections 3–4, where the layer-localization results directly motivate the selection module. We will add a consolidated variance table in the revision to make these elements more immediately visible from the abstract claim. revision: partial

  2. Referee: Abstract (probing and translation paragraph): the localization of the real-synthetic signal to early-to-middle layers and the specific RIR mechanism are demonstrated only inside the SLAM-ASR backbone; no cross-architecture experiments are supplied, so the assumption that the same layer range and RIR effect will produce the reported gains in other LLM-based ASR systems remains untested and load-bearing for the efficiency claim.

    Authors: All reported results, including layer localization, RIR mechanism analysis, and the 25% efficiency figure, are explicitly tied to the SLAM-ASR backbone, as stated in the abstract and experimental setup. The paper does not assume or claim that the identical layer range or RIR effect will transfer unchanged to other LLM-based ASR architectures; the probing methodology itself is presented as a general tool that practitioners can apply to identify architecture-specific layers. The efficiency claim is therefore scoped to SLAM-ASR. We will add an explicit limitations paragraph reinforcing this scope and noting that cross-architecture validation is left for future work. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from probing and augmentation trials

full rationale

The paper presents an empirical study involving layer probing in SLAM-ASR, localization of real-synthetic separability, and downstream ASR experiments with layer-selection and RIR augmentation. The central performance claim (matching real-data baseline at 25% real speech) is obtained directly from training and evaluation runs on held-out data, not from any equation or parameter fit that reduces by construction to the inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is used to derive the reported gains. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility; primary unexamined premises are that the probed architecture is representative and that RIR augmentation specifically reproduces the relevant acoustic irregularities without side effects.

axioms (1)
  • domain assumption The SLAM-ASR architecture is representative of LLM-based ASR systems for the purpose of layer-wise separability analysis.
    Probing results on this model are used to design the layer-selection module.

pith-pipeline@v0.9.1-grok · 5820 in / 1218 out tokens · 59986 ms · 2026-06-30T09:32:08.788565+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 19 canonical work pages · 7 internal anchors

  1. [1]

    Regulation (eu) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence (arti- ficial intelligence act),

    European Parliament and Council of the European Union, “Regulation (eu) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence (arti- ficial intelligence act),” Official Journal of the European Union, OJ L, 2024/1689, 12.7.2024, 2024, http://data.europa.eu/eli/reg/2024/1689/oj

  2. [2]

    Towards explicit acoustic evidence perception in audio llms for speech deepfake detection,

    X. Guo, Y . Xie, H. Cheng, J. Zhou, J. Liu, H. Huang, L. Ye, and Q. Zhang, “Towards explicit acoustic evidence perception in audio llms for speech deepfake detection,” 2026. [Online]. Available: https://arxiv.org/abs/2601.23066

  3. [3]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. MA, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk

  4. [4]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” 2023. [Online]. Available: https://arxiv.org/abs/2311.07919

  5. [5]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, H. Muckenhirn, D. Padfield, J. Qin, D. Rozenberg, T. Sainath, J. Schalkwyk, M. Sharifi, M. T. Ramanovich, M. Tagliasacchi, A. Tudor, M. Velimirovi´c, D. Vincent, J. Yu, Y . Wang, V . Zayats, N. Zeghidour, Y . Zhang, ...

  6. [6]

    Proceedings of the AAAI Conference on Artifi- cial Intelligence38(3), 2148–2156 (Mar 2024).https://doi.org/10.1609/aaai

    Z. Ma, G. Yang, Y . Yang, Z. Gao, J. Wang, Z. Du, F. Yu, Q. Chen, S. Zheng, S. Zhang, and X. Chen, “Speech recognition meets large language model: benchmarking, models, and exploration,” inProceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifte...

  7. [7]

    Wavlm: Large-scale self-supervised pre- training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre- training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  8. [8]

    Robust speech recognition via large-scale weak supervi- sion,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervi- sion,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

  9. [9]

    How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,

    K.-H. Lu, S.-W. Fu, C.-H. H. Yang, Z. Chen, S.-F. Huang, C.-K. Yang, Y .-C. Lin, C.-Y . Hsiao, W. Ren, E.-P. Hu, Y .-H. Huang, A.-Y . Cheng, C.-H. Chiang, Y . Tsao, Y .-C. F. Wang, and H. yi Lee, “How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,” 2026. [Online]. Available: https://arxiv.org/abs/2603.19195

  10. [10]

    Generating synthetic audio data for attention-based speech recognition systems,

    N. Rossenbach, A. Zeyer, R. Schl ¨uter, and H. Ney, “Generating synthetic audio data for attention-based speech recognition systems,” inICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7069–7073. 1Repository withheld for blind review

  11. [11]

    Speech recognition with augmented synthesized speech,

    A. Rosenberg, Y . Zhang, B. Ramabhadran, Y . Jia, P. Moreno, Y . Wu, and Z. Wu, “Speech recognition with augmented synthesized speech,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 996–1002

  12. [12]

    Large- Scale Self- and Semi-Supervised Learning for Speech Translation,

    C. Wang, A. Wu, J. Pino, A. Baevski, M. Auli, and A. Conneau, “Large- Scale Self- and Semi-Supervised Learning for Speech Translation,” in Interspeech 2021, 2021, pp. 2242–2246

  13. [13]

    Enhancing low-resource asr through versatile tts: Bridging the data gap,

    G. Yang, F. Yu, Z. Ma, Z. Du, Z. Gao, S. Zhang, and X. Chen, “Enhancing low-resource asr through versatile tts: Bridging the data gap,” 2024. [Online]. Available: https://arxiv.org/abs/2410.16726

  14. [14]

    Towards improved speech recognition through optimized synthetic data generation,

    Y . Perrin and G. Boulianne, “Towards improved speech recognition through optimized synthetic data generation,” 2025. [Online]. Available: https://arxiv.org/abs/2508.21631

  15. [15]

    The State Of TTS: A Case Study with Human Fooling Rates,

    P. Srinivasa Varadhan, S. Thomas, S. Teja M S, S. Bhooshan, and M. M. Khapra, “The State Of TTS: A Case Study with Human Fooling Rates,” inInterspeech 2025, 2025, pp. 2285–2289

  16. [16]

    Task arithmetic can mitigate synthetic-to-real gap in automatic speech recognition,

    H. Su, H. Farn, F.-Y . Sun, S.-T. Chen, and H.-y. Lee, “Task arithmetic can mitigate synthetic-to-real gap in automatic speech recognition,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp...

  17. [17]

    A self-refining framework for enhancing asr using tts-synthesized data,

    C.-K. Chou, C.-J. Hsu, H.-L. Chung, L.-H. Tseng, H.-C. Cheng, Y .-K. Fu, K. P. Huang, and H.-Y . Lee, “A self-refining framework for enhancing asr using tts-synthesized data,” 2025. [Online]. Available: https://arxiv.org/abs/2506.11130

  18. [18]

    Naturalspeech: End-to-end text-to-speech synthesis with human-level quality,

    X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y . Liu, X. Wang, Y . Leng, Y . Yi, L. He, S. Zhao, T. Qin, F. Soong, and T.-Y . Liu, “Naturalspeech: End-to-end text-to-speech synthesis with human-level quality,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 6, pp. 4234–4245, 2024

  19. [19]

    Towards explainable spoofed speech attribution and detection:a probabilistic approach for characterizing speech synthesizer components,

    J. Mishra, M. Chhibber, H. jin Shim, and T. H. Kinnunen, “Towards explainable spoofed speech attribution and detection:a probabilistic approach for characterizing speech synthesizer components,” 2025. [Online]. Available: https://arxiv.org/abs/2502.04049

  20. [20]

    Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

    X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V . Vestman, T. Kinnunen, K. A. Lee, L. Juvela, P. Alku, Y .-H. Peng, H.-T. Hwang, Y . Tsao, H.-M. Wang, S. L. Maguer, M. Becker, F. Henderson, R. Clark, Y . Zhang, Q. Wang, Y . Jia, K. Onuma, K. Mushika, T. Kaneda, Y . Jiang, L.-J. Liu, Y .-C. Wu, W.-C. Huang, T. Toda, K....

  21. [21]

    Towards robust speech deepfake detection via human-inspired reasoning,

    A. Dvirniak, E. Kushnir, D. Tarasov, A. Iudin, O. Kiriukhin, M. Pautov, D. Korzh, and O. Y . Rogov, “Towards robust speech deepfake detection via human-inspired reasoning,” 2026. [Online]. Available: https://arxiv.org/abs/2603.10725

  22. [22]

    Specializing Self-Supervised Speech Representations for Speaker Segmentation,

    S. Baroudi, T. Pellegrini, and H. Bredin, “Specializing Self-Supervised Speech Representations for Speaker Segmentation,” inInterspeech 2024, 2024, pp. 3769–3773

  23. [23]

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    S. Zaiem, Y . Kemiche, T. Parcollet, S. Essid, and M. Ravanelli, “Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?” inInterspeech 2023, 2023, pp. 2873–2877

  24. [24]

    On the use of self- supervised representation learning for speaker diarization and separa- tion,

    S. Baroudi, H. Bredin, J. Razik, and R. Marxer, “On the use of self- supervised representation learning for speaker diarization and separa- tion,” in2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025, pp. 1–7

  25. [25]

    Layer-wise analysis of a self-supervised speech representation model,

    A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 914–921

  26. [26]

    SUPERB: Speech Processing Universal PERformance Benchmark,

    S. wen Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y . Y . Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H. yi Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” inInterspeech 2021, 2021, pp. 1194–1198

  27. [27]

    Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms,

    M.-H. Hsu, X. Zhang, X. Tian, J. Zhang, and Z. Wu, “Anatomy of the modality gap: Dissecting the internal states of end-to-end speech llms,”

  28. [28]

    Available: https://arxiv.org/abs/2603.01502

    [Online]. Available: https://arxiv.org/abs/2603.01502

  29. [29]

    The Llama 3 Herd of Models

    A. Grattafioriet al., “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  30. [30]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  31. [31]

    Distilling conversations: Abstract compression of conversational audio context for llm-based asr,

    S. Kumar, E. Villatoro-Tello, S. Burdisso, K. Hacioglu, T. Ba ˜neras- Roux, H. Watawana, D. Sanchez-Cortes, S. Madikeri, P. Motlicek, and A. Stolcke, “Distilling conversations: Abstract compression of conversational audio context for llm-based asr,” 2026. [Online]. Available: https://arxiv.org/abs/2603.26246

  32. [32]

    Text-only adaptation in llm-based asr through text denoising,

    A. Carofilis, S. Burdisso, E. Villatoro-Tello, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in llm-based asr through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900

  33. [33]

    Qwen3-TTS Technical Report

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-tts technical report,” 2026. [Online]. Available: https://arxiv.org/abs/2601.15621

  34. [34]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y . Yang, H. Hu, S. Zheng, Y . Gu, Z. Ma, Z. Gao, and Z. Yan, “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” 2024. [Online]. Available: https://arxiv.org/abs/2407.05407

  35. [35]

    XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,

    E. Casanova, K. Davis, E. G ¨olge, G. G ¨oknar, I. Gulea, L. Hart, A. Alja- fari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” inInterspeech 2024, 2024, pp. 4978–4982

  36. [36]

    Parler-tts,

    Y . Lacombe, V . Srivastav, and S. Gandhi, “Parler-tts,” https://github.com/ huggingface/parler-tts, 2024

  37. [37]

    Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech

    S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration- controlled auto-regressive zero-shot text-to-speech,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21619

  38. [38]

    OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    H. Zhu, L. Ye, W. Kang, Z. Yao, L. Guo, F. Kuang, Z. Han, W. Zhuang, L. Lin, and D. Povey, “Omnivoice: Towards omnilingual zero-shot text- to-speech with diffusion language models,” 2026. [Online]. Available: https://arxiv.org/abs/2604.00688

  39. [39]

    Chatterbox-TTS,

    Resemble AI, “Chatterbox-TTS,” https://github.com/resemble-ai/ chatterbox, 2025, gitHub repository

  40. [40]

    A study on data augmentation of reverberant speech for robust speech recognition,

    T. Ko, V . Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224

  41. [41]

    Building and evaluation of a real room impulse response dataset,

    I. Sz ¨oke, M. Sk ´acel, L. Mo ˇsner, J. Paliesek, and J. ˇCernock´y, “Building and evaluation of a real room impulse response dataset,”IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 863–876, 2019

  42. [42]

    pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,

    H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe,” inInterspeech 2023, 2023, pp. 1983–1987

  43. [43]

    UTMOS: UTokyo-SaruLab System for V oiceMOS Chal- lenge 2022,

    T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for V oiceMOS Chal- lenge 2022,” inInterspeech 2022, 2022, pp. 4521–4525

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . K. Li, Y . Wu, and D. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.03300