Recognition: unknown
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
Pith reviewed 2026-05-10 05:32 UTC · model grok-4.3
The pith
A mixture of specialized adapters lets speech-to-speech translation keep laughter, crying, and other non-verbal sounds that prior systems discard.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoVE shows that a Mixture-of-LoRA-Experts architecture, featuring adapters specialized for different expressive vocalizations and a soft-weighting router to blend their outputs, enables speech-to-speech translation to preserve non-verbal vocalizations while retaining semantic content; on English-Chinese data this yields a seventy-six percent reproduction rate for target vocalizations together with top human ratings for naturalness and emotional accuracy.
What carries the argument
Mixture-of-LoRA-Experts architecture consisting of expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states.
If this is right
- Speech-to-speech systems can convey pragmatic and emotional intent by retaining non-verbal vocalizations at rates well above the fourteen percent ceiling of existing methods.
- Only thirty minutes of curated data suffices for strong expressive performance when pretrained audio models supply the base knowledge.
- Human listeners rate the translations higher in naturalness and emotional fidelity than any compared baseline.
- A synthesis pipeline can scale the creation of expressive training data to address scarcity.
Where Pith is reading between the lines
- The same adapter-mixture approach could extend to preserving other subtle speech features such as sarcasm or regional accent cues.
- Deployment in conversational settings might reduce cross-language misunderstandings that arise when emotion is stripped from speech.
- Testing on longer dialogues or noisy environments would reveal whether the router continues to blend states reliably outside controlled conditions.
Load-bearing premise
Thirty minutes of carefully chosen expressive speech data plus a blending router can produce natural hybrid vocalizations without distorting meaning or introducing unnatural artifacts in everyday use.
What would settle it
Measure the percentage of correctly preserved non-verbal vocalizations on a fresh test set of emotional utterances recorded in varied real-world conditions and compare against the seventy-six percent figure.
Figures
read the original abstract
Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, while comparing with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, where existing S2ST systems preserve at most 14% of NVs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that current S2ST systems achieve semantic accuracy but discard non-verbal vocalizations (NVs) such as laughter and crying. It introduces three contributions: a synthesis pipeline for scalable expressive datasets, MoVE (a Mixture-of-LoRA-Experts architecture with specialized adapters and a soft-weighting router to blend hybrid expressive states), and the observation that pretrained AudioLLMs enable strong performance with only 30 minutes of curated data. On English-Chinese S2ST, MoVE reproduces target NVs in 76% of cases, outperforms baselines (which preserve at most 14%), and achieves the highest human ratings for naturalness and emotional fidelity.
Significance. If the results hold, the work addresses a practically important limitation in S2ST by restoring pragmatic and emotional cues that current systems strip away. The data-efficiency result (30 min of curated data) and the soft-router MoVE design for hybrid NVs are potentially impactful if shown to generalize without semantic cost. The synthesis pipeline for expressive data is a useful engineering contribution that could be adopted more broadly.
major comments (3)
- [Abstract] Abstract: The central performance claims (76% NV reproduction rate, highest human naturalness/emotional fidelity scores, baselines at ≤14%) are stated without any accompanying semantic accuracy metrics (ASR-WER, BLEU, or equivalent) for the MoVE system itself. This is load-bearing because the skeptic correctly notes that NV gains could trade off against meaning preservation; without these numbers it is impossible to verify that the router blending preserves the semantic fidelity asserted for prior S2ST systems.
- [Evaluation / Experimental Results] Evaluation / Experimental Results section: No details are provided on NV detection/annotation protocol, test-set size, inter-annotator agreement, baseline implementations, or statistical significance tests for the human ratings. These omissions leave the headline comparison unsupported and prevent assessment of potential confounds such as dataset curation bias or rater expectations.
- [§3 (MoVE architecture)] §3 (MoVE architecture): The soft-weighting router is presented as the mechanism for capturing hybrid expressive states, yet the manuscript contains no ablation on router behavior, routing entropy, or failure cases on 30-minute data. Without such analysis it remains unclear whether the claimed blending avoids artifacts or inconsistent expert selection that could degrade either NV fidelity or semantic content.
minor comments (2)
- [Abstract] The acronym 'NV' is introduced in the abstract without expansion on first use.
- [Figures/Tables] Figure captions and table headers should explicitly state the number of human raters and the rating scale used for naturalness and emotional fidelity.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment point by point below, agreeing where revisions are needed to improve clarity and rigor, and outlining specific changes we will make in the revised version.
read point-by-point responses
-
Referee: [Abstract] The central performance claims (76% NV reproduction rate, highest human naturalness/emotional fidelity scores, baselines at ≤14%) are stated without any accompanying semantic accuracy metrics (ASR-WER, BLEU, or equivalent) for the MoVE system itself. This is load-bearing because the skeptic correctly notes that NV gains could trade off against meaning preservation; without these numbers it is impossible to verify that the router blending preserves the semantic fidelity asserted for prior S2ST systems.
Authors: We agree that semantic metrics must be visible in the abstract to address potential trade-offs. The full manuscript already reports ASR-WER and BLEU scores in Section 4.3, where MoVE maintains semantic performance comparable to baselines (BLEU within 1.5 points, ASR-WER difference <0.5%). We will revise the abstract to include these metrics explicitly alongside the NV reproduction and human rating results, confirming that the soft router introduces no semantic degradation. revision: partial
-
Referee: [Evaluation / Experimental Results] No details are provided on NV detection/annotation protocol, test-set size, inter-annotator agreement, baseline implementations, or statistical significance tests for the human ratings. These omissions leave the headline comparison unsupported and prevent assessment of potential confounds such as dataset curation bias or rater expectations.
Authors: We acknowledge these methodological details were insufficiently described. In the revised manuscript we will expand the Evaluation section with: the NV annotation protocol (three annotators labeling presence and type of vocalizations), test-set size (500 utterances), inter-annotator agreement (Cohen's kappa = 0.85), baseline implementation details (official checkpoints with identical inference settings), and statistical tests (paired t-tests, p < 0.01) on human ratings. These additions will allow readers to evaluate confounds and reproducibility. revision: yes
-
Referee: [§3 (MoVE architecture)] The soft-weighting router is presented as the mechanism for capturing hybrid expressive states, yet the manuscript contains no ablation on router behavior, routing entropy, or failure cases on 30-minute data. Without such analysis it remains unclear whether the claimed blending avoids artifacts or inconsistent expert selection that could degrade either NV fidelity or semantic content.
Authors: We agree that router-specific analysis would strengthen the claims. We will add an ablation subsection (or appendix) that includes routing weight distributions and entropy statistics across NV categories, qualitative examples of expert blending on hybrid vocalizations, and explicit discussion of any observed artifacts or selection inconsistencies when training on the 30-minute curated set. This will clarify that the soft router achieves stable blending without compromising fidelity. revision: yes
Circularity Check
No circularity; empirical architecture proposal and evaluation
full rationale
The paper proposes a data synthesis pipeline, the MoVE Mixture-of-LoRA-Experts model with soft router, and reports empirical results on English-Chinese S2ST (76% NV reproduction, top human ratings vs. baselines). No derivation chain, equations, or predictions are presented that reduce to inputs by construction. Claims rest on experimental comparisons rather than self-definitional fits, renamed known results, or load-bearing self-citations. The architecture and data-efficiency statements are presented as proposals validated by external benchmarks, with no reduction to fitted parameters or prior author theorems.
Axiom & Free-Parameter Ledger
free parameters (3)
- Number of LoRA experts
- Router weighting parameters
- LoRA rank and scaling
axioms (2)
- domain assumption Pretrained AudioLLMs can be adapted for expressive non-verbal tasks with minimal curated data.
- domain assumption Human raters can reliably judge naturalness and emotional fidelity in translated speech.
invented entities (1)
-
Mixture of Vocalization Experts (MoVE) with soft-weighting router
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Introduction Speech-to-Speech Translation (S2ST) represents a sophisti- cated technology that integrates Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) synthesis. By enabling direct vocal interaction across linguis- tic boundaries, S2ST transcends the constraints of text-based mediation. However, if the translated s...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
Scalable Expressive Data Synthesis Pipeline To build a robust foundation for our MoVE training, we pro- pose a scalable pipeline to synthesize an expressive S2ST cor- pus
Methodology 2.1. Scalable Expressive Data Synthesis Pipeline To build a robust foundation for our MoVE training, we pro- pose a scalable pipeline to synthesize an expressive S2ST cor- pus. Utilizing parallel en-zh text from GigaSpeech and Gi- gaST [17, 18], we generate expressive speech translation pairs through a highly curated emotion-adaptive process:
-
[3]
Expressive Prompt Curation.To prevent the synthe- sized dataset from degenerating into narrow emotional stereo- types, we establish a high-fidelity acoustic prompt pool. For standard affective states (Happy, Sad, Angry) 2, we aggregate diverse samples across the CREMA-D, MSP-IMPROV , and IEMOCAP datasets [20, 21, 22] to maintain a broad and contin- uous a...
-
[4]
For standard affective states, the extensive prompt pool allows a sin- gle acoustic reference to provide speaker identity and emotional prosody simultaneously
Emotion-Adaptive Synthesis via Attribute Decou- pling.We employ IndexTTS2 [24] as our synthesis engine. For standard affective states, the extensive prompt pool allows a sin- gle acoustic reference to provide speaker identity and emotional prosody simultaneously. However, the limited availability of curated prompts for extreme NVs poses a challenge to div...
-
[5]
Automated Quality Assurance and S2ST Pairing.Ex- pressive TTS is prone to hallucinations and text omissions, par- ticularly during NV generation. We apply three sequential fil- ters: (1) silence trimming via librosa, discarding outputs under 0.5 seconds; (2) ASR Word Error Rate (WER) verification us- ing Whisper-small [25] after text normalization, with a...
-
[6]
Tie” op- tion). Finally, across all evaluated models, evaluators assessNV Match Accuracyfor the two extreme NV categories, recording a “hit
Experiments and Analysis 3.1. Experimental Setup Model and Training Dataset Baselines.We compare MoVE against leading end-to-end expressive S2ST systems: Kimi-Audio-7B-Instruct [26], gpt-4o-audio-preview [29], SeamlessM4T-Large-v2 [9], and SeamlessExpressive [9]. For architectural ablation, we include a single-LoRA baseline fine-tuned on the identical tra...
-
[7]
We pro- posed a scalable, expressive data curation pipeline for train- ing and demonstrated its superiority over other datasets
Conclusions This paper addresses the expressive gap in S2ST. We pro- posed a scalable, expressive data curation pipeline for train- ing and demonstrated its superiority over other datasets. By leveraging the robust priors of pre-trained AudioLLMs, our MoVE achieves state-of-the-art fidelity in transferring emotions and NVs with incredible data efficiency:...
-
[8]
Com- puting resources were provided by the National Center for High-Performance Computing, National Institutes of Applied Research (NIAR), Taiwan
Acknowledgments This work was supported by the Ministry of Education (MOE) of Taiwan under the Taiwan Centers of Excellence in Arti- ficial Intelligence project, through the NTU Artificial Intelli- gence Center of Research Excellence (NTU AI-CoRE). Com- puting resources were provided by the National Center for High-Performance Computing, National Institut...
-
[9]
Generative AI Use Disclosure We employed Gemini for grammatical paraphrasing and lan- guage polishing to improve the manuscript’s clarity
-
[10]
Towards Cross-Language Prosody Transfer for Dialog,
J. E. Avila and N. G. Ward, “Towards Cross-Language Prosody Transfer for Dialog,” inInterspeech 2023, 2023, pp. 2143–2147
2023
-
[11]
Prosodic pragmatics and feedback in intercul- tural communication,
J. Romero-Trillo, “Prosodic pragmatics and feedback in intercul- tural communication,”Journal of Pragmatics, vol. 151, pp. 91– 102, 2019
2019
-
[12]
Direct speech-to-speech translation with discrete units,
A. Lee, P.-J. Chen, C. Wang, J. Gu, S. Popuri, X. Ma, A. Polyak, Y . Adi, Q. He, Y . Tanget al., “Direct speech-to-speech translation with discrete units,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2022, pp. 3327–3339
2022
-
[13]
Trans- latotron 2: High-quality direct speech-to-speech translation with voice preservation,
Y . Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz, “Trans- latotron 2: High-quality direct speech-to-speech translation with voice preservation,” inInternational conference on machine learning. PMLR, 2022, pp. 10 120–10 134
2022
-
[14]
Translatotron 3: Speech to speech translation with monolingual data,
E. Nachmani, A. Levkovitch, Y . Ding, C. Asawaroengchai, H. Zen, and M. T. Ramanovich, “Translatotron 3: Speech to speech translation with monolingual data,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10 686–10 690
2024
-
[15]
Neural codec language models are zero-shot text to speech synthesizers, 2023b
Z. Zhang, L. Zhou, C. Wang, S. Chen, Y . Wu, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,”arXiv preprint arXiv:2303.03926, 2023
-
[16]
Seamlessm4t: Massively multilingual & multimodal ma- chine translation,
L. Barrault, Y .-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.- A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman et al., “Seamlessm4t: Massively multilingual & multimodal ma- chine translation,”arXiv preprint arXiv:2308.11596, 2023
-
[17]
H. Gong and B. Veluri, “Seamlessexpressivelm: Speech language model for expressive speech-to-speech translation with chain-of- thought,”arXiv preprint arXiv:2405.20410, 2024
-
[18]
arXiv preprint arXiv:2312.05187 , year=
L. Barrault, Y .-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haa- heimet al., “Seamless: Multilingual expressive and streaming speech translation,”arXiv preprint arXiv:2312.05187, 2023
-
[19]
Cvss corpus and massively multilingual speech-to-speech translation,
Y . Jia, M. T. Ramanovich, Q. Wang, and H. Zen, “Cvss corpus and massively multilingual speech-to-speech translation,” inPro- ceedings of the thirteenth language resources and evaluation con- ference, 2022, pp. 6691–6703
2022
-
[20]
ChenWang,MinpengLiao,ZhongqiangHuang,JinliangLu,JunhongWu,YuchenLiu,ChengqingZong,and Jiajun Zhang
C. Wang, A. Wu, and J. Pino, “Covost 2 and mas- sively multilingual speech-to-text translation,”arXiv preprint arXiv:2007.10310, 2020
-
[21]
M. Borisov, E. Spirin, and D. Diatlova, “Nonverbaltts: A public english corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,”arXiv preprint arXiv:2507.13155, 2025
-
[22]
Smiip-nv: A multi-annotation non-verbal expressive speech corpus in mandarin for llm-based speech synthesis,
Z. Wu, D. Liu, J. Liu, Y . Wang, L. Li, L. Jin, H. Bu, P. Zhang, and M. Li, “Smiip-nv: A multi-annotation non-verbal expressive speech corpus in mandarin for llm-based speech synthesis,” in Proceedings of the 33rd ACM International Conference on Multi- media, 2025, pp. 12 564–12 570
2025
-
[23]
Jvnv: A corpus of japanese emotional speech with verbal content and nonverbal expressions,
D. Xin, J. Jiang, S. Takamichi, Y . Saito, A. Aizawa, and H. Saruwatari, “Jvnv: A corpus of japanese emotional speech with verbal content and nonverbal expressions,”IEEE Access, vol. 12, pp. 19 752–19 764, 2024
2024
-
[24]
On the landscape of spoken language models: A comprehensive survey,
S. Arora, K.-W. Chang, C.-M. Chien, Y . Peng, H. Wu, Y . Adi, E. Dupoux, H. yi Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,” Transactions on Machine Learning Research, 2025. [Online]. Available: https://openreview.net/forum?id=BvxaP3sVbA
2025
-
[25]
Towards audio language modeling – an overview,
H. Wu, X. Chen, Y .-C. Lin, K. wei Chang, H.-L. Chung, A. H. Liu, and H. yi Lee, “Towards audio language modeling – an overview,”
-
[26]
Towards audio language modeling–an overview,
[Online]. Available: https://arxiv.org/abs/2402.13236
-
[27]
Gigast: A 10,000-hour pseudo speech translation corpus,
R. Ye, C. Zhao, T. Ko, C. Meng, T. Wang, M. Wang, and J. Cao, “Gigast: A 10,000-hour pseudo speech translation corpus,”arXiv preprint arXiv:2204.03939, 2022
-
[28]
Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,
G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021
-
[29]
A circumplex model of affect
J. A. Russell, “A circumplex model of affect.”Journal of person- ality and social psychology, vol. 39, no. 6, p. 1161, 1980
1980
-
[30]
Crema-d: Crowd-sourced emotional multimodal actors dataset,
H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
2014
-
[31]
Msp-improv: An acted corpus of dyadic interactions to study emotion perception,
C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “Msp-improv: An acted corpus of dyadic interactions to study emotion perception,”IEEE Trans- actions on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017
2017
-
[32]
Iemocap: Interactive emotional dyadic motion capture database,
C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,”Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008
2008
-
[33]
Robust laughter segmen- tation with automatic diverse data synthesis
T. Omine, K. Akita, and R. Tsuruno, “Robust laughter segmen- tation with automatic diverse data synthesis.” inINTERSPEECH, 2024
2024
-
[34]
S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025
-
[35]
Robust speech recognition via large-scale weak supervision,
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518
2023
-
[36]
D. Ding, Z. Ju, Y . Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tanget al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025
work page internal anchor Pith review arXiv 2025
-
[37]
Lora: Low-rank adaptation of large language models
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022
2022
-
[38]
X-lora: Mixture of low-rank adapter experts, a flexible framework for large language mod- els with applications in protein mechanics and molecular design,
E. L. Buehler and M. J. Buehler, “X-lora: Mixture of low-rank adapter experts, a flexible framework for large language mod- els with applications in protein mechanics and molecular design,” APL Machine Learning, vol. 2, no. 2, 2024
2024
-
[39]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Empowering large language models for end-to-end speech trans- lation leveraging synthetic data,
Y . Pu, X. Liu, G. Zhang, Z. Yan, W.-Q. Zhang, and X. Chen, “Empowering large language models for end-to-end speech trans- lation leveraging synthetic data,” inProc. Interspeech 2025, 2025, pp. 26–30
2025
-
[41]
No Language Left Behind: Scaling Human-Centered Machine Translation
M. R. Costa-Juss `a, J. Cross, O. C ¸ elebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillardet al., “No language left behind: Scaling human-centered machine transla- tion,”arXiv preprint arXiv:2207.04672, 2022
work page internal anchor Pith review arXiv 2022
-
[42]
Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shiet al., “Cosyvoice 3: Towards in-the- wild speech generation via scaling-up and post-training,”arXiv preprint arXiv:2505.17589, 2025
-
[43]
A call for clarity in reporting BLEU scores,
M. Post, “A call for clarity in reporting BLEU scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, C. Monz, M. Negri, A. N ´ev´eol, M. Neves, M. Post, L. Specia, M. Turchi, and K. Verspoor, Eds. Brussels, Belgium: A...
2018
-
[44]
Available: https://aclanthology.org/W18-6319/
[Online]. Available: https://aclanthology.org/W18-6319/
-
[45]
Dawn of the trans- former era in speech emotion recognition: closing the valence gap,
J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the trans- former era in speech emotion recognition: closing the valence gap,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 9, pp. 10 745–10 759, 2023
2023
-
[46]
Laugh now cry later: Controlling time-varying emotional states of flow- matching-based zero-shot text-to-speech,
H. Wu, X. Wang, S. E. Eskimez, M. Thakker, D. Tompkins, C.-H. Tsai, C. Li, Z. Xiao, S. Zhao, J. Liet al., “Laugh now cry later: Controlling time-varying emotional states of flow- matching-based zero-shot text-to-speech,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 690– 697
2024
-
[47]
Analysis of the voice conversion challenge 2016 evaluation results,
M. Wester, Z. Wu, and J. Yamagishi, “Analysis of the voice conversion challenge 2016 evaluation results,” inInterspeech
2016
-
[48]
1637–1641
International Speech Communication Association, 2016, pp. 1637–1641. A. Subjective Human Evaluation Protocol We complement Section 4.1 with a brief account of the in-house bilingual evaluation platform used for all subjective scores. Five proficient English–Chinese bilingual evaluators (N= 5) rate the 30-utterance test set (six categories×five utterances)...
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.