pith. machine review for the scientific record.

arxiv: 2604.16211 · v2 · submitted 2026-04-17 · 💻 cs.SD

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

Pith reviewed 2026-05-10 07:33 UTC · model grok-4.3

classification 💻 cs.SD
keywords non-verbal vocalizations · speech synthesis · TTS benchmark · controllability evaluation · placement and salience · bilingual dataset · multi-axis protocol · audio naturalness

The pith

NVBench introduces a unified benchmark with a 45-type taxonomy to evaluate non-verbal vocalizations like laughs and sighs in speech synthesis separately from overall quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NVBench as a bilingual benchmark that pairs a standardized 45-type taxonomy of non-verbal vocalizations with a curated dataset and a multi-axis evaluation protocol. This setup separates general speech naturalness from specific measures of NVV controllability, correct placement, and salience. By applying the benchmark to 15 TTS systems through objective metrics, listening tests, and LLM-based ratings, the work shows that NVV control often operates independently of speech quality while highlighting persistent difficulties with low-SNR cues and extended affective sounds. A reader would care because without such a shared framework, progress on expressive, human-like synthetic speech remains hard to measure or compare across different system designs.

Core claim

NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework by combining a 45-type taxonomy of non-verbal vocalizations, a bilingual English/Chinese dataset, and a multi-axis protocol that isolates NVV controllability, placement, and salience from general speech quality.

What carries the argument

The multi-axis evaluation protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience.

If this is right

  • Controllability of non-verbal vocalizations often decouples from general speech quality, allowing targeted improvements.
  • Low-SNR oral cues and long-duration affective NVVs remain consistent bottlenecks across systems.
  • Objective metrics, listening tests, and LLM-based multi-rater methods can be combined for more reliable assessment.
  • Diverse control interfaces in TTS can now be compared directly within one standardized setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The benchmark could serve as a template for evaluating other expressive elements such as prosody or emotion in synthetic audio.
  • Results on English and Chinese data suggest the framework might adapt to additional languages to test cultural differences in vocalization use.
  • Integration with existing TTS quality benchmarks could produce more complete evaluation pipelines for end-to-end system development.

Load-bearing premise

The 45-type taxonomy together with the separation of quality from NVV controllability, placement, and salience fully captures the important dimensions of non-verbal vocalization performance without missing key cases or introducing evaluation biases.

What would settle it

A set of human preference ratings or real-world listening scenarios where systems ranked high on the NVBench axes perform poorly on naturalness or expressiveness, or where important vocalization behaviors fall outside the 45 types.

Figures

Figures reproduced from arXiv: 2604.16211 by Boyi Kang, Hung-yi Lee, Jiahao Pan, Jingbin Hu, Liumeng Xue, Shuai Wang, Weizhen Bian, Wenxuan Wang, Xinyuan Qian, Yike Guo, Yilin Ren, Ziyang Ma.

Figure 1: The overview of NVBench. … representative TTS systems (8 tag-based systems and 7 prompt-based systems), measured via objective metrics, human listening tests, and an LLM-based multi-rater evaluation. Results indicate that NVV controllability often decouples from overall speech quality and differs markedly across control interfaces, and highlight low-SNR oral cues and long-duration affective NVVs as persist…
Figure 2: NVV perceptual effect heatmaps for EN (left) and ZH (right) under tag-based (upper) and prompt-based (bottom) TTS systems. For prompt-based TTS, we synthesize paired samples from NVV-aware captions and their neutral counterparts, where NVV cues are removed while preserving the same propositional content. For tag-based TTS, we synthesize paired samples from plain text and the same text with inserted NVV t…
Original abstract

Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVVs controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces NVBench, a bilingual (English/Chinese) benchmark for speech synthesis with non-verbal vocalizations (NVVs). It defines a unified 45-type taxonomy, curates a corresponding dataset, and proposes a multi-axis evaluation protocol that separates general speech naturalness/quality from NVV-specific controllability, placement, and salience. The authors benchmark 15 TTS systems via objective metrics, listening tests, and LLM-based multi-rater evaluation, reporting that NVV controllability often decouples from overall quality while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. The central claim is that NVBench enables fair, standardized cross-system comparisons across diverse control interfaces.

Significance. If the protocol and dataset prove robust, NVBench would address a clear gap in TTS evaluation by providing a standardized, multi-dimensional framework for non-verbal elements that are essential to natural, human-like speech. The separation of quality from controllability axes is a conceptual strength that could lead to more targeted system improvements than single-score metrics. Benchmarking 15 systems supplies a useful community baseline, and the bilingual scope broadens applicability. The combination of objective, subjective, and LLM-based methods is a positive design choice.

major comments (2)
  1. [§3] §3 (Benchmark and Dataset Construction): The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.
  2. [§4] §4 (Evaluation Protocol and Results): The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.
minor comments (3)
  1. [Table 1] Table 1 or equivalent taxonomy overview: Include one or two concrete audio examples or phonetic descriptions per major NVV category to improve reader intuition without lengthening the text.
  2. [LLM evaluation] LLM-based evaluation subsection: Report prompt templates, temperature settings, and inter-rater agreement (e.g., Fleiss' kappa) between LLM and human raters to allow assessment of reliability.
  3. [Figure 3] Figure 3 (system comparison): Ensure axis labels explicitly distinguish the four evaluation dimensions and include error bars or significance markers for the 15-system results.
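
The agreement statistic named in minor comment 2, Fleiss' kappa, is computed from a table that records, for each rated item, how many raters chose each category. The sketch below is a minimal pure-Python illustration of that computation. The function name `fleiss_kappa` and the toy rating table are assumptions made for this example; they are not taken from the paper, and the counts are placeholders rather than reported results. Fleiss' kappa assumes every item is rated by the same number of raters, so a mixed pool of LLM and human raters would need to satisfy that constraint.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance from the category shares.
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        # All ratings landed in one category, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Placeholder table (hypothetical): 4 clips, 3 raters, 2 categories
# ("NVV perceived", "NVV not perceived").
toy_counts = [[3, 0], [2, 1], [0, 3], [3, 0]]
print(round(fleiss_kappa(toy_counts), 3))
```

Values near 1 indicate agreement well above chance and values near 0 indicate chance-level agreement, which is why the referee asks for the statistic alongside raw percentage agreement.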

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of NVBench and the recommendation for minor revision. The comments are constructive and will help strengthen the transparency of our taxonomy construction and statistical reporting. We respond to each major comment below.

Point-by-point responses
  1. Referee: [§3] The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.

    Authors: We appreciate the referee's emphasis on methodological transparency. While §3 describes the taxonomy and its motivation, we agree that expanded details on construction will better support the unification claim. In the revised manuscript we will add: (1) explicit selection criteria for the 45 types, grounded in linguistic frequency data, psychological taxonomies of vocal affect, and cross-lingual coverage considerations; (2) inter-annotator agreement statistics (Fleiss' kappa and percentage agreement) from the multi-annotator validation process; and (3) per-language balance tables showing sample counts, duration distributions, and any balancing procedures applied across English and Chinese subsets. These additions will directly address potential selection-bias concerns and allow readers to evaluate the taxonomy's representativeness. revision: yes

  2. Referee: [§4] The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.

    Authors: We concur that precise statistical documentation is required to substantiate the decoupling result. The observed decoupling was quantified via rank-based correlation between per-system NVV controllability scores and overall quality scores. In the revised manuscript we will explicitly report: the correlation coefficient used (Spearman's ρ), the statistical test and p-value thresholds applied, and 95% confidence intervals for the key correlations as well as for the bottleneck analyses on low-SNR oral cues and long-duration affective NVVs. These details will be added to §4 and the results tables, enabling independent verification while preserving the original interpretation. revision: yes
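
The rank-correlation check described in the simulated rebuttal can be sketched as follows. This is a minimal pure-Python illustration, assuming one controllability score and one quality score per system. The helper names, the percentile bootstrap used for the interval, and the placeholder scores are all assumptions for this example; they are not the authors' analysis code or data.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling systems."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop degenerate resamples where rho is NaN
            stats.append(rho)
    if not stats:
        raise ValueError("no valid bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Placeholder per-system scores (hypothetical, not results from the paper).
controllability = [0.82, 0.40, 0.65, 0.91, 0.30, 0.55]
quality = [3.9, 4.2, 3.5, 4.0, 4.1, 3.6]

rho = spearman(controllability, quality)
low, high = bootstrap_ci(controllability, quality)
print(f"rho={rho:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A rho near zero with an interval that spans zero would be consistent with decoupling, while a narrow interval around a large positive rho would argue against it. With only 15 systems the interval is likely to be wide, which is one reason the confidence intervals matter for interpreting the claim.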

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This is a benchmark paper that defines a new 45-type taxonomy, curates a bilingual dataset, and proposes a multi-axis evaluation protocol separating general quality from NVV controllability, placement, and salience. No mathematical derivations, parameter fits, or predictions are present that could reduce to the authors' own inputs by construction. The central claim—that NVBench enables fair cross-system comparison—is realized directly by the benchmark's design and application to 15 external TTS systems, rather than by any self-referential loop or self-citation chain. Any references to prior TTS or NVV literature serve only as background and are not load-bearing for the framework's validity or results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an applied benchmark paper whose central contribution is dataset curation and protocol design rather than mathematical derivation; it relies on standard speech quality metrics extended to NVVs and on the assumption that human and LLM raters can reliably separate the targeted dimensions.

axioms (1)
  • domain assumption: Standard objective and subjective speech quality metrics remain valid when non-verbal vocalizations are added to utterances.
    The multi-axis protocol treats general naturalness as separable from NVV-specific scores, inheriting this separation from prior TTS evaluation practice.

pith-pipeline@v0.9.0 · 5502 in / 1365 out tokens · 80871 ms · 2026-05-10T07:33:33.912989+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

    eess.AS · 2026-04 · unverdicted · novelty 7.0

    MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Introduction. Text-to-speech (TTS) has progressed rapidly from intelligible speech to expressive speech generation. Recent large-scale speech language modeling and codec-based generation paradigms have further improved perceptual quality, speaker similarity, and controllability, pushing synthetic speech toward increasingly human-like delivery [1, 2, 3]....

  2. [2]

    he might cough

    Non-verbal Vocalization Benchmark. 2.1. Benchmark Overview. We introduce the Non-verbal Vocalization Benchmark (NVBench), a standardized evaluation suite for assessing a TTS system’s ability to synthesize non-verbal vocalizations (NVVs) beyond lexical content. The overview of NVBench is presented in Figure 1. Given an input utterance, NVBench supports two...

  3. [3]

    We benchmark 15 TTS systems, including 7 prompt-based and 8 tag-based systems, offering a diverse evaluation spectrum

    TTS systems. Existing TTS systems that support speech generation with NVVs can be divided into two categories: prompt-based and tag-based systems. We benchmark 15 TTS systems, including 7 prompt-based and 8 tag-based systems, offering a diverse evaluation spectrum. Commercial models demonstrate strong industrial performance, while open-source systems e...

  4. [4]

    hallucinate

    Results and Analysis. 4.1. Objective Results: Robust Trends and NVV-Specific Confounders. The objective results for both prompt-based and tag-based systems are summarized in Table 3. Across three independent synthesis runs, most measures exhibit small run-to-run variance, indicating that the systems are stable. Prompt-based objective results. Results reve...

  5. [5]

    Conclusion. In this work, we introduce NVBench, a bilingual benchmark for evaluating NVV-capable speech synthesis. NVBench covers a unified 45-type NVV taxonomy and a multi-axis evaluation protocol, which separates general speech naturalness and quality from NVV controllability and perceptual salience. We benchmark 15 TTS systems via objective metrics,...

  6. [6]

    First, LLMs were used in benchmark dataset generation to draft candidate texts and speech captions, which were then reviewed, filtered, and finalized by the authors

    Generative AI Use Disclosure. We used large language models (LLMs) to assist three components of this work. First, LLMs were used in benchmark dataset generation to draft candidate texts and speech captions, which were then reviewed, filtered, and finalized by the authors. Second, LLMs were used to support evaluation TTS systems through LLM-based judging...

  7. [7]

    Recent advances in speech language models: A survey,

    W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King, “Recent advances in speech language models: A survey,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 13943–13970

  8. [8]

    Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710, 2025

    X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-TTS: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025

  9. [9]

    Hierarchical semantic-acoustic modeling via semi-discrete residual representations for expressive end-to-end speech synthesis,

    Y. Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Li, Z. Wu, and Z. Liu, “Hierarchical semantic-acoustic modeling via semi-discrete residual representations for expressive end-to-end speech synthesis,” in International Conference on Learning Representations (ICLR), 2026

  10. [10]

    The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,

    F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015

  11. [11]

    Nonverbaltts: A public english corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,

    M. Borisov, E. Spirin, and D. Diatlova, “NonverbalTTS: A public english corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,” arXiv preprint arXiv:2507.13155, 2025

  12. [12]

    A scalable pipeline for enabling non-verbal speech generation and understanding. arXiv preprint arXiv:2508.05385, 2025

    R. Ye, Y. Zhou, R. Yu, Z. Lin, K. Li, X. Li, X. Liu, G. Zeng, and Z. Wu, “A scalable pipeline for enabling non-verbal speech generation and understanding,” arXiv preprint arXiv:2508.05385, 2025

  13. [13]

    SMIIP-NV: A multi-annotation non-verbal expressive speech corpus in mandarin for llm-based speech synthesis,

    Z. Wu, D. Liu, J. Liu, Y. Wang, L. Li, L. Jin, H. Bu, P. Zhang, and M. Li, “SMIIP-NV: A multi-annotation non-verbal expressive speech corpus in mandarin for llm-based speech synthesis,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 12564–12570

  14. [14]

    Nvspeech: An integrated and scalable pipeline for human-like speech modeling with paralinguistic vocalizations. arXiv preprint arXiv:2508.04195, 2025

    H. Liao, Q. Ni, Y. Wang, Y. Lu, H. Zhan, P. Xie, Q. Zhang, and Z. Wu, “Nvspeech: An integrated and scalable pipeline for human-like speech modeling with paralinguistic vocalizations,” arXiv preprint arXiv:2508.04195, 2025

  15. [15]

    Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu

    J. Mai, J. Ji, X. Xing, C. Yang, W. Chen, J. Xing, and X. Xu, “MNV-17: A high-quality performative mandarin dataset for nonverbal vocalization recognition in speech,” arXiv preprint arXiv:2509.18196, 2025

  16. [16]

    WESR: Scaling and evaluating word-level event-speech recognition,

    C. Yang, K. Huang, L. Fan, Q. Tu, B. Jiang, D. Zhang, L. Yin, S. Li, Z. Fei, Q. Cheng et al., “WESR: Scaling and evaluating word-level event-speech recognition,” arXiv preprint arXiv:2601.04508, 2026

  17. [17]

    ParaLBench: A large-scale benchmark for computational paralinguistics over acoustic foundation models,

    Z. Zhang, W. Xu, Z. Dong, K. Wang, Y. Wu, J. Peng, R. Wang, and D.-Y. Huang, “ParaLBench: A large-scale benchmark for computational paralinguistics over acoustic foundation models,” arXiv preprint arXiv:2411.09349, 2024

  18. [18]

    InstructTTSEval: Benchmarking complex natural-language instruction following in text-to-speech systems,

    K. Huang, Q. Tu, L. Fan, C. Yang, D. Zhang, S. Li, Z. Fei, Q. Cheng, and X. Qiu, “InstructTTSEval: Benchmarking complex natural-language instruction following in text-to-speech systems,” arXiv preprint arXiv:2506.16381, 2025

  19. [19]

    S2S-Arena: Evaluating Paralinguistic Instruction Following in Speech-to-Speech Models

    F. Jiang, Z. Lin, F. Bu, Y. Du, B. Wang, and H. Li, “S2S-Arena, evaluating speech2speech protocols on instruction following with paralinguistic information,” arXiv preprint arXiv:2503.05085, 2025

  20. [20]

    Paras2s: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction. arXiv preprint arXiv:2511.08723, 2025

    S.-w. Yang, M. Tu, A. T. Liu, X. Qu, H.-y. Lee, L. Lu, Y. Wang, and Y. Wu, “ParaS2S: Benchmarking and aligning spoken language models for paralinguistic-aware speech-to-speech interaction,” arXiv preprint arXiv:2511.08723, 2025

  21. [21]

    Wavbench: Benchmarking reasoning, colloquialism, and paralinguistics for end-to-end spoken dialogue models. arXiv preprint arXiv:2602.12135, 2026

    Y. Li, S. Ji, Y. Chen, T. Liang, H. Ying, Y. Wang, J. Li, J. Fang, and Z. Zhao, “WavBench: Benchmarking reasoning, colloquialism, and paralinguistics for end-to-end spoken dialogue models,” arXiv preprint arXiv:2602.12135, 2026

  22. [22]

    Nv-bench: Benchmark of nonverbal vocalization synthesis for expressive text-to-speech generation. arXiv preprint arXiv:2603.15352, 2026

    Q. Ni, H. Liao, D. Chen, Y. Wang, and Z. Wu, “Nv-bench: Benchmark of nonverbal vocalization synthesis for expressive text-to-speech generation,” arXiv preprint arXiv:2603.15352, 2026

  23. [23]

    ChatTTS: A generative speech model for daily dialogue,

    2noise, “ChatTTS: A generative speech model for daily dialogue,” https://github.com/2noise/ChatTTS, 2024, GitHub repository (accessed 2026-02-26)

  24. [24]

    Higgs Audio: Text-audio foundation model from boson ai,

    Boson AI, “Higgs Audio: Text-audio foundation model from boson ai,” https://github.com/boson-ai/higgs-audio, 2025, GitHub repository (accessed 2026-02-26)

  25. [25]

    Bark: Text-prompted generative audio model,

    Suno AI, “Bark: Text-prompted generative audio model,” https://github.com/suno-ai/bark, 2023, GitHub repository (accessed 2026-02-26)

  26. [26]

    Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis,

    S. Liao, Y. Wang, T. Li, Y. Cheng, R. Zhang, R. Zhou, and Y. Xing, “Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis,” arXiv preprint arXiv:2411.01156, 2024

  27. [27]

    Orpheus-TTS: Towards human-sounding speech,

    Canopy AI, “Orpheus-TTS: Towards human-sounding speech,” https://github.com/canopyai/Orpheus-TTS, 2025, GitHub repository (accessed 2026-02-26)

  28. [28]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

  29. [29]

    Elevenlabs documentation: Models,

    ElevenLabs, “Elevenlabs documentation: Models,” https://elevenlabs.io/docs/overview/models, 2026, documentation page (accessed 2026-02-26)

  30. [30]

    Dia: A TTS model capable of generating ultra-realistic dialogue in one pass,

    Nari Labs, “Dia: A TTS model capable of generating ultra-realistic dialogue in one pass,” https://github.com/nari-labs/dia, 2025, GitHub repository (accessed 2026-02-26)

  31. [31]

    SynParaSpeech: Automated synthesis of paralinguistic datasets for speech generation and understanding,

    B. Bai, Q. Lu, W. Yang, Z. Sun, Y. Hou, P. Jia, S. Pu, R. Fu, Y. Gao, Y. Li, and J. Gao, “SynParaSpeech: Automated synthesis of paralinguistic datasets for speech generation and understanding,” 2025

  32. [32]

    Capspeech: Enabling downstream applications in style-captioned text-to-speech,

    H. Wang, J. Hai, D. Chong, K. Thakkar, T. Feng, D. Yang, J. Lee, T. Thebaud, L. M. Velazquez, J. Villalba et al., “Capspeech: Enabling downstream applications in style-captioned text-to-speech,” arXiv preprint arXiv:2506.02863, 2025

  33. [33]

    Acoustics of breath noises in human speech: Descriptive and three-dimensional modeling approaches,

    R. Werner, S. Fuchs, J. Trouvain, S. Kürbis, B. Möbius, and P. Birkholz, “Acoustics of breath noises in human speech: Descriptive and three-dimensional modeling approaches,” Journal of Speech, Language, and Hearing Research, vol. 67, no. 10S, pp. 3947–3961, 2024

  34. [34]

    Voices without words: the spectrum of nonverbal vocalisations,

    R. G. Kamiloğlu and D. A. Sauter, “Voices without words: the spectrum of nonverbal vocalisations,” European Review of Social Psychology, pp. 1–36, 2024

  35. [35]

    The acm multimedia 2022 computational paralinguistics challenge: Vocalisations, stuttering, activity, & mosquitoes,

    B. Schuller, A. Batliner, S. Amiriparian, C. Bergler, M. Gerczuk, N. Holz, P. Larrouy-Maestri, S. Bayerl, K. Riedhammer, A. Mallol-Ragolta et al., “The acm multimedia 2022 computational paralinguistics challenge: Vocalisations, stuttering, activity, & mosquitoes,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7120–7124

  36. [36]

    Acoustic analysis of several laughter types in conversational dialogues,

    K. Wang, C. Ishi, and R. Hayashi, “Acoustic analysis of several laughter types in conversational dialogues,” Proc. SpeechProsody 2024, pp. 667–671, 2024

  37. [37]

    An acoustic-prosodic analysis of laughter types,

    B. Ludusan, M. Schröer, and P. Wagner, “An acoustic-prosodic analysis of laughter types,” Speech Prosody 2024, 2024

  38. [38]

    DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,

    C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890

  39. [39]

    Clap learning audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  40. [40]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518

  41. [41]

    Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,

    Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,” in Proc. Interspeech 2022, 2022, pp. 2063–2067

  42. [42]

    SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

    H. Wang, J. Zhao, Y. Yang, S. Liu, J. Chen, Y. Zhang, S. Zhao, J. Li, J. Zhou, H. Sun et al., “SpeechLLM-as-Judges: Towards general and interpretable speech quality evaluation,” arXiv preprint arXiv:2510.14664, 2025

  43. [43]

    Audio-aware large language models as judges for speaking styles,

    C.-H. Chiang, X. Wang, C.-C. Lin, K. Lin, L. Li, R. Kopetz, Y. Qian, Z. Wang, Z. Yang, H.-y. Lee et al., “Audio-aware large language models as judges for speaking styles,” arXiv preprint arXiv:2506.05984, vol. 7, 2025

  44. [44]

    Qwen3-tts technical report. arXiv preprint arXiv:2601.15621, 2026

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-TTS technical report,” arXiv preprint arXiv:2601.15621, 2026