NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
Pith reviewed 2026-05-10 07:33 UTC · model grok-4.3
The pith
NVBench introduces a unified benchmark with a 45-type taxonomy to evaluate non-verbal vocalizations like laughs and sighs in speech synthesis separately from overall quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework by combining a 45-type taxonomy of non-verbal vocalizations, a bilingual English/Chinese dataset, and a multi-axis protocol that isolates NVV controllability, placement, and salience from general speech quality.
What carries the argument
The multi-axis evaluation protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience.
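One way to picture the separation of axes is a per-system record that keeps each score in its own field instead of collapsing them into a single number. The sketch below is illustrative only: the class name, field names, 0-1 scale, system names, and all values are assumptions made for this example, not NVBench's actual schema or results.

```python
from dataclasses import dataclass

# The four dimensions named in the protocol description.
AXES = ("quality", "controllability", "placement", "salience")


@dataclass(frozen=True)
class AxisScores:
    """Per-system scores, one field per axis (hypothetical 0-1 scale)."""
    system: str
    quality: float
    controllability: float
    placement: float
    salience: float


def rank_by(scores, axis):
    """Return system names ordered best-first on a single axis."""
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis}")
    ordered = sorted(scores, key=lambda s: getattr(s, axis), reverse=True)
    return [s.system for s in ordered]


# Made-up numbers for two placeholder systems, not results from the paper.
scores = [
    AxisScores("system_a", quality=0.90, controllability=0.40,
               placement=0.55, salience=0.50),
    AxisScores("system_b", quality=0.70, controllability=0.85,
               placement=0.80, salience=0.75),
]

for axis in AXES:
    print(axis, rank_by(scores, axis))
```

Keeping the axes apart lets a ranking on general quality disagree with a ranking on an NVV-specific axis, which is the kind of divergence a single averaged score would hide.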
If this is right
- Controllability of non-verbal vocalizations often decouples from general speech quality, allowing targeted improvements.
- Low-SNR oral cues and long-duration affective NVVs remain consistent bottlenecks across systems.
- Objective metrics, listening tests, and LLM-based multi-rater methods can be combined for more reliable assessment.
- Diverse control interfaces in TTS can now be compared directly within one standardized setup.
Where Pith is reading between the lines
- The benchmark could serve as a template for evaluating other expressive elements such as prosody or emotion in synthetic audio.
- Results on English and Chinese data suggest the framework might adapt to additional languages to test cultural differences in vocalization use.
- Integration with existing TTS quality benchmarks could produce more complete evaluation pipelines for end-to-end system development.
Load-bearing premise
The 45-type taxonomy together with the separation of quality from NVV controllability, placement, and salience fully captures the important dimensions of non-verbal vocalization performance without missing key cases or introducing evaluation biases.
What would settle it
A set of human preference ratings or real-world listening scenarios where systems ranked high on the NVBench axes perform poorly on naturalness or expressiveness, or where important vocalization behaviors fall outside the 45 types.
Original abstract
Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVV controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NVBench, a bilingual (English/Chinese) benchmark for speech synthesis with non-verbal vocalizations (NVVs). It defines a unified 45-type taxonomy, curates a corresponding dataset, and proposes a multi-axis evaluation protocol that separates general speech naturalness/quality from NVV-specific controllability, placement, and salience. The authors benchmark 15 TTS systems via objective metrics, listening tests, and LLM-based multi-rater evaluation, reporting that NVV controllability often decouples from overall quality while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. The central claim is that NVBench enables fair, standardized cross-system comparisons across diverse control interfaces.
Significance. If the protocol and dataset prove robust, NVBench would address a clear gap in TTS evaluation by providing a standardized, multi-dimensional framework for non-verbal elements that are essential to natural, human-like speech. The separation of quality from controllability axes is a conceptual strength that could lead to more targeted system improvements than single-score metrics. Benchmarking 15 systems supplies a useful community baseline, and the bilingual scope broadens applicability. The combination of objective, subjective, and LLM-based methods is a positive design choice.
major comments (2)
- [§3] §3 (Benchmark and Dataset Construction): The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.
- [§4] §4 (Evaluation Protocol and Results): The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.
minor comments (3)
- [Table 1] Table 1 or equivalent taxonomy overview: Include one or two concrete audio examples or phonetic descriptions per major NVV category to improve reader intuition without lengthening the text.
- [LLM evaluation] LLM-based evaluation subsection: Report prompt templates, temperature settings, and inter-rater agreement (e.g., Fleiss' kappa) between LLM and human raters to allow assessment of reliability.
- [Figure 3] Figure 3 (system comparison): Ensure axis labels explicitly distinguish the four evaluation dimensions and include error bars or significance markers for the 15-system results.
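The inter-rater agreement statistic mentioned in the LLM evaluation comment, Fleiss' kappa, can be computed from a table of per-item category counts. The following is a minimal sketch of the standard formula, assuming every item is rated by the same number of raters; the function name, the three example categories, and the counts are hypothetical and are not taken from the manuscript.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance from the category shares.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # all ratings in one category: kappa undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical table: 4 clips, 3 raters, categories (laugh, sigh, none).
example_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 1, 2],
]
print(round(fleiss_kappa(example_counts), 3))
```

Fleiss' kappa treats raters as interchangeable, so mixing LLM and human raters in one table measures overall agreement only; comparing the two groups would need separate tables or a different statistic.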
Simulated Author's Rebuttal
We thank the referee for the positive assessment of NVBench and the recommendation for minor revision. The comments are constructive and will help strengthen the transparency of our taxonomy construction and statistical reporting. We respond to each major comment below.
- Referee: [§3] The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.
  Authors: We appreciate the referee's emphasis on methodological transparency. While §3 describes the taxonomy and its motivation, we agree that expanded details on construction will better support the unification claim. In the revised manuscript we will add: (1) explicit selection criteria for the 45 types, grounded in linguistic frequency data, psychological taxonomies of vocal affect, and cross-lingual coverage considerations; (2) inter-annotator agreement statistics (Fleiss' kappa and percentage agreement) from the multi-annotator validation process; and (3) per-language balance tables showing sample counts, duration distributions, and any balancing procedures applied across English and Chinese subsets. These additions will directly address potential selection-bias concerns and allow readers to evaluate the taxonomy's representativeness.
  Revision: yes
- Referee: [§4] The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.
  Authors: We concur that precise statistical documentation is required to substantiate the decoupling result. The observed decoupling was quantified via rank-based correlation between per-system NVV controllability scores and overall quality scores. In the revised manuscript we will explicitly report: the correlation coefficient used (Spearman's ρ), the statistical test and p-value thresholds applied, and 95% confidence intervals for the key correlations as well as for the bottleneck analyses on low-SNR oral cues and long-duration affective NVVs. These details will be added to §4 and the results tables, enabling independent verification while preserving the original interpretation.
  Revision: yes
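The rank-based check described in the response to the §4 comment can be sketched as Spearman's ρ over per-system scores with a percentile bootstrap for the 95% interval. This is a minimal standard-library sketch under stated assumptions: the bootstrap procedure, function names, and the six placeholder score pairs are illustrative choices for this example, not the authors' analysis or data.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling systems in pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman_rho([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Placeholder scores for six hypothetical systems, not values from the paper.
controllability = [0.82, 0.41, 0.67, 0.35, 0.90, 0.58]
quality = [3.9, 4.2, 3.6, 4.0, 3.8, 4.1]

print("rho:", spearman_rho(controllability, quality))
print("95% CI:", bootstrap_ci(controllability, quality))
```

With only 15 systems, an interval of this kind is likely to be wide, so a weak or near-zero ρ on its own would be limited evidence of decoupling; reporting the interval alongside the coefficient makes that uncertainty visible.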
Circularity Check
No significant circularity identified
This is a benchmark paper that defines a new 45-type taxonomy, curates a bilingual dataset, and proposes a multi-axis evaluation protocol separating general quality from NVV controllability, placement, and salience. No mathematical derivations, parameter fits, or predictions are present that could reduce to the authors' own inputs by construction. The central claim—that NVBench enables fair cross-system comparison—is realized directly by the benchmark's design and application to 15 external TTS systems, rather than by any self-referential loop or self-citation chain. Any references to prior TTS or NVV literature serve only as background and are not load-bearing for the framework's validity or results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard objective and subjective speech quality metrics remain valid when non-verbal vocalizations are added to utterances.
Forward citations
Cited by 1 Pith paper
- MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
  MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.