arxiv: 2604.16211 · v2 · submitted 2026-04-17 · 💻 cs.SD

Recognition: unknown

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

Liumeng Xue, Weizhen Bian, Jiahao Pan, Wenxuan Wang, Yilin Ren, Boyi Kang, Jingbin Hu, Ziyang Ma, Shuai Wang, Xinyuan Qian, Hung-yi Lee, Yike Guo

Pith reviewed 2026-05-10 07:33 UTC · model grok-4.3

classification 💻 cs.SD

keywords: non-verbal vocalizations, speech synthesis, TTS benchmark, controllability evaluation, placement and salience, bilingual dataset, multi-axis protocol, audio naturalness


The pith

NVBench introduces a unified benchmark with a 45-type taxonomy to evaluate non-verbal vocalizations like laughs and sighs in speech synthesis separately from overall quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NVBench as a bilingual benchmark that pairs a standardized 45-type taxonomy of non-verbal vocalizations with a curated dataset and a multi-axis evaluation protocol. This setup separates general speech naturalness from specific measures of NVV controllability, correct placement, and salience. By applying the benchmark to 15 TTS systems through objective metrics, listening tests, and LLM-based ratings, the work shows that NVV control often operates independently of speech quality while highlighting persistent difficulties with low-SNR cues and extended affective sounds. A reader would care because without such a shared framework, progress on expressive, human-like synthetic speech remains hard to measure or compare across different system designs.

Core claim

NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework by combining a 45-type taxonomy of non-verbal vocalizations, a bilingual English/Chinese dataset, and a multi-axis protocol that isolates NVV controllability, placement, and salience from general speech quality.

What carries the argument

The multi-axis evaluation protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience.
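
As a rough illustration of what separating the axes buys, the sketch below stores one record per system and ranks the systems on each axis independently. The `AxisScores` class, the `rank_by` helper, the system names, the score scales, and every number are hypothetical placeholders chosen for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    """One system's scores on the four axes (scales are assumed, not the paper's)."""
    system: str
    quality: float          # general speech naturalness and quality
    controllability: float  # was the requested NVV produced at all?
    placement: float        # did it land where it was requested?
    salience: float         # is it clearly audible to a listener?


def rank_by(scores, axis):
    """Return system names ordered from best to worst on a single axis."""
    return [s.system for s in sorted(scores, key=lambda s: getattr(s, axis), reverse=True)]


# Invented placeholder values: a system can lead on quality and trail on NVV control.
scores = [
    AxisScores("sys_a", quality=4.4, controllability=0.35, placement=0.40, salience=0.30),
    AxisScores("sys_b", quality=3.9, controllability=0.80, placement=0.75, salience=0.70),
    AxisScores("sys_c", quality=3.5, controllability=0.60, placement=0.85, salience=0.55),
]

for axis in ("quality", "controllability", "placement", "salience"):
    print(axis, rank_by(scores, axis))
```

With a single blended score these three placeholder systems would collapse into one ordering; keeping the axes apart makes a disagreement between the quality ranking and the controllability ranking visible.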

If this is right

Controllability of non-verbal vocalizations often decouples from general speech quality, allowing targeted improvements.
Low-SNR oral cues and long-duration affective NVVs remain consistent bottlenecks across systems.
Objective metrics, listening tests, and LLM-based multi-rater methods can be combined for more reliable assessment.
Diverse control interfaces in TTS can now be compared directly within one standardized setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could serve as a template for evaluating other expressive elements such as prosody or emotion in synthetic audio.
Results on English and Chinese data suggest the framework might adapt to additional languages to test cultural differences in vocalization use.
Integration with existing TTS quality benchmarks could produce more complete evaluation pipelines for end-to-end system development.

Load-bearing premise

The 45-type taxonomy together with the separation of quality from NVV controllability, placement, and salience fully captures the important dimensions of non-verbal vocalization performance without missing key cases or introducing evaluation biases.

What would settle it

A set of human preference ratings or real-world listening scenarios where systems ranked high on the NVBench axes perform poorly on naturalness or expressiveness, or where important vocalization behaviors fall outside the 45 types.

Figures

Figures reproduced from arXiv: 2604.16211 by Boyi Kang, Hung-yi Lee, Jiahao Pan, Jingbin Hu, Liumeng Xue, Shuai Wang, Weizhen Bian, Wenxuan Wang, Xinyuan Qian, Yike Guo, Yilin Ren, Ziyang Ma.

**Figure 1.** The overview of NVBench. … representative TTS systems (8 tag-based systems and 7 prompt-based systems), measured via objective metrics, human listening tests, and an LLM-based multi-rater evaluation. Results indicate that NVV controllability often decouples from overall speech quality and differs markedly across control interfaces, and highlight low-SNR oral cues and long-duration affective NVVs as persist…

**Figure 2.** NVV perceptual effect heatmaps for EN (left) and ZH (right) under tag-based (upper) and prompt-based (bottom) TTS systems. For prompt-based TTS, we synthesize paired samples from NVV-aware captions and their neutral counterparts, where NVV cues are removed while preserving the same propositional content. For tag-based TTS, we synthesize paired samples from plain text and the same text with inserted NVV t…
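
The paired design described in the Figure 2 caption can be sketched for the tag-based case as follows. This is only an illustration under stated assumptions: the bracketed tag syntax (`[laugh]`), the helper names `strip_nvv_tags` and `mean_paired_effect`, and the idea of summarizing a pair by a score difference are all hypothetical. The caption is truncated and does not say which quantity the heatmaps actually show.

```python
import re
from statistics import mean

# Hypothetical inline tag syntax such as "[laugh]"; real systems use their own formats.
TAG_PATTERN = re.compile(r"\s*\[[a-z_]+\]\s*")


def strip_nvv_tags(text: str) -> str:
    """Build the plain-text counterpart of a tagged sentence."""
    plain = TAG_PATTERN.sub(" ", text).strip()
    # Remove any space left in front of punctuation after a tag is dropped.
    return re.sub(r"\s+([.,!?])", r"\1", plain)


def mean_paired_effect(with_nvv, without_nvv):
    """Average per-pair difference between scores for tagged and plain samples."""
    if len(with_nvv) != len(without_nvv):
        raise ValueError("paired score lists must have the same length")
    return mean(a - b for a, b in zip(with_nvv, without_nvv))


print(strip_nvv_tags("Well [laugh] I guess so."))
```

Because both members of a pair share the same words, any difference in listener or judge scores can be attributed to the NVV cue rather than to the lexical content.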

Original abstract

Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVVs controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NVBench gives the TTS field a new bilingual benchmark with a 45-type NVV taxonomy and a protocol that separates controllability from general quality, which is a practical step forward even if the validation details need more scrutiny.


NVBench is a benchmark paper that sets up a standardized way to test speech synthesis systems on non-verbal vocalizations. The key points are a 45-type taxonomy for these sounds, a bilingual dataset in English and Chinese, and an evaluation that looks at controllability, placement, and salience apart from general speech quality. The paper applies this to 15 different TTS systems and finds that NVV handling often doesn't line up with overall quality scores. It highlights persistent issues with low signal-to-noise oral cues and longer affective sounds. This kind of breakdown is helpful because it gives developers specific targets rather than just overall scores.

What works well here is the multi-axis protocol. It lets you compare systems that use very different control methods under one framework, which prior benchmarks didn't do as cleanly for non-verbal elements. The mix of objective metrics, listening tests, and LLM-based ratings covers several bases for a benchmark.

The soft spots are in the level of detail provided. The abstract mentions the results but doesn't include dataset sizes, how the taxonomy was built or tested for coverage, or full statistical analysis of the decoupling. The LLM evaluation is a potential weak point if prompt choices affect the outcomes, though the paper doesn't appear to have built its claims around its own fitted models. If the full paper has those numbers and agreements, it strengthens the case.

This is aimed at the speech synthesis community, particularly people working on expressive or conversational TTS for agents and media. A reader looking for ways to measure and improve non-verbal performance will find the dataset and protocol valuable to try out. It deserves a serious referee because it supplies a new tool for the field with initial comparative data that can be iterated on. I would recommend sending it for peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces NVBench, a bilingual (English/Chinese) benchmark for speech synthesis with non-verbal vocalizations (NVVs). It defines a unified 45-type taxonomy, curates a corresponding dataset, and proposes a multi-axis evaluation protocol that separates general speech naturalness/quality from NVV-specific controllability, placement, and salience. The authors benchmark 15 TTS systems via objective metrics, listening tests, and LLM-based multi-rater evaluation, reporting that NVV controllability often decouples from overall quality while low-SNR oral cues and long-duration affective NVVs remain bottlenecks. The central claim is that NVBench enables fair, standardized cross-system comparisons across diverse control interfaces.

Significance. If the protocol and dataset prove robust, NVBench would address a clear gap in TTS evaluation by providing a standardized, multi-dimensional framework for non-verbal elements that are essential to natural, human-like speech. The separation of quality from controllability axes is a conceptual strength that could lead to more targeted system improvements than single-score metrics. Benchmarking 15 systems supplies a useful community baseline, and the bilingual scope broadens applicability. The combination of objective, subjective, and LLM-based methods is a positive design choice.

major comments (2)

[§3] §3 (Benchmark and Dataset Construction): The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.
[§4] §4 (Evaluation Protocol and Results): The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.

minor comments (3)

[Table 1] Table 1 or equivalent taxonomy overview: Include one or two concrete audio examples or phonetic descriptions per major NVV category to improve reader intuition without lengthening the text.
[LLM evaluation] LLM-based evaluation subsection: Report prompt templates, temperature settings, and inter-rater agreement (e.g., Fleiss' kappa) between LLM and human raters to allow assessment of reliability.
[Figure 3] Figure 3 (system comparison): Ensure axis labels explicitly distinguish the four evaluation dimensions and include error bars or significance markers for the 15-system results.
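
The agreement statistic requested in the LLM-evaluation comment can be computed from a table of rating counts. The following is a minimal, dependency-free sketch of Fleiss' kappa. The function name and the input layout are illustrative choices, and nothing here reflects how the paper's raters were actually scored.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for nominal ratings.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings (>= 2)")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)
```

For a mixed panel, each row would count how many raters (human or LLM) chose each label for one audio sample. Fleiss' kappa treats categories as unordered, so ordinal rating scales may call for a different statistic.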

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of NVBench and the recommendation for minor revision. The comments are constructive and will help strengthen the transparency of our taxonomy construction and statistical reporting. We respond to each major comment below.


Referee: [§3] The 45-type taxonomy and multi-axis separation are load-bearing for the unification claim, yet the manuscript provides insufficient detail on type selection criteria, inter-annotator validation, and balance statistics across English/Chinese samples. Without these, it is difficult to confirm that the taxonomy fully captures relevant NVV dimensions or avoids selection bias that could affect cross-system fairness.

Authors: We appreciate the referee's emphasis on methodological transparency. While §3 describes the taxonomy and its motivation, we agree that expanded details on construction will better support the unification claim. In the revised manuscript we will add: (1) explicit selection criteria for the 45 types, grounded in linguistic frequency data, psychological taxonomies of vocal affect, and cross-lingual coverage considerations; (2) inter-annotator agreement statistics (Fleiss' kappa and percentage agreement) from the multi-annotator validation process; and (3) per-language balance tables showing sample counts, duration distributions, and any balancing procedures applied across English and Chinese subsets. These additions will directly address potential selection-bias concerns and allow readers to evaluate the taxonomy's representativeness.
Revision: yes
Referee: [§4] The claim that 'NVV controllability often decouples from quality' is central to the findings, but the abstract and methods summary do not specify the exact correlation measures, statistical tests, or confidence intervals used to establish decoupling. This leaves the robustness of the result open to verification and weakens the interpretation of bottlenecks such as low-SNR cues.

Authors: We concur that precise statistical documentation is required to substantiate the decoupling result. The observed decoupling was quantified via rank-based correlation between per-system NVV controllability scores and overall quality scores. In the revised manuscript we will explicitly report: the correlation coefficient used (Spearman's ρ), the statistical test and p-value thresholds applied, and 95% confidence intervals for the key correlations as well as for the bottleneck analyses on low-SNR oral cues and long-duration affective NVVs. These details will be added to §4 and the results tables, enabling independent verification while preserving the original interpretation.
Revision: yes
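
The rank-correlation check named in this response can be sketched in plain Python. The code below computes Spearman's ρ (Pearson correlation on tie-averaged ranks) and a percentile bootstrap interval over systems. The bootstrap is an assumption made for illustration: the rebuttal does not say how its confidence intervals would be obtained, and all function names here are hypothetical.

```python
import random
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant input has no defined correlation
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman_rho([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper
```

In use, `x` would hold one quality score per system and `y` the matching controllability score. With only 15 systems the interval is likely to be wide, so a ρ near zero would be consistent with decoupling without ruling out a moderate association.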

Circularity Check

0 steps flagged

No significant circularity identified


This is a benchmark paper that defines a new 45-type taxonomy, curates a bilingual dataset, and proposes a multi-axis evaluation protocol separating general quality from NVV controllability, placement, and salience. No mathematical derivations, parameter fits, or predictions are present that could reduce to the authors' own inputs by construction. The central claim—that NVBench enables fair cross-system comparison—is realized directly by the benchmark's design and application to 15 external TTS systems, rather than by any self-referential loop or self-citation chain. Any references to prior TTS or NVV literature serve only as background and are not load-bearing for the framework's validity or results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an applied benchmark paper whose central contribution is dataset curation and protocol design rather than mathematical derivation; it relies on standard speech quality metrics extended to NVVs and on the assumption that human and LLM raters can reliably separate the targeted dimensions.

axioms (1)

Domain assumption: Standard objective and subjective speech quality metrics remain valid when non-verbal vocalizations are added to utterances.
The multi-axis protocol treats general naturalness as separable from NVV-specific scores, inheriting this separation from prior TTS evaluation practice.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech
eess.AS 2026-04 unverdicted novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Introduction. Text-to-speech (TTS) has progressed rapidly from intelligible speech to expressive speech generation. Recent large-scale speech language modeling and codec-based generation paradigms have further improved perceptual quality, speaker similarity, and controllability, pushing synthetic speech toward increasingly human-like delivery [1, 2, 3]....

[2]

Non-verbal Vocalization Benchmark. 2.1. Benchmark Overview. We introduce the Non-verbal Vocalization Benchmark (NVBench), a standardized evaluation suite for assessing a TTS system's ability to synthesize non-verbal vocalizations (NVVs) beyond lexical content. The overview of NVBench is presented in Figure 1. Given an input utterance, NVBench supports two...

[3]

TTS systems. Existing TTS systems that support speech generation with NVVs can be divided into two categories: prompt-based and tag-based systems. We benchmark 15 TTS systems, including 7 prompt-based and 8 tag-based systems, offering a diverse evaluation spectrum. Commercial models demonstrate strong industrial performance, while open-source systems e...

[4]

Results and Analysis. 4.1. Objective Results: Robust Trends and NVV-Specific Confounders. The objective results for both prompt-based and tag-based systems are summarized in Table 3. Across three independent synthesis runs, most measures exhibit small run-to-run variance, indicating that the systems are stable. Prompt-based objective results. Results reve...

[5]

Conclusion. In this work, we introduce NVBench, a bilingual benchmark for evaluating NVV-capable speech synthesis. NVBench covers a unified 45-type NVV taxonomy and a multi-axis evaluation protocol, which separates general speech naturalness and quality from NVV controllability and perceptual salience. We benchmark 15 TTS systems via objective metrics,...

[6]

Generative AI Use Disclosure. We used large language models (LLMs) to assist three components of this work. First, LLMs were used in benchmark dataset generation to draft candidate texts and speech captions, which were then reviewed, filtered, and finalized by the authors. Second, LLMs were used to support evaluation TTS systems through LLM-based judging...