ISCSLP 2026 CoT-TTS Challenge: Chain-of-Thought Reasoning for Context-Aware Text-to-Speech

Bei Liu; Bin Long; Boyi Kang; Jiahao Pan; Junlan Feng; Liumeng Xue; Ruosong Yang; Shilei Zhang; Sitong Cheng; Wei Xue

arxiv: 2606.21933 · v1 · pith:DIMANKXSnew · submitted 2026-06-20 · 💻 cs.SD

ISCSLP 2026 CoT-TTS Challenge: Chain-of-Thought Reasoning for Context-Aware Text-to-Speech

Wei Xue , Junlan Feng , Shilei Zhang , Yue Wang , Ruosong Yang , Bei Liu , Liumeng Xue , Sitong Cheng

show 4 more authors

Jiahao Pan Weizhen Bian Boyi Kang Bin Long

This is my paper

Pith reviewed 2026-06-26 11:32 UTC · model grok-4.3

classification 💻 cs.SD

keywords text-to-speechchain-of-thought reasoningcontext-aware TTSconversational speech synthesisexpressive speech generationTTS challengemultimodal evaluation

0 comments

The pith

The ISCSLP 2026 CoT-TTS Challenge requires text-to-speech systems to produce explicit chain-of-thought reasoning about context before generating speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a challenge to test whether TTS systems can automatically determine appropriate speaking styles from surrounding conversational context instead of relying on explicit style prompts. It defines two tracks—one using text context and one using audio context—where each submitted system must output both a step-by-step reasoning trace and the final speech waveform. A large bilingual training set drawn from media sources is supplied along with filtered evaluation data. Evaluation combines objective scores, multimodal LLM judgments, and human listening tests. A baseline fine-tuned 0.6B model and training recipe are released to lower the entry barrier for participants working on film dubbing, audiobooks, and dialogue agents.

Core claim

The paper establishes the CoT-TTS Challenge as a benchmark in which systems must infer intended speaking manner from either textual or audio context, output an explicit chain-of-thought analysis of that inference, and then synthesize speech that remains consistent with both the analysis and the surrounding scene; two separate tracks, a bilingual training corpus, and a three-way evaluation protocol (objective metrics, multimodal LLM scoring, and human assessment) are defined to measure progress toward context-aware expressive speech generation.

What carries the argument

The CoT-TTS requirement that each system output both an explicit chain-of-thought reasoning trace and the generated waveform, applied across text-context and audio-context tracks.

If this is right

Successful systems must combine context understanding with waveform generation rather than treating style as an independent input.
The released bilingual training data and three-stage fine-tuning recipe will allow rapid iteration on models that jointly reason and synthesize speech.
Leaderboard results will establish quantitative baselines for how well current architectures handle implicit style inference in long dialogues.
Improved performance should directly benefit downstream tasks such as film dubbing and spoken dialogue agents that currently require manual style specification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The challenge format could be extended to test whether reasoning traces transfer across languages or to new acoustic environments not seen in training.
If the evaluation protocol proves stable, similar CoT requirements might be applied to other generative audio tasks such as music or sound-effect synthesis from scene descriptions.
The separation of text-context and audio-context tracks offers a controlled comparison of how much additional information audio cues provide for style inference.

Load-bearing premise

That the filtered evaluation set together with the combination of objective metrics, multimodal LLM scoring, and human judgments will reliably isolate and measure genuine context-aware reasoning ability rather than dataset-specific patterns.

What would settle it

Top-ranked systems on the challenge leaderboard produce speech that listeners rate as inconsistent with the intended manner when tested on new conversational scenes drawn from the same media domains but excluded from the official evaluation set.

Figures

Figures reproduced from arXiv: 2606.21933 by Bei Liu, Bin Long, Boyi Kang, Jiahao Pan, Junlan Feng, Liumeng Xue, Ruosong Yang, Shilei Zhang, Sitong Cheng, Wei Xue, Weizhen Bian, Yue Wang.

**Figure 1.** Figure 1: Overview of the initial training data construction pipeline. Raw media are converted and normalized, processed into utterance- and scene-level segments, annotated with emotion and reasoning information, and finally organized into CoTTTS training samples. After the initial construction, we refine the dataset through sample-level filtering. We extract audio-level features, including segment duration, effe… view at source ↗

**Figure 2.** Figure 2: Overview of the evaluation data filtering pipeline. The initial candidate pool is progressively filtered through basic audio and context checks, target-audio quality assessment, similarity-based deduplication, LLM-based text quality scoring, audio–CoT consistency checking, and final human assessment. with multiple speakers in the target audio are removed. After automatic filtering, we further conduct huma… view at source ↗

read the original abstract

Recent advances in text-to-speech (TTS) have greatly improved speech naturalness, speaker similarity, and controllability. However, most existing controllable TTS systems still rely on explicit user-provided style prompts, making it difficult to automatically determine how a sentence should be spoken in long and complex conversational scenarios. This proposal introduces the ISCSLP 2026 CoT-TTS Challenge, which aims to evaluate whether a system can infer the intended speaking manner from contextual information and generate speech consistent with both the reasoning output and the surrounding scene. The challenge contains two tracks: text-context-aware CoT-TTS and audio-context-aware CoT-TTS. We construct a large-scale bilingual training set from speech-rich media and provide carefully filtered evaluation data for leaderboard comparison. Each system is required to output both a chain-of-thought reasoning analysis and the generated speech waveform. The official evaluation combines objective metrics, multimodal LLM-based evaluation, and human subjective assessment. To facilitate reproducibility, we provide inference code together with a fine-tuning recipe for a 0.6B Qwen3-based model trained via a three-stage strategy. This challenge is expected to support research on context understanding, chain-of-thought reasoning, and expressive speech generation for applications such as film dubbing, audiobook production, virtual characters, and spoken dialogue agents. Further information about the associated challenge is available at:https://iscslp2026-cot-tts.github.io/challenge-website/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a challenge proposal that sets up tracks and data for context-aware TTS with explicit reasoning steps but contains no new methods or results.

read the letter

This document announces the ISCSLP 2026 CoT-TTS Challenge rather than reporting new research. It defines two tracks for inferring speaking style from text or audio context, requires systems to output both chain-of-thought reasoning and the waveform, and supplies a bilingual training set from media plus filtered evaluation data.

The organizers provide a 0.6B Qwen3 baseline with a three-stage fine-tuning recipe and inference code, which supports reproducibility. The evaluation plan combines objective metrics, multimodal LLM scoring, and human tests. This framing pulls together existing ideas on controllable TTS and context modeling into a shared task, which could help coordinate work on applications like dubbing and dialogue agents.

The proposal is clearly written and practical in its infrastructure. No mathematical claims or empirical results appear, so there is nothing to verify for correctness. The main soft spot is that the evaluation protocol is asserted without evidence that the LLM component or the required reasoning output will reliably measure the target capability; that remains to be seen once the challenge runs.

This is useful for groups planning to enter the challenge or use the released data and baseline. It shows straightforward organization of the problem but does not advance the underlying techniques. A serious editor should send it to peer review if the venue accepts challenge descriptions, since the setup could usefully focus community effort even if the evaluation details need refinement.

Referee Report

0 major / 1 minor

Summary. The manuscript proposes the ISCSLP 2026 CoT-TTS Challenge consisting of two tracks (text-context-aware CoT-TTS and audio-context-aware CoT-TTS). It supplies a large-scale bilingual training set derived from speech-rich media, filtered evaluation data, and an evaluation protocol that combines objective metrics, multimodal LLM-based assessment, and human subjective listening tests. Systems must produce both chain-of-thought reasoning and the corresponding speech waveform. A reproducibility baseline is supplied in the form of inference code and a three-stage fine-tuning recipe for a 0.6B Qwen3 model.

Significance. If executed as described, the challenge could stimulate research on inferring expressive speaking style from conversational context rather than explicit prompts, with downstream relevance to dubbing, audiobooks, and dialogue agents. The explicit provision of training data, evaluation code, and a documented baseline constitutes a concrete contribution to reproducibility that strengthens the proposal.

minor comments (1)

[Abstract] Abstract, final sentence: the URL is written without a space after 'at:' ('at:https://...').

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the manuscript. The comments confirm that the challenge proposal, data provision, evaluation protocol, and reproducibility baseline are viewed as a concrete contribution.

Circularity Check

0 steps flagged

No significant circularity: challenge proposal with no derivations

full rationale

This document is a challenge proposal that defines tracks, evaluation protocols, and a training recipe without any quantitative claims, derivations, equations, or predictions. No load-bearing steps exist that could reduce to fitted inputs, self-definitions, or self-citations. The central statements are definitional descriptions of the intended evaluation setup, which are self-contained by construction and contain no testable assertions that could exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The document introduces no mathematical models, free parameters, axioms, or invented entities; it is an organizational proposal for a competition.

pith-pipeline@v0.9.1-grok · 5837 in / 1028 out tokens · 25998 ms · 2026-06-26T11:32:09.921895+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 9 linked inside Pith

[1]

Session/Challenge Organizers The proposed session organizers are listed below, together with their contact information and brief biographies where available for reference and further communication. Wei Xue, The Hong Kong University of Science and Technol- ogy Email:weixue@ust.hk Brief biography:Wei Xue is an Assistant Professor at the Hong Kong University...

Pith/arXiv arXiv 2026
[2]

Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis

Background, Motivation, and Potential Impact Recent advances in large language models have expanded the capabilities of text-to-speech (TTS) systems [1]. Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis. Natural-language style prompts have been used for expressive T...

2026
[3]

The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing

Tentative List of Authors / Teams Who Could Contribute Papers We expect this challenge to attract participating teams from both academia and industry. The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing. The fol- lowing list represents potential contrib...
[4]

The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline

Tentative Challenge Timeline The tentative timeline is designed to follow the official ISCSLP 2026 schedule. The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline. The evaluation stage will take place before the paper acceptance notification, so that the final ranking...

2026
[5]

The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results

Planned Format of the Challenge Session The proposed challenge session will be organized as a two-hour event. The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results. Representatives from the winning teams in each track and model category will then be invited t...
[6]

They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing

Recommended Expert Reviewers The following researchers are recommended as potential expert reviewers for papers submitted to this challenge session. They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing. Table 4:Tentative list of recommended reviewers. Name Affiliation Xie Chen Shanghai J...
[7]

Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline

Additional or Non-Standard Resources Needed 7.1. Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline. We first select suitable source materials and convert them into normalized audio. The audio is then processed to obtain utterance-level information such as speaker labels, timestamps, and spoken...
[8]

speaker-id: spoken text

Challenge Description This challenge focuses on context-aware CoT-TTS. Given con- textual information, a reference speech sample, and a target text, a system is expected to infer the appropriate speaking style of the target sentence, produce a chain-of-thought reasoning anal- ysis, and generate speech with the speaker timbre specified by the reference spe...
[9]

Towards con- trollable speech synthesis in the era of large language models: A systematic survey,

T. Xie, Y . Rong, P. Zhang, W. Wang, and L. Liu, “Towards con- trollable speech synthesis in the era of large language models: A systematic survey,” inProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, 2025, pp. 764– 791

2025
[10]

Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,

D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2913–2925, 2024

2024
[11]

Fish audio s2 technical report,

S. Liao, Y . Wang, S. Liu, Y . Cheng, R. Zhang, T. Li, S. Li, Y . Zheng, X. Liu, Q. Wanget al., “Fish audio s2 technical report,” arXiv preprint arXiv:2603.08823, 2026

arXiv 2026
[12]

V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,

Y . Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Liet al., “V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,”arXiv preprint arXiv:2509.24650, 2025

arXiv 2025
[13]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 41, 2026, pp. 35 139–35 148

2026
[14]

Borderless long speech synthesis,

X. Song, D. Wu, D. Zhou, P. Cheng, H. Ding, Y . He, J. Wang, S. Shen, S. Lv, L. Fanet al., “Borderless long speech synthesis,” arXiv preprint arXiv:2603.19798, 2026

Pith/arXiv arXiv 2026
[15]

Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,

H. Xie, H. Lin, W. Cao, D. Guo, W. Tian, J. Wu, H. Wen, R. Shang, H. Liu, Z. Jianget al., “Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,” arXiv preprint arXiv:2510.23541, 2025

arXiv 2025
[16]

Duplexcascade: Full- duplex speech-to-speech dialogue with vad-free cascaded asr- llm-tts pipeline and micro-turn optimization,

J. Yang, Y . Fujita, and Y . Sudo, “Duplexcascade: Full- duplex speech-to-speech dialogue with vad-free cascaded asr- llm-tts pipeline and micro-turn optimization,”arXiv preprint arXiv:2603.09180, 2026

arXiv 2026
[17]

Style- talker: Finetuning audio language model and style-based text-to- speech model for fast spoken dialogue generation,

Y . A. Li, X. Jiang, J. Darefsky, G. Zhu, and N. Mesgarani, “Style- talker: Finetuning audio language model and style-based text-to- speech model for fast spoken dialogue generation,”arXiv preprint arXiv:2408.11849, 2024

arXiv 2024
[18]

M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,

J. Xue, Y . Deng, F. Wang, Y . Li, Y . Gao, J. Tao, J. Sun, and J. Liang, “M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,” inICASSP 2023-2023 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2023, pp. 1–5

2023
[19]

H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech

D. Seong and J.-H. Chang, “H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech.” inINTER- SPEECH, 2024

2024
[20]

Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,

W. Wu, Z. Lin, Y . Zhou, J. Li, R. Niu, Q. Wu, S. Cao, L. Ma, and Z. Wu, “Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,” inICASSP 2025-2025 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2025, pp. 1–5

2025
[21]

Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,

Y . Deng, J. Xue, F. Wang, Y . Gao, and Y . Li, “Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 6081–6089

2023
[22]

Actormind: Emulating hu- man actor reasoning for speech role-playing,

X. Chen, W. Xue, and Y . Guo, “Actormind: Emulating hu- man actor reasoning for speech role-playing,”arXiv preprint arXiv:2604.11103, 2026

Pith/arXiv arXiv 2026
[23]

A preliminary exploration with gpt-4o voice mode,

Y .-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with gpt-4o voice mode,”arXiv preprint arXiv:2502.09940, 2025

arXiv 2025
[24]

Gpt-4o system card,

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[25]

Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,

X. Su, Z. Sun, P. Jia, and J. Gao, “Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,”arXiv preprint arXiv:2604.08363, 2026

Pith/arXiv arXiv 2026
[26]

Dailydialog: A manually labelled multi-turn dialogue dataset,

Y . Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” inProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 986–995

2017
[27]

Dailytalk: Spoken dialogue dataset for conversational text-to-speech,

K. Lee, K. Park, and D. Kim, “Dailytalk: Spoken dialogue dataset for conversational text-to-speech,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP). IEEE, 2023, pp. 1–5

2023
[28]

Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,

R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,” inProceedings of the AAAI Conference on Ar- tificial Intelligence, vol. 38, no. 17, 2024, pp. 18 698–18 706

2024
[29]

Prompt-guided selective masking loss for context-aware emotive text-to-speech,

Y . Jeon, Y . Kim, J. Lee, and G. Lee, “Prompt-guided selective masking loss for context-aware emotive text-to-speech,” inFind- ings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 638–650

2025
[30]

Generative expressive conversational speech synthesis,

R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Generative expressive conversational speech synthesis,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4187–4196

2024
[31]

Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,

W. Bian, Y . Zhou, K. Zhang, and X. Gu, “Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,” in2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2024, pp. 417–420

2024
[32]

Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,

C. Cheng, H. Sun, B. Du, S. Shang, X. Hu, and R. Yan, “Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 18 599–18 616

2025
[33]

But system description to voxceleb speaker recognition chal- lenge 2019,

H. Zeinali, S. Wang, A. Silnova, P. Mat ˇejka, and O. Plchot, “But system description to voxceleb speaker recognition chal- lenge 2019,”arXiv preprint arXiv:1910.12592, 2019

arXiv 2019
[34]

Wespeaker: A research and production oriented speaker embedding learning toolkit,

H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y . Deng, and Y . Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” inIEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[35]

pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,

H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,” inProc. INTERSPEECH 2023, 2023

2023
[36]

Powerset multi-class cross entropy loss for neural speaker diarization,

A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” inProc. INTERSPEECH 2023, 2023

2023
[37]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948

Pith/arXiv arXiv 2025
[38]

Funasr: A fundamental end-to-end speech recog- nition toolkit,

Z. Gaoet al., “Funasr: A fundamental end-to-end speech recog- nition toolkit,” inINTERSPEECH, 2023

2023
[39]

Dawn of the trans- former era in speech emotion recognition: closing the valence gap,

J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the trans- former era in speech emotion recognition: closing the valence gap,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 9, pp. 10 745–10 759, 2023

2023
[40]

Qwen3-asr technical report,

X. Shi, X. Wang, Z. Guo, Y . Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y . Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-asr technical report,”arXiv preprint arXiv:2601.21337, 2026

Pith/arXiv arXiv 2026
[41]

Uniss: Unified expressive speech-to-speech transla- tion with your voice,

S. Cheng, W. Bian, X. Wang, R. Yuan, J. Chen, S. Yin, Y . Guo, and W. Xue, “Uniss: Unified expressive speech-to-speech transla- tion with your voice,”arXiv preprint arXiv:2509.21144, 2025

arXiv 2025
[42]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025
[43]

Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025

Pith/arXiv arXiv 2025
[44]

Distillation and pruning for scalable self-supervised representation-based speech quality assessment,

B. Stahl and H. Gamper, “Distillation and pruning for scalable self-supervised representation-based speech quality assessment,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. [Online]. Available: https://arxiv.org/abs/2502.05356

arXiv 2025
[45]

The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,

K. Baba, W. Nakata, Y . Saito, and H. Saruwatari, “The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” inIEEE Spoken Language Technology Work- shop (SLT), 2024

2024
[46]

Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,” inICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890

2022
[47]

From wer and ril to mer and wil: improved evaluation measures for connected speech recognition

A. C. Morris, V . Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition.” inInterspeech, no. 4-8, 2004, p. 2004. A. Web Interface and Scoring Criteria for Human Quality Assessment B. Data Copyright Statement The challenge dataset is prepared for academic research and evaluation purposes onl...

2004

[1] [1]

Session/Challenge Organizers The proposed session organizers are listed below, together with their contact information and brief biographies where available for reference and further communication. Wei Xue, The Hong Kong University of Science and Technol- ogy Email:weixue@ust.hk Brief biography:Wei Xue is an Assistant Professor at the Hong Kong University...

Pith/arXiv arXiv 2026

[2] [2]

Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis

Background, Motivation, and Potential Impact Recent advances in large language models have expanded the capabilities of text-to-speech (TTS) systems [1]. Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis. Natural-language style prompts have been used for expressive T...

2026

[3] [3]

The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing

Tentative List of Authors / Teams Who Could Contribute Papers We expect this challenge to attract participating teams from both academia and industry. The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing. The fol- lowing list represents potential contrib...

[4] [4]

The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline

Tentative Challenge Timeline The tentative timeline is designed to follow the official ISCSLP 2026 schedule. The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline. The evaluation stage will take place before the paper acceptance notification, so that the final ranking...

2026

[5] [5]

The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results

Planned Format of the Challenge Session The proposed challenge session will be organized as a two-hour event. The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results. Representatives from the winning teams in each track and model category will then be invited t...

[6] [6]

They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing

Recommended Expert Reviewers The following researchers are recommended as potential expert reviewers for papers submitted to this challenge session. They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing. Table 4:Tentative list of recommended reviewers. Name Affiliation Xie Chen Shanghai J...

[7] [7]

Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline

Additional or Non-Standard Resources Needed 7.1. Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline. We first select suitable source materials and convert them into normalized audio. The audio is then processed to obtain utterance-level information such as speaker labels, timestamps, and spoken...

[8] [8]

speaker-id: spoken text

Challenge Description This challenge focuses on context-aware CoT-TTS. Given con- textual information, a reference speech sample, and a target text, a system is expected to infer the appropriate speaking style of the target sentence, produce a chain-of-thought reasoning anal- ysis, and generate speech with the speaker timbre specified by the reference spe...

[9] [9]

Towards con- trollable speech synthesis in the era of large language models: A systematic survey,

T. Xie, Y . Rong, P. Zhang, W. Wang, and L. Liu, “Towards con- trollable speech synthesis in the era of large language models: A systematic survey,” inProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, 2025, pp. 764– 791

2025

[10] [10]

Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,

D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2913–2925, 2024

2024

[11] [11]

Fish audio s2 technical report,

S. Liao, Y . Wang, S. Liu, Y . Cheng, R. Zhang, T. Li, S. Li, Y . Zheng, X. Liu, Q. Wanget al., “Fish audio s2 technical report,” arXiv preprint arXiv:2603.08823, 2026

arXiv 2026

[12] [12]

V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,

Y . Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Liet al., “V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,”arXiv preprint arXiv:2509.24650, 2025

arXiv 2025

[13] [13]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 41, 2026, pp. 35 139–35 148

2026

[14] [14]

Borderless long speech synthesis,

X. Song, D. Wu, D. Zhou, P. Cheng, H. Ding, Y . He, J. Wang, S. Shen, S. Lv, L. Fanet al., “Borderless long speech synthesis,” arXiv preprint arXiv:2603.19798, 2026

Pith/arXiv arXiv 2026

[15] [15]

Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,

H. Xie, H. Lin, W. Cao, D. Guo, W. Tian, J. Wu, H. Wen, R. Shang, H. Liu, Z. Jianget al., “Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,” arXiv preprint arXiv:2510.23541, 2025

arXiv 2025

[16] [16]

Duplexcascade: Full- duplex speech-to-speech dialogue with vad-free cascaded asr- llm-tts pipeline and micro-turn optimization,

J. Yang, Y . Fujita, and Y . Sudo, “Duplexcascade: Full- duplex speech-to-speech dialogue with vad-free cascaded asr- llm-tts pipeline and micro-turn optimization,”arXiv preprint arXiv:2603.09180, 2026

arXiv 2026

[17] [17]

Style- talker: Finetuning audio language model and style-based text-to- speech model for fast spoken dialogue generation,

Y . A. Li, X. Jiang, J. Darefsky, G. Zhu, and N. Mesgarani, “Style- talker: Finetuning audio language model and style-based text-to- speech model for fast spoken dialogue generation,”arXiv preprint arXiv:2408.11849, 2024

arXiv 2024

[18] [18]

M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,

J. Xue, Y . Deng, F. Wang, Y . Li, Y . Gao, J. Tao, J. Sun, and J. Liang, “M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,” inICASSP 2023-2023 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2023, pp. 1–5

2023

[19] [19]

H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech

D. Seong and J.-H. Chang, “H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech.” inINTER- SPEECH, 2024

2024

[20] [20]

Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,

W. Wu, Z. Lin, Y . Zhou, J. Li, R. Niu, Q. Wu, S. Cao, L. Ma, and Z. Wu, “Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,” inICASSP 2025-2025 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2025, pp. 1–5

2025

[21] [21]

Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,

Y . Deng, J. Xue, F. Wang, Y . Gao, and Y . Li, “Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 6081–6089

2023

[22] [22]

Actormind: Emulating hu- man actor reasoning for speech role-playing,

X. Chen, W. Xue, and Y . Guo, “Actormind: Emulating hu- man actor reasoning for speech role-playing,”arXiv preprint arXiv:2604.11103, 2026

Pith/arXiv arXiv 2026

[23] [23]

A preliminary exploration with gpt-4o voice mode,

Y .-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with gpt-4o voice mode,”arXiv preprint arXiv:2502.09940, 2025

arXiv 2025

[24] [24]

Gpt-4o system card,

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[25] [25]

Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,

X. Su, Z. Sun, P. Jia, and J. Gao, “Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,”arXiv preprint arXiv:2604.08363, 2026

Pith/arXiv arXiv 2026

[26] [26]

Dailydialog: A manually labelled multi-turn dialogue dataset,

Y . Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” inProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 986–995

2017

[27] [27]

Dailytalk: Spoken dialogue dataset for conversational text-to-speech,

K. Lee, K. Park, and D. Kim, “Dailytalk: Spoken dialogue dataset for conversational text-to-speech,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP). IEEE, 2023, pp. 1–5

2023

[28] [28]

Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,

R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,” inProceedings of the AAAI Conference on Ar- tificial Intelligence, vol. 38, no. 17, 2024, pp. 18 698–18 706

2024

[29] [29]

Prompt-guided selective masking loss for context-aware emotive text-to-speech,

Y . Jeon, Y . Kim, J. Lee, and G. Lee, “Prompt-guided selective masking loss for context-aware emotive text-to-speech,” inFind- ings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 638–650

2025

[30] [30]

Generative expressive conversational speech synthesis,

R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Generative expressive conversational speech synthesis,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4187–4196

2024

[31] [31]

Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,

W. Bian, Y . Zhou, K. Zhang, and X. Gu, “Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,” in2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2024, pp. 417–420

2024

[32] [32]

Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,

C. Cheng, H. Sun, B. Du, S. Shang, X. Hu, and R. Yan, “Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 18 599–18 616

2025

[33] [33]

But system description to voxceleb speaker recognition chal- lenge 2019,

H. Zeinali, S. Wang, A. Silnova, P. Mat ˇejka, and O. Plchot, “But system description to voxceleb speaker recognition chal- lenge 2019,”arXiv preprint arXiv:1910.12592, 2019

arXiv 2019

[34] [34]

Wespeaker: A research and production oriented speaker embedding learning toolkit,

H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y . Deng, and Y . Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” inIEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[35] [35]

pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,

H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,” inProc. INTERSPEECH 2023, 2023

2023

[36] [36]

Powerset multi-class cross entropy loss for neural speaker diarization,

A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” inProc. INTERSPEECH 2023, 2023

2023

[37] [37]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948

Pith/arXiv arXiv 2025

[38] [38]

Funasr: A fundamental end-to-end speech recog- nition toolkit,

Z. Gaoet al., “Funasr: A fundamental end-to-end speech recog- nition toolkit,” inINTERSPEECH, 2023

2023

[39] [39]

Dawn of the trans- former era in speech emotion recognition: closing the valence gap,

J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the trans- former era in speech emotion recognition: closing the valence gap,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 9, pp. 10 745–10 759, 2023

2023

[40] [40]

Qwen3-asr technical report,

X. Shi, X. Wang, Z. Guo, Y . Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y . Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-asr technical report,”arXiv preprint arXiv:2601.21337, 2026

Pith/arXiv arXiv 2026

[41] [41]

Uniss: Unified expressive speech-to-speech transla- tion with your voice,

S. Cheng, W. Bian, X. Wang, R. Yuan, J. Chen, S. Yin, Y . Guo, and W. Xue, “Uniss: Unified expressive speech-to-speech transla- tion with your voice,”arXiv preprint arXiv:2509.21144, 2025

arXiv 2025

[42] [42]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025

[43] [43]

Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,

X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025

Pith/arXiv arXiv 2025

[44] [44]

Distillation and pruning for scalable self-supervised representation-based speech quality assessment,

B. Stahl and H. Gamper, “Distillation and pruning for scalable self-supervised representation-based speech quality assessment,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. [Online]. Available: https://arxiv.org/abs/2502.05356

arXiv 2025

[45] [45]

The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,

K. Baba, W. Nakata, Y . Saito, and H. Saruwatari, “The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” inIEEE Spoken Language Technology Work- shop (SLT), 2024

2024

[46] [46]

Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,

C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,” inICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890

2022

[47] [47]

From wer and ril to mer and wil: improved evaluation measures for connected speech recognition

A. C. Morris, V . Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition.” inInterspeech, no. 4-8, 2004, p. 2004. A. Web Interface and Scoring Criteria for Human Quality Assessment B. Data Copyright Statement The challenge dataset is prepared for academic research and evaluation purposes onl...

2004