ISCSLP 2026 CoT-TTS Challenge: Chain-of-Thought Reasoning for Context-Aware Text-to-Speech
Pith reviewed 2026-06-26 11:32 UTC · model grok-4.3
The pith
The ISCSLP 2026 CoT-TTS Challenge requires text-to-speech systems to produce explicit chain-of-thought reasoning about context before generating speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes the CoT-TTS Challenge as a benchmark in which systems must infer intended speaking manner from either textual or audio context, output an explicit chain-of-thought analysis of that inference, and then synthesize speech that remains consistent with both the analysis and the surrounding scene; two separate tracks, a bilingual training corpus, and a three-way evaluation protocol (objective metrics, multimodal LLM scoring, and human assessment) are defined to measure progress toward context-aware expressive speech generation.
What carries the argument
The CoT-TTS requirement that each system output both an explicit chain-of-thought reasoning trace and the generated waveform, applied across text-context and audio-context tracks.
If this is right
- Successful systems must combine context understanding with waveform generation rather than treating style as an independent input.
- The released bilingual training data and three-stage fine-tuning recipe will allow rapid iteration on models that jointly reason and synthesize speech.
- Leaderboard results will establish quantitative baselines for how well current architectures handle implicit style inference in long dialogues.
- Improved performance should directly benefit downstream tasks such as film dubbing and spoken dialogue agents that currently require manual style specification.
Where Pith is reading between the lines
- The challenge format could be extended to test whether reasoning traces transfer across languages or to new acoustic environments not seen in training.
- If the evaluation protocol proves stable, similar CoT requirements might be applied to other generative audio tasks such as music or sound-effect synthesis from scene descriptions.
- The separation of text-context and audio-context tracks offers a controlled comparison of how much additional information audio cues provide for style inference.
Load-bearing premise
That the filtered evaluation set together with the combination of objective metrics, multimodal LLM scoring, and human judgments will reliably isolate and measure genuine context-aware reasoning ability rather than dataset-specific patterns.
What would settle it
Top-ranked systems on the challenge leaderboard produce speech that listeners rate as inconsistent with the intended manner when tested on new conversational scenes drawn from the same media domains but excluded from the official evaluation set.
Figures
read the original abstract
Recent advances in text-to-speech (TTS) have greatly improved speech naturalness, speaker similarity, and controllability. However, most existing controllable TTS systems still rely on explicit user-provided style prompts, making it difficult to automatically determine how a sentence should be spoken in long and complex conversational scenarios. This proposal introduces the ISCSLP 2026 CoT-TTS Challenge, which aims to evaluate whether a system can infer the intended speaking manner from contextual information and generate speech consistent with both the reasoning output and the surrounding scene. The challenge contains two tracks: text-context-aware CoT-TTS and audio-context-aware CoT-TTS. We construct a large-scale bilingual training set from speech-rich media and provide carefully filtered evaluation data for leaderboard comparison. Each system is required to output both a chain-of-thought reasoning analysis and the generated speech waveform. The official evaluation combines objective metrics, multimodal LLM-based evaluation, and human subjective assessment. To facilitate reproducibility, we provide inference code together with a fine-tuning recipe for a 0.6B Qwen3-based model trained via a three-stage strategy. This challenge is expected to support research on context understanding, chain-of-thought reasoning, and expressive speech generation for applications such as film dubbing, audiobook production, virtual characters, and spoken dialogue agents. Further information about the associated challenge is available at:https://iscslp2026-cot-tts.github.io/challenge-website/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the ISCSLP 2026 CoT-TTS Challenge consisting of two tracks (text-context-aware CoT-TTS and audio-context-aware CoT-TTS). It supplies a large-scale bilingual training set derived from speech-rich media, filtered evaluation data, and an evaluation protocol that combines objective metrics, multimodal LLM-based assessment, and human subjective listening tests. Systems must produce both chain-of-thought reasoning and the corresponding speech waveform. A reproducibility baseline is supplied in the form of inference code and a three-stage fine-tuning recipe for a 0.6B Qwen3 model.
Significance. If executed as described, the challenge could stimulate research on inferring expressive speaking style from conversational context rather than explicit prompts, with downstream relevance to dubbing, audiobooks, and dialogue agents. The explicit provision of training data, evaluation code, and a documented baseline constitutes a concrete contribution to reproducibility that strengthens the proposal.
minor comments (1)
- [Abstract] Abstract, final sentence: the URL is written without a space after 'at:' ('at:https://...').
Simulated Author's Rebuttal
We thank the referee for the thorough review and positive recommendation to accept the manuscript. The comments confirm that the challenge proposal, data provision, evaluation protocol, and reproducibility baseline are viewed as a concrete contribution.
Circularity Check
No significant circularity: challenge proposal with no derivations
full rationale
This document is a challenge proposal that defines tracks, evaluation protocols, and a training recipe without any quantitative claims, derivations, equations, or predictions. No load-bearing steps exist that could reduce to fitted inputs, self-definitions, or self-citations. The central statements are definitional descriptions of the intended evaluation setup, which are self-contained by construction and contain no testable assertions that could exhibit circularity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Session/Challenge Organizers The proposed session organizers are listed below, together with their contact information and brief biographies where available for reference and further communication. Wei Xue, The Hong Kong University of Science and Technol- ogy Email:weixue@ust.hk Brief biography:Wei Xue is an Assistant Professor at the Hong Kong University...
Pith/arXiv arXiv 2026
-
[2]
Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis
Background, Motivation, and Potential Impact Recent advances in large language models have expanded the capabilities of text-to-speech (TTS) systems [1]. Modern TTS models can now support expressive style control, instruction- following generation, and long-form multi-speaker speech synthesis. Natural-language style prompts have been used for expressive T...
2026
-
[3]
The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing
Tentative List of Authors / Teams Who Could Contribute Papers We expect this challenge to attract participating teams from both academia and industry. The topic is closely related to current re- search in speech synthesis, speech foundation models, spoken dialogue systems, and multimodal speech processing. The fol- lowing list represents potential contrib...
-
[4]
The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline
Tentative Challenge Timeline The tentative timeline is designed to follow the official ISCSLP 2026 schedule. The challenge will be launched after proposal acceptance, and the model submission deadline will be aligned with the paper submission deadline. The evaluation stage will take place before the paper acceptance notification, so that the final ranking...
2026
-
[5]
The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results
Planned Format of the Challenge Session The proposed challenge session will be organized as a two-hour event. The session will begin with an overview of the chal- lenge, including the task setup, dataset construction, evaluation methodology, and overall results. Representatives from the winning teams in each track and model category will then be invited t...
-
[6]
They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing
Recommended Expert Reviewers The following researchers are recommended as potential expert reviewers for papers submitted to this challenge session. They have relevant expertise in speech synthesis, speech generation, speech understanding, and spoken language processing. Table 4:Tentative list of recommended reviewers. Name Affiliation Xie Chen Shanghai J...
-
[7]
Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline
Additional or Non-Standard Resources Needed 7.1. Training Dataset The training dataset is built from speech-rich media through a step-by-step processing pipeline. We first select suitable source materials and convert them into normalized audio. The audio is then processed to obtain utterance-level information such as speaker labels, timestamps, and spoken...
-
[8]
speaker-id: spoken text
Challenge Description This challenge focuses on context-aware CoT-TTS. Given con- textual information, a reference speech sample, and a target text, a system is expected to infer the appropriate speaking style of the target sentence, produce a chain-of-thought reasoning anal- ysis, and generate speech with the speaker timbre specified by the reference spe...
-
[9]
Towards con- trollable speech synthesis in the era of large language models: A systematic survey,
T. Xie, Y . Rong, P. Zhang, W. Wang, and L. Liu, “Towards con- trollable speech synthesis in the era of large language models: A systematic survey,” inProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, 2025, pp. 764– 791
2025
-
[10]
Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,
D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng, “Instructtts: Modelling expressive tts in discrete latent space with natural lan- guage style prompt,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2913–2925, 2024
2024
-
[11]
Fish audio s2 technical report,
S. Liao, Y . Wang, S. Liu, Y . Cheng, R. Zhang, T. Li, S. Li, Y . Zheng, X. Liu, Q. Wanget al., “Fish audio s2 technical report,” arXiv preprint arXiv:2603.08823, 2026
arXiv 2026
-
[12]
V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,
Y . Zhou, G. Zeng, X. Liu, X. Li, R. Yu, Z. Wang, R. Ye, W. Sun, J. Gui, K. Liet al., “V oxcpm: Tokenizer-free tts for context-aware speech generation and true-to-life voice cloning,”arXiv preprint arXiv:2509.24650, 2025
arXiv 2025
-
[13]
Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,
S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 41, 2026, pp. 35 139–35 148
2026
-
[14]
Borderless long speech synthesis,
X. Song, D. Wu, D. Zhou, P. Cheng, H. Ding, Y . He, J. Wang, S. Shen, S. Lv, L. Fanet al., “Borderless long speech synthesis,” arXiv preprint arXiv:2603.19798, 2026
Pith/arXiv arXiv 2026
-
[15]
Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,
H. Xie, H. Lin, W. Cao, D. Guo, W. Tian, J. Wu, H. Wen, R. Shang, H. Liu, Z. Jianget al., “Soulx-podcast: Towards realis- tic long-form podcasts with dialectal and paralinguistic diversity,” arXiv preprint arXiv:2510.23541, 2025
arXiv 2025
-
[16]
J. Yang, Y . Fujita, and Y . Sudo, “Duplexcascade: Full- duplex speech-to-speech dialogue with vad-free cascaded asr- llm-tts pipeline and micro-turn optimization,”arXiv preprint arXiv:2603.09180, 2026
arXiv 2026
-
[17]
Y . A. Li, X. Jiang, J. Darefsky, G. Zhu, and N. Mesgarani, “Style- talker: Finetuning audio language model and style-based text-to- speech model for fast spoken dialogue generation,”arXiv preprint arXiv:2408.11849, 2024
arXiv 2024
-
[18]
M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,
J. Xue, Y . Deng, F. Wang, Y . Li, Y . Gao, J. Tao, J. Sun, and J. Liang, “M 2-ctts: End-to-end multi-scale multi-modal conver- sational text-to-speech synthesis,” inICASSP 2023-2023 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2023, pp. 1–5
2023
-
[19]
H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech
D. Seong and J.-H. Chang, “H4c-tts: Leveraging multi-modal historical context for conversational text-to-speech.” inINTER- SPEECH, 2024
2024
-
[20]
Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,
W. Wu, Z. Lin, Y . Zhou, J. Li, R. Niu, Q. Wu, S. Cao, L. Ma, and Z. Wu, “Diffcss: Diverse and expressive conversational speech synthesis with diffusion models,” inICASSP 2025-2025 IEEE In- ternational Conference on Acoustics, Speech and Signal Process- ing (ICASSP). IEEE, 2025, pp. 1–5
2025
-
[21]
Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,
Y . Deng, J. Xue, F. Wang, Y . Gao, and Y . Li, “Cmcu-css: En- hancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis,” inProceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 6081–6089
2023
-
[22]
Actormind: Emulating hu- man actor reasoning for speech role-playing,
X. Chen, W. Xue, and Y . Guo, “Actormind: Emulating hu- man actor reasoning for speech role-playing,”arXiv preprint arXiv:2604.11103, 2026
Pith/arXiv arXiv 2026
-
[23]
A preliminary exploration with gpt-4o voice mode,
Y .-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with gpt-4o voice mode,”arXiv preprint arXiv:2502.09940, 2025
arXiv 2025
-
[24]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radfordet al., “Gpt-4o system card,”arXiv preprint arXiv:2410.21276, 2024
Pith/arXiv arXiv 2024
-
[25]
Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,
X. Su, Z. Sun, P. Jia, and J. Gao, “Captalk: Unified voice de- sign for single-utterance and dialogue speech generation,”arXiv preprint arXiv:2604.08363, 2026
Pith/arXiv arXiv 2026
-
[26]
Dailydialog: A manually labelled multi-turn dialogue dataset,
Y . Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” inProceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 986–995
2017
-
[27]
Dailytalk: Spoken dialogue dataset for conversational text-to-speech,
K. Lee, K. Park, and D. Kim, “Dailytalk: Spoken dialogue dataset for conversational text-to-speech,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP). IEEE, 2023, pp. 1–5
2023
-
[28]
Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,
R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling,” inProceedings of the AAAI Conference on Ar- tificial Intelligence, vol. 38, no. 17, 2024, pp. 18 698–18 706
2024
-
[29]
Prompt-guided selective masking loss for context-aware emotive text-to-speech,
Y . Jeon, Y . Kim, J. Lee, and G. Lee, “Prompt-guided selective masking loss for context-aware emotive text-to-speech,” inFind- ings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 638–650
2025
-
[30]
Generative expressive conversational speech synthesis,
R. Liu, Y . Hu, Y . Ren, X. Yin, and H. Li, “Generative expressive conversational speech synthesis,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4187–4196
2024
-
[31]
Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,
W. Bian, Y . Zhou, K. Zhang, and X. Gu, “Emospeech: A corpus of emotionally rich and contextually detailed speech annotations,” in2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2024, pp. 417–420
2024
-
[32]
Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,
C. Cheng, H. Sun, B. Du, S. Shang, X. Hu, and R. Yan, “Dnaspeech: A contextualized and situated text-to-speech dataset with dialogues, narratives and actions,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 18 599–18 616
2025
-
[33]
But system description to voxceleb speaker recognition chal- lenge 2019,
H. Zeinali, S. Wang, A. Silnova, P. Mat ˇejka, and O. Plchot, “But system description to voxceleb speaker recognition chal- lenge 2019,”arXiv preprint arXiv:1910.12592, 2019
arXiv 2019
-
[34]
Wespeaker: A research and production oriented speaker embedding learning toolkit,
H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y . Deng, and Y . Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” inIEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5
2023
-
[35]
pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,
H. Bredin, “pyannote.audio 2.1 speaker diarization pipeline: prin- ciple, benchmark, and recipe,” inProc. INTERSPEECH 2023, 2023
2023
-
[36]
Powerset multi-class cross entropy loss for neural speaker diarization,
A. Plaquet and H. Bredin, “Powerset multi-class cross entropy loss for neural speaker diarization,” inProc. INTERSPEECH 2023, 2023
2023
-
[37]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948
Pith/arXiv arXiv 2025
-
[38]
Funasr: A fundamental end-to-end speech recog- nition toolkit,
Z. Gaoet al., “Funasr: A fundamental end-to-end speech recog- nition toolkit,” inINTERSPEECH, 2023
2023
-
[39]
Dawn of the trans- former era in speech emotion recognition: closing the valence gap,
J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the trans- former era in speech emotion recognition: closing the valence gap,”IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 45, no. 9, pp. 10 745–10 759, 2023
2023
-
[40]
X. Shi, X. Wang, Z. Guo, Y . Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y . Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-asr technical report,”arXiv preprint arXiv:2601.21337, 2026
Pith/arXiv arXiv 2026
-
[41]
Uniss: Unified expressive speech-to-speech transla- tion with your voice,
S. Cheng, W. Bian, X. Wang, R. Yuan, J. Chen, S. Yin, Y . Guo, and W. Xue, “Uniss: Unified expressive speech-to-speech transla- tion with your voice,”arXiv preprint arXiv:2509.21144, 2025
arXiv 2025
-
[42]
Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388
Pith/arXiv arXiv 2025
-
[43]
Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,
X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025
Pith/arXiv arXiv 2025
-
[44]
B. Stahl and H. Gamper, “Distillation and pruning for scalable self-supervised representation-based speech quality assessment,” inProc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. [Online]. Available: https://arxiv.org/abs/2502.05356
arXiv 2025
-
[45]
The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,
K. Baba, W. Nakata, Y . Saito, and H. Saruwatari, “The t05 system for the V oiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” inIEEE Spoken Language Technology Work- shop (SLT), 2024
2024
-
[46]
Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,
C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non-intrusive perceptual objective speech quality metric to evalu- ate noise suppressors,” inICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 886–890
2022
-
[47]
From wer and ril to mer and wil: improved evaluation measures for connected speech recognition
A. C. Morris, V . Maier, and P. D. Green, “From wer and ril to mer and wil: improved evaluation measures for connected speech recognition.” inInterspeech, no. 4-8, 2004, p. 2004. A. Web Interface and Scoring Criteria for Human Quality Assessment B. Data Copyright Statement The challenge dataset is prepared for academic research and evaluation purposes onl...
2004
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.