SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
Pith reviewed 2026-05-19 18:54 UTC · model grok-4.3
The pith
SemaVoice adds a foundation-model alignment step to continuous speech representations so autoregressive TTS can keep semantic meaning without losing acoustic quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemaVoice introduces an SFM-guided alignment mechanism that refines continuous speech representations to capture local semantic consistency and global structural relationships; these representations then condition a patch-wise diffusion head inside the autoregressive framework, producing high-fidelity zero-shot TTS that reduces the mismatch between semantic-prosodic modeling and reconstruction-driven features.
What carries the argument
SFM-guided alignment mechanism that refines continuous speech representations to enforce local semantic consistency and global structural relationships before they condition the diffusion head.
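As a rough illustration of what "local semantic consistency" and "global structural relationships" could mean in practice, the sketch below computes two toy alignment terms in plain Python. It is not the paper's objective: the function names, the choice of cosine similarity, and the assumption that the latents have already been projected to the SFM feature dimension are all hypothetical choices made for this example.

```python
import math


def _cos(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def local_alignment_loss(latents, sfm_feats):
    """Hypothetical local term: mean per-frame cosine distance.

    Assumes latents[t] and sfm_feats[t] describe the same frame and
    share a dimension (e.g. after a learned projection).
    """
    return sum(1.0 - _cos(z, s) for z, s in zip(latents, sfm_feats)) / len(latents)


def global_alignment_loss(latents, sfm_feats):
    """Hypothetical global term: mean absolute gap between the
    frame-to-frame similarity matrices of the two sequences."""
    n = len(latents)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += abs(_cos(latents[i], latents[j]) - _cos(sfm_feats[i], sfm_feats[j]))
    return total / (n * n)


# Toy frames: the two sequences have the same internal structure
# (two orthogonal frames) but disagree frame by frame.
latents = [[1.0, 0.0], [0.0, 1.0]]
sfm_feats = [[0.0, 1.0], [1.0, 0.0]]
local_term = local_alignment_loss(latents, sfm_feats)
global_term = global_alignment_loss(latents, sfm_feats)
```

In this toy case the global term is zero while the local term is at its maximum for orthogonal frames, which suggests why an objective of this kind would plausibly need both terms rather than either one alone.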
If this is right
- Refined representations reduce the tendency of autoregressive generation to drift from intended meaning.
- Error accumulation across successive patches is limited, supporting longer coherent outputs.
- The same alignment step improves results at multiple representation granularities under a fixed information-rate budget.
- Objective word-error and subjective quality scores remain competitive with leading open-source zero-shot TTS systems.
Where Pith is reading between the lines
- The alignment idea could transfer to other continuous-generation domains such as music or environmental audio where high-level structure matters.
- It offers one route to combine the strengths of large semantic encoders with the flexibility of continuous acoustic modeling.
- Longer utterances or streaming scenarios might benefit if the alignment can be made causal and incremental.
Load-bearing premise
The speech foundation model alignment can correct the mismatch between semantic-prosodic needs and continuous acoustic representations without creating new artifacts or extra error buildup during autoregressive steps.
What would settle it
A controlled ablation that disables only the SFM-guided alignment while keeping every other component fixed and then measures whether semantic coherence scores drop or audible artifacts rise on the same test set.
Original abstract
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech representations. This mismatch causes TTS models to focus excessively on low-level acoustic textures at the expense of high-level semantic coherence, further exacerbating error accumulation in autoregressive generation. To address this challenge, we propose SemaVoice, a semantic-aware continuous autoregressive framework for high-fidelity zero-shot TTS. SemaVoice introduces a Speech Foundation Model (SFM) guided alignment mechanism that refines continuous speech representations to better capture both local semantic consistency and global structural relationships. These representations condition a patch-wise diffusion head within the autoregressive framework for high-quality speech synthesis. Experimental results on the Seed-TTS benchmark show that SemaVoice achieves an English WER of 1.71% and remains highly competitive with state-of-the-art open-source systems in both objective and subjective evaluations. The effectiveness of SFM guided alignment is further confirmed by significant improvements under varying representation granularities with a fixed information-rate constraint.
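For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference text and a transcript of the synthesized audio, divided by the number of reference words. The sketch below is a minimal plain-Python version; the function name and the lowercase, whitespace-split normalization are assumptions of this example, and the benchmark's own text normalization and ASR model are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
example = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

A corpus-level figure such as the 1.71% quoted above is usually obtained by summing errors and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.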
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SemaVoice, a semantic-aware continuous autoregressive framework for zero-shot TTS. It identifies a mismatch between semantic-prosodic modeling and reconstruction-driven continuous representations that leads to excessive focus on low-level acoustics and error accumulation. The solution introduces an SFM-guided alignment mechanism to refine continuous speech representations for improved local semantic consistency and global structural relationships; these representations then condition a patch-wise diffusion head within the autoregressive model. Experiments on the Seed-TTS benchmark report an English WER of 1.71% with competitiveness against open-source SOTA systems in objective and subjective metrics, plus ablations confirming gains under varying representation granularities at fixed information rate.
Significance. If the empirical results hold, the work offers a practical way to inject semantic awareness into continuous AR speech synthesis without sacrificing reconstruction quality. The fixed information-rate ablations and direct comparisons to open-source baselines provide a clear test of whether SFM alignment mitigates the stated mismatch, which could influence subsequent designs that combine foundation-model guidance with diffusion-based heads.
major comments (2)
- [§4.2] §4.2, alignment objective: the claim that SFM-guided alignment resolves the semantic-prosodic mismatch without introducing new artifacts rests on the reported WER and subjective scores, yet the manuscript does not quantify error accumulation rates across generation lengths or provide a direct comparison of semantic coherence metrics (e.g., sentence embedding similarity) between aligned and unaligned representations.
- [Table 2] Table 2, Seed-TTS English row: the 1.71% WER is presented as state-of-the-art among open-source systems, but the table omits confidence intervals or the number of evaluation utterances; without these, it is difficult to assess whether the improvement over the next-best baseline is statistically reliable.
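The semantic coherence comparison requested in the first major comment could be set up along the following lines. This is only a sketch under stated assumptions: it reads "sentence embedding similarity" as the cosine similarity between an embedding of the target text and an embedding of the transcript of the synthesized audio, the function names are hypothetical, and the three-dimensional vectors are made-up stand-ins for real sentence-encoder outputs, not results from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def mean_coherence(target_embeddings, transcript_embeddings):
    """Average similarity between each target sentence and its transcript."""
    if not target_embeddings or len(target_embeddings) != len(transcript_embeddings):
        raise ValueError("need two non-empty lists of equal length")
    sims = [
        cosine_similarity(t, h)
        for t, h in zip(target_embeddings, transcript_embeddings)
    ]
    return sum(sims) / len(sims)


# Made-up embeddings for two test sentences and two hypothetical systems.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
with_alignment = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
without_alignment = [[0.6, 0.4, 0.0], [0.5, 0.5, 0.0]]

gap = mean_coherence(targets, with_alignment) - mean_coherence(targets, without_alignment)
```

A positive `gap` on the same test sentences would be the kind of direct evidence the comment asks for; the sign in this toy example follows only from how the vectors were chosen.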
minor comments (3)
- [Eq. (7)] The notation for the patch-wise diffusion head (Eq. 7) uses p for both patch index and probability; a distinct symbol would improve readability.
- [Figure 3] Figure 3 caption does not state the exact number of samples used for the MOS listening test or whether listeners were screened for native English proficiency.
- [§2] The related-work section cites several continuous AR TTS papers but omits recent diffusion-based non-autoregressive baselines that also employ semantic conditioning; a brief comparison paragraph would strengthen context.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and have prepared revisions to incorporate additional analyses and statistical details where they strengthen the manuscript without altering its core claims.
-
Referee: [§4.2] §4.2, alignment objective: the claim that SFM-guided alignment resolves the semantic-prosodic mismatch without introducing new artifacts rests on the reported WER and subjective scores, yet the manuscript does not quantify error accumulation rates across generation lengths or provide a direct comparison of semantic coherence metrics (e.g., sentence embedding similarity) between aligned and unaligned representations.
Authors: We appreciate the referee's suggestion for more direct evidence. The reported WER reduction to 1.71% and competitive subjective scores already indicate that SFM-guided alignment improves semantic consistency without degrading perceptual quality, as the fixed information-rate ablations further isolate the benefit of alignment from mere capacity changes. Nevertheless, to provide a more explicit demonstration, we will add in the revised §4.2 a comparison of sentence-level embedding similarity (using cosine similarity from a frozen sentence transformer) between aligned and unaligned continuous representations on the Seed-TTS test set. Regarding error accumulation, our current experiments focus on benchmark-length utterances; while we do not have new long-form generation results ready, the observed gains across granularities at constant bitrate already suggest reduced drift. We will therefore include the embedding similarity analysis and note the limitation on accumulation quantification in the text. revision: partial
-
Referee: [Table 2] Table 2, Seed-TTS English row: the 1.71% WER is presented as state-of-the-art among open-source systems, but the table omits confidence intervals or the number of evaluation utterances; without these, it is difficult to assess whether the improvement over the next-best baseline is statistically reliable.
Authors: We agree that statistical context is helpful. The Seed-TTS English evaluation follows the benchmark protocol, but the table will be updated in the revision to explicitly state the number of utterances used and to report 95% bootstrap confidence intervals for all WER entries. This addition will allow readers to directly assess the reliability of the 1.71% result relative to the next-best open-source baseline. revision: yes
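One common way to obtain the 95% bootstrap confidence intervals mentioned in this response is a percentile bootstrap over utterances. The sketch below assumes per-utterance error counts and reference lengths are available; the function name, the resample count, and the toy numbers are illustrative choices for this example, not details taken from the paper or the benchmark.

```python
import random


def bootstrap_wer_ci(errors, ref_lengths, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus-level WER.

    errors[i] is the word-level edit distance for utterance i and
    ref_lengths[i] is its reference word count. Utterances are resampled
    with replacement and WER is recomputed as total errors / total words.
    """
    if not errors or len(errors) != len(ref_lengths):
        raise ValueError("need two non-empty lists of equal length")
    if any(length <= 0 for length in ref_lengths):
        raise ValueError("reference lengths must be positive")

    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_errors = sum(errors[i] for i in idx)
        total_words = sum(ref_lengths[i] for i in idx)
        stats.append(total_errors / total_words)
    stats.sort()

    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = sum(errors) / sum(ref_lengths)
    return point, lower, upper


# Toy per-utterance counts, invented for illustration only.
toy_errors = [0, 1, 0, 2, 0, 0, 1, 0]
toy_lengths = [10, 12, 8, 15, 9, 11, 10, 13]
point, lower, upper = bootstrap_wer_ci(toy_errors, toy_lengths)
```

Resampling whole utterances rather than individual words keeps errors within an utterance together, which is the usual motivation for this design when errors cluster in a few difficult sentences.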
Circularity Check
No significant circularity detected
The paper's central contribution is an architectural proposal (SFM-guided alignment refining continuous representations to condition a patch-wise diffusion head) whose effectiveness is asserted via direct empirical measurement on the Seed-TTS benchmark (WER 1.71%) and ablations under fixed information-rate constraints. No derivation chain is presented that reduces a claimed prediction or first-principles result to its own inputs by construction; the reported metrics are external evaluations rather than quantities fitted and then re-predicted within the same model. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Refining continuous speech representations via SFM alignment improves semantic-prosodic coherence without harming acoustic fidelity.