BareWave: Waveform-Native Flow-Matching Text-to-Speech
Pith reviewed 2026-06-27 15:12 UTC · model grok-4.3
The pith
BareWave generates speech directly from text to raw waveform in flow-matching TTS by solving three specific training challenges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BareWave is a fully waveform-native flow-matching TTS framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA) to make direct text-to-wave generation practical, delivering strong intelligibility, speaker similarity, and naturalness in zero-shot voice cloning experiments while using only a single waveform-native inference path.
What carries the argument
Velocity-aware perceptual alignment (VAPA) together with staged noise scheduling and training-time representation alignment, which align data-space perceptual objectives to the velocity-space flow objective for raw-waveform modeling.
If this is right
- Zero-shot voice cloning becomes feasible under a single waveform-native inference path with no pretrained components at test time.
- High-quality TTS no longer requires separately trained acoustic and waveform stages.
- Training can reach strong final operating points despite the absence of a strong pretrained scaffold in the waveform domain.
Where Pith is reading between the lines
- The same alignment and scheduling approach might reduce the number of separately trained modules needed in other audio generation tasks.
- Direct waveform modeling could lower overall system latency by eliminating intermediate representation conversions at inference.
- Future experiments could test whether VAPA-style objectives improve flow-matching performance on non-speech audio such as music or environmental sounds.
Load-bearing premise
The three proposed techniques together can overcome the optimization difficulties of modeling raw waveforms directly in flow-matching without a pretrained representational scaffold.
What would settle it
A controlled ablation showing that removing representation alignment, staged scheduling, or VAPA causes the system to fall below the intelligibility and naturalness of conventional staged TTS systems on the same zero-shot cloning test set.
Figures
read the original abstract
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BareWave, a fully waveform-native flow-matching TTS framework for direct text-to-waveform generation without intermediate acoustic representations or pretrained decoders at inference time. It explicitly identifies three training challenges (lack of pretrained representational scaffold, need for stage-specific noise schedules, and misalignment between data-space perceptual objectives and velocity-space flow objectives) and proposes three corresponding techniques: training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA). The central empirical claim is that these techniques enable strong zero-shot voice cloning performance in intelligibility, speaker similarity, and naturalness under a strictly waveform-native path.
Significance. If the results are robustly validated with quantitative metrics and ablations, the work would be significant as an empirical demonstration that direct waveform modeling in flow-matching TTS can be made practical, potentially simplifying TTS pipelines by eliminating separately trained acoustic stages and test-time pretrained components. The explicit framing of optimization difficulties and targeted mitigations, along with the project page for audio demos, strengthens the contribution as a practical direction rather than a purely theoretical one.
major comments (1)
- [Abstract, §4] Abstract and §4 (Experiments): The central claim that the three techniques enable 'strong intelligibility, speaker similarity, and naturalness' in zero-shot cloning is presented without any reported metrics (e.g., WER, speaker similarity scores, MOS), baselines, dataset details, ablation results, or error analysis. This makes it impossible to determine whether the data support that the proposed methods overcome the stated optimization difficulties, rendering the empirical evidence load-bearing but unevaluable from the manuscript.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., a table reference or specific score) to ground the performance claims.
- [§3] Notation for VAPA and the staged noise schedule should be defined with explicit equations or pseudocode in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of a waveform-native flow-matching TTS approach. We address the major comment below and will incorporate revisions to strengthen the empirical presentation.
read point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): The central claim that the three techniques enable 'strong intelligibility, speaker similarity, and naturalness' in zero-shot cloning is presented without any reported metrics (e.g., WER, speaker similarity scores, MOS), baselines, dataset details, ablation results, or error analysis. This makes it impossible to determine whether the data support that the proposed methods overcome the stated optimization difficulties, rendering the empirical evidence load-bearing but unevaluable from the manuscript.
Authors: We agree that the submitted manuscript does not report specific quantitative metrics, baselines, dataset details, ablations, or error analysis in the abstract or §4, which prevents proper evaluation of the claims. The full experimental section in the manuscript describes the setup and results qualitatively but lacks the numerical tables and comparisons needed to substantiate 'strong' performance. We will revise the abstract to include key reported numbers (e.g., WER, speaker similarity cosine scores, MOS) and expand §4 with full tables, baseline comparisons (such as to cascaded systems), dataset specifications (e.g., LibriTTS or similar), ablation studies on the three proposed techniques, and error analysis. This addresses the concern directly. revision: yes
Circularity Check
No significant circularity; empirical demonstration only
full rationale
The manuscript describes an empirical TTS system that combines three training techniques (representation alignment, staged noise scheduling, VAPA) to enable direct waveform-native flow-matching. No equations, derivations, fitted parameters, or first-principles results are presented that could reduce to their own inputs by construction. The central claim is scoped as experimental evidence from zero-shot cloning trials rather than a closed mathematical chain. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the text. The argument is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Roi Benita, Michael Elad, and Joseph Keshet. 2024. DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation. InThe Twelfth International Conference on Learning Representations. doi:10.48550/arXiv.2310. 01381
-
[2]
Weiss, Mohammad Norouzi, and William Chan
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. 2021. WaveGrad: Estimating Gradients for Waveform Generation. InInternational Conference on Learning Representations. https://openreview.net/ forum?id=NsMLjcFaO8O
2021
-
[3]
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing.IEEE Journal of Selected Topics in Sig...
-
[4]
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen. 2025. F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad T...
2025
-
[5]
Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. 2025. DiP: Taming Diffusion Models in Pixel Space.arXiv preprint arXiv:2511.18822(2025). https://arxiv.org/ abs/2511.18822
arXiv 2025
-
[6]
Jeongsoo Choi, Zhikang Niu, Ji-Hoon Kim, Chunhui Wang, Joon Son Chung, and Xie Chen. 2025. Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment. InProceedings of Interspeech 2025. 3459–3463. doi:10.21437/Interspeech.2025-1236
-
[8]
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. 2024. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models.arXiv preprint arXiv:2412.10117(2024). doi:10.4...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2024
-
[9]
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, and Naoyuki Kanda. 2024. E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS. In2024 IEEE Spoken Language Technology Workshop (SLT). doi:10.1109/SLT61566.2024.10832320
-
[10]
Yuan Gao, Nobuyuki Morioka, Yu Zhang, and Nanxin Chen. 2023. E3 TTS: Easy End-to-End Diffusion-based Text to Speech. In2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). https://arxiv.org/abs/2311. 00945
2023
-
[11]
Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu. 2024. Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation. In2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 885–890. https...
arXiv 2024
-
[12]
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self- Supervised Speech Representation Learning by Masked Prediction of Hidden Units.IEEE/ACM Transactions on Audio, Speech, and Language Processing29 (2021), 3451–3460
2021
-
[13]
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker New- house, and Jeremy Bernstein. 2024. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
2024
-
[14]
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Eric Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiangyang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, and Sheng Zhao. 2024. Nat- uralSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. InProceedings of the 41st Internati...
2024
-
[15]
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative Ad- versarial Networks for Efficient and High Fidelity Speech Synthesis. InAdvances in Neural Information Processing Systems, Vol. 33. https://proceedings.neurips. cc/paper/2020/hash/c5d736809766d46260d816d8dbc9eb44-Abstract.html
2020
-
[16]
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. InInternational Conference on Learning Representations. https://openreview.net/forum?id=a- xFK8Ymz5J
2021
-
[17]
Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu. 2023. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. InAdvances in Neural Information Processing Sys- tems, Vol. 36. https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 2...
2023
-
[18]
Tianhong Li and Kaiming He. 2025. Back to Basics: Let Denoising Generative Models Denoise.arXiv preprint arXiv:2511.13720(2025). https://arxiv.org/abs/ 2511.13720
Pith/arXiv arXiv 2025
-
[19]
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. 2024. Autoregressive Image Generation without Vector Quantization. InAdvances in Neural Information Processing Systems, Vol. 37. doi:10.52202/079017-1797
-
[20]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. Flow Matching for Generative Modeling. InInternational Conference on Learning Representations. https://arxiv.org/abs/2210.02747
Pith/arXiv arXiv 2023
-
[21]
Tianze Luo, Xingchen Miao, and Wenbo Duan. 2025. WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching. InProceedings of the 2025 Con- ference of the Nations of the Americas Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (Volume 1: Long Papers). Asso- ciation for Computational Linguistics, Albuquerqu...
-
[22]
Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. 2025. DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation.arXiv preprint arXiv:2511.19365(2025). https://arxiv.org/abs/2511.19365
Pith/arXiv arXiv 2025
-
[23]
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In International Conference on Learning Representations. https://arxiv.org/abs/2006. 04558
2021
-
[24]
Hubert Siuzdak. 2023. Vocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis.arXiv preprint arXiv:2306.00814(2023)
arXiv 2023
-
[25]
Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. 2025. JetFormer: An autoregressive generative model of raw images and text. In International Conference on Learning Representations. https://proceedings.iclr. cc/paper_files/paper/2025/hash/d5a8e37f38a08c68162452dcba89ae9c-Abstract- Conference.html
2025
-
[26]
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. 2023. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers.arXiv preprint arXiv:2301.02111(2023). https://arxiv.org/abs/2301. 02111
Pith/arXiv arXiv 2023
-
[27]
Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. 2025. PixNerd: Pixel Neural Field Diffusion.arXiv preprint arXiv:2507.23268(2025). https://arxiv.org/abs/2507.23268
arXiv 2025
-
[28]
Weiss, R
Ron J. Weiss, R. J. Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, and Diederik P. Kingma. 2021. Wave-Tacotron: Spectrogram-Free End-to-End Text- to-Speech Synthesis. In2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://research.google/pubs/wave-tacotron- spectrogram-free-end-to-end-text-to-speech-synthesis/
2021
-
[29]
Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram. In2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). doi:10.1109/ICASSP40776.2020. 9053795
-
[30]
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jin- woo Shin, and Saining Xie. 2025. Representation Alignment for Generation: Train- ing Diffusion Transformers Is Easier Than You Think. InInternational Conference on Learning Representations. https://openreview.net/forum?id=DJSZGGZYVi
2025
-
[31]
Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo
-
[32]
PixelDiT: Pixel Diffusion Transformers for Image Generation.arXiv preprint arXiv:2511.20645(2025). https://arxiv.org/abs/2511.20645 10 A Detailed Experimental Setup A.1 Training Data and Preprocessing All waveform-native runs in this work are trained on the Eng- lish subset of Emilia. We build the training set after filtering out utterances with transcrip...
Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.