FlowTTS-GRPO: Online Reinforcement Learning with Multi-Objective Reward Optimization for Flow-Matching Based Text-to-Speech
Pith reviewed 2026-06-26 07:00 UTC · model grok-4.3
The pith
Converting ODE trajectories to SDE paths enables direct RL fine-tuning of flow-matching TTS models without auxiliary models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlowTTS-GRPO converts ordinary differential equation trajectories into stochastic differential equation paths to enable online reinforcement learning fine-tuning of flow-matching TTS models. This permits direct optimization using multi-objective rewards on open-source models like CosyVoice and F5-TTS. The method shows faster convergence with weighted rewards, and optimizations such as skipping classifier-free guidance speed up training while improving robustness and detail metrics through hard case synthesis and targeted RL application.
What carries the argument
The ODE-to-SDE conversion that turns deterministic flow paths into stochastic ones for RL-based sampling and optimization.
If this is right
- Weighted reward combination leads to faster convergence than probabilistic scheme.
- Omitting CFG during RL training accelerates convergence.
- Synthesizing hard cases improves robustness.
- RL on the FM component enhances audio-detail metrics.
- Results in objective and subjective gains in speaker similarity and perceptual quality.
Where Pith is reading between the lines
- This technique might extend to fine-tuning other types of generative flow models beyond speech synthesis.
- The identified optimizations could be tested in non-RL fine-tuning scenarios for flow models.
Load-bearing premise
The conversion of ODE trajectories to SDE paths preserves the generative capabilities of the flow-matching model sufficiently for effective online RL fine-tuning without introducing artifacts or instability.
What would settle it
If experiments showed that models fine-tuned with FlowTTS-GRPO had worse or equal performance in speaker similarity and perceptual quality compared to the base models, the effectiveness of the method would be questioned.
Figures
read the original abstract
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths, our method enables direct fine-tuning of open-source FM models without auxiliary models. We show that a weighted reward combination converges faster than a probabilistic scheme, and identify three practical optimizations: omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component enhances audio-detail metrics. Experiments on CosyVoice 3.0 and F5-TTS demonstrate objective and subjective preference gains in speaker similarity and perceptual quality, with F5-TTS also improving intelligibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlowTTS-GRPO, an online RL framework for flow-matching (FM) based TTS. By converting ODE trajectories into SDE paths, it enables direct fine-tuning of open-source FM models (CosyVoice 3.0, F5-TTS) without auxiliary networks. It claims a weighted multi-objective reward combination converges faster than a probabilistic scheme and identifies three optimizations (omitting CFG during training, synthesizing hard cases, applying RL to the FM component). Experiments are said to yield objective and subjective gains in speaker similarity, perceptual quality, and (for F5-TTS) intelligibility.
Significance. If the ODE-to-SDE conversion is shown to preserve generative fidelity without introducing instability or distribution shift, the approach could meaningfully extend RL fine-tuning techniques from LLM-based TTS to the FM setting, allowing direct improvement of open-source models. The practical optimizations and weighted-reward finding would be of interest to TTS practitioners if supported by reproducible ablations.
major comments (2)
- Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.
- Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will make targeted revisions to strengthen the abstract and methods presentation.
read point-by-point responses
-
Referee: Abstract: the central claim that ODE-to-SDE conversion 'enables direct fine-tuning ... without auxiliary models' and 'preserves the generative capabilities' is load-bearing, yet the abstract supplies no derivation, sampling procedure, or fidelity metric (e.g., distribution-shift or reconstruction error) to substantiate that the conversion does not introduce artifacts; this must be addressed with explicit equations and verification in the methods section before the empirical claims can be evaluated.
Authors: The full manuscript's Methods section already derives the ODE-to-SDE conversion, specifies the sampling procedure, and reports fidelity verification (distribution shift and reconstruction error metrics) confirming no artifacts or instability. To address the referee's concern directly, we will revise the abstract to include a concise reference to these elements and ensure the methods section explicitly highlights the verification results with equations. revision: yes
-
Referee: Abstract: objective and subjective 'preference gains' are asserted without any numerical values, error bars, dataset sizes, or baseline comparisons, rendering the magnitude and reliability of the reported improvements impossible to assess; this is a load-bearing gap for the experimental contribution.
Authors: We agree the abstract is too high-level. The results section contains the full numerical results, error bars, dataset details, and baseline comparisons. We will revise the abstract to report key quantitative preference gains (with error bars and dataset sizes) so the magnitude of improvements is immediately clear. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The provided abstract and context describe an RL framework for flow-matching TTS that converts ODE trajectories to SDE paths to enable direct fine-tuning without auxiliary models, compares weighted vs. probabilistic reward schemes, and lists three practical optimizations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are exhibited in the given material. The central claims rest on empirical results from external models (CosyVoice 3.0, F5-TTS) and standard RL techniques rather than reducing to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction Text-to-speech (TTS) converts input text into audible speech and plays a key role in human–computer interaction. Re- cently, researchers have incorporated large language models (LLMs) [1] into TTS [2–9]. One line of work uses LLMs’ ability to model discrete speech tokens with in-context learn- ing (ICL) [2, 7, 8]: a speech codec [10, 11] prod...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning
Methods 2.1. The TTS model used for RL finetuning We select CosyV oice 3.0 [6] (CV3) and F5-TTS [13] as pre- trained TTS models for RL finetuning. Zero-shot voice-cloning refers to generating speech in a target speaker’s voice using only a short prompt audio without requiring explicit speaker adapta- tion or fine-tuning. To refine zero-shot voice-cloning ...
2000
-
[3]
Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets
Experimental Setup 3.1. Training Dataset We use WenetSpeech4TTS [38] Premium (Chinese) and LibriTTS-960 [39] (English) as training sets. Audio files serve as prompt waveforms with transcripts as prompt text. We ran- domly shuffle the original text corpus to produce target texts for voice cloning. We construct 20k samples each for Chinese and English (40k ...
2000
-
[4]
Results and Discussion 4.1. Evaluation Metrics Following CV3 [6], we evaluate the effect of FlowTTS-GRPO fine-tuning using three objective metrics: • Content consistency (CER/WER): measures the intelligibil- ity of synthesized speech. We report Character Error Rate (CER) for Chinese and other non-English languages, and Word Error Rate (WER) for English. F...
2000
-
[5]
Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths
Conclusion We introduce FlowTTS-GRPO, the first application of Flow- GRPO to text-to-speech models. Our method enables direct fine-tuning of open-source FM-only and LLM-FM hybrid mod- els by converting ODE trajectories to SDE paths. Our frame- work simplifies prior RL approaches by eliminating value net- works, preference pairs, and token-to-reward models...
-
[6]
The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results
Generative AI Use Disclosure We use ChatGPT (OpenAI) and Qwen3-Max (Alibaba) for En- glish grammar checking, sentence polishing in the manuscript. The AI tools are used only for grammar checking and did not generate any significant part of the scientific content, technical contributions, or experimental results. All authors are fully re- sponsible for the...
-
[7]
Scaling Laws for Neural Language Models
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[8]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Liet al., “Neural codec language mod- els are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Lauragpt: Listen, attend, understand, and re- generate audio with gpt,
Z. Du, J. Wang, Q. Chen, Y . Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Maet al., “Lauragpt: Listen, attend, understand, and re- generate audio with gpt,”arXiv preprint arXiv:2310.04673, 2023
-
[10]
Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y . Yang, H. Hu, S. Zheng, Y . Gu, Z. Maet al., “Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,”arXiv preprint arXiv:2407.05407, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Z. Du, Y . Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y . Yang, C. Gao, H. Wanget al., “Cosyvoice 2: Scalable stream- ing speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shiet al., “Cosyvoice 3: Towards in-the- wild speech generation via scaling-up and post-training,”arXiv preprint arXiv:2505.17589, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,
Z. Ye, X. Zhu, C.-M. Chan, X. Wang, X. Tan, J. Lei, Y . Peng, H. Liu, Y . Jin, Z. Daiet al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,”arXiv preprint arXiv:2502.04128, 2025
-
[14]
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Fenget al., “Spark-tts: An efficient llm- based text-to-speech model with single-stream decoupled speech tokens,”arXiv preprint arXiv:2503.01710, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
P. Anastassiou, J. Chen, J. Chen, Y . Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gaoet al., “Seed-tts: A family of high-quality versatile speech generation models,”arXiv preprint arXiv:2406.02430, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Soundstream: An end-to-end neural audio codec,
N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,”IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, vol. 30, pp. 495–507, 2021
2021
-
[17]
High Fidelity Neural Audio Compression
A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High fidelity neural audio compression,”arXiv preprint arXiv:2210.13438, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
Recent advances in discrete speech tokens: A review,
Y . Guo, Z. Li, H. Wang, B. Li, C. Shao, H. Zhang, C. Du, X. Chen, S. Liu, and K. Yu, “Recent advances in discrete speech tokens: A review,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
2025
-
[19]
F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,
Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 6255–6271
2025
-
[20]
Flow straight and fast: Learning to generate and transfer data with rectified flow,
X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” inProc. ICLR. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=XVjTT1nw5z
2023
-
[21]
S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025
-
[22]
Train- ing language models to follow instructions with human feedback,
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Train- ing language models to follow instructions with human feedback,” Proc. NeurIPS, vol. 35, pp. 27 730–27 744, 2022
2022
-
[23]
Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,
J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human prefer- ences for text-to-image generation,”Proc. NeurIPS, vol. 36, pp. 15 903–15 935, 2023
2023
-
[24]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[25]
Preference alignment improves language model-based tts,
J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Preference alignment improves language model-based tts,” in Proc. ICASSP. IEEE, 2025, pp. 1–5
2025
-
[26]
Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,
X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” inProc. ICASSP. IEEE, 2025, pp. 1–5
2025
-
[27]
Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,
S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21 230–21 245
2025
-
[28]
Direct preference optimization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Proc. NeurIPS, vol. 36, pp. 53 728–53 741, 2023
2023
-
[29]
Differentiable Reward Optimiza- tion for LLM based TTS system,
C. Gao, Z. Du, and S. Zhang, “Differentiable Reward Optimiza- tion for LLM based TTS system,” inProc. Interspeech, 2025, pp. 2450–2454
2025
-
[30]
Rrpo: Robust reward policy optimization for llm-based emotional tts,
C. Wang, C. Gao, Y . Xiang, Z. Du, K. An, H. Zhao, Q. Chen, X. Li, Y . Gao, and Y . Li, “Rrpo: Robust reward policy optimization for llm-based emotional tts,”arXiv preprint arXiv:2512.04552, 2025
-
[31]
Group relative policy optimization for text-to-speech with large language models,
C. Liu, Y .-J. Hu, Y .-Y . Gao, S.-L. Zhang, and Z.-H. Ling, “Group relative policy optimization for text-to-speech with large language models,”arXiv preprint arXiv:2509.18798, 2025
-
[32]
F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,
X. Sun, R. Xiao, J. Mo, B. Wu, Q. Yu, and B. Wang, “F5r-tts: Improving flow-matching based text-to-speech with group relative policy optimization,”arXiv preprint arXiv:2504.02407, 2025
-
[33]
Flow-GRPO: Training Flow Matching Models via Online RL
J. Liu, G. Liu, J. Liang, Y . Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,”arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,
H. Wang, B. Tian, Y . Jiang, Z. Pan, S. Zhao, B. Ma, D. Chen, and X. Li, “Flowse-grpo: Training flow matching speech en- hancement via online reinforcement learning,”arXiv preprint arXiv:2601.16483, 2026
-
[35]
HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,
J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Genera- tive Adversarial Networks for Efficient and High Fi- delity Speech Synthesis,” inProc. NeurIPS, 2020. [On- line]. Available: https://proceedings.neurips.cc/paper/2020/hash/ c5d736809766d46260d816d8dbc9eb44-Abstract.html
2020
-
[36]
H. Siuzdak, “V ocos: Closing the gap between time-domain and fourier-based neural vocoders for high-quality audio synthesis,” arXiv preprint arXiv:2306.00814, 2023
-
[37]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
J. Li, Y . Cui, T. Huang, Y . Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode- sde,”arXiv preprint arXiv:2507.21802, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,
Y . Chen, S. Zheng, H. Wang, L. Cheng, Q. Chen, and J. Qi, “An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification,” inProc. Interspeech, 2023, pp. 2228–2232
2023
-
[39]
Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,
Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,” inInterspeech. ISCA, 2022, pp. 2063– 2067
2022
-
[40]
Faster whisper large v3,
Systran, “Faster whisper large v3,” 2023. [Online]. Available: https://huggingface.co/Systran/faster-whisper-large-v3
2023
-
[41]
Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,
C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos p. 835: A non- intrusive perceptual objective speech quality metric to evaluate noise suppressors,” inProc. ICASSP. IEEE, 2022, pp. 886–890
2022
-
[42]
Promptrl: Prompt matters in rl for flow-based image generation,
F.-Y . Wang, H. Zhang, M. Gharbi, H. Li, and T. Park, “Promptrl: Prompt matters in rl for flow-based image generation,”arXiv preprint arXiv:2602.01382, 2026
-
[43]
Maskgct: Zero-shot text-to- speech with masked generative codec transformer,
Y . Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to- speech with masked generative codec transformer,”arXiv preprint arXiv:2409.00750, 2024
-
[44]
Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,
L. Ma, D. Guo, K. Song, Y . Jiang, S. Wang, L. Xue, W. Xu, H. Zhao, B. Zhang, and L. Xie, “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model bench- mark,”arXiv preprint arXiv:2406.05763, 2024
-
[45]
Libritts: A corpus derived from librispeech for text- to-speech,
H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “Libritts: A corpus derived from librispeech for text- to-speech,” inProc. Interspeech, 2019, pp. 1526–1530
2019
-
[46]
Lora: Low-rank adaptation of large language models
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022
2022
-
[47]
Large-scale self-supervised speech representation learning for automatic speaker verification,
Z. Chen, S. Chen, Y . Wu, Y . Qian, C. Wang, S. Liu, Y . Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6147–6151
2022
-
[48]
Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,
C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inProc. ICASSP. IEEE, 2021, pp. 6493–6497
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.