One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

2026 · eess.AS · arXiv 2606.18072

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.
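The abstract's distinction between average and instantaneous velocity can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical scalar velocity field v(z, t) = A * z, chosen only because its average velocity over an interval has a closed form, and it follows the usual MeanFlow convention that the average velocity over [r, t] is the displacement (z_t - z_r) divided by (t - r). All function names and the constant A are illustrative. Under those assumptions, a single step with the average velocity lands on the exact endpoint, while Euler integration of the instantaneous velocity needs many steps to get close.

```python
import math

A = 1.5  # hypothetical rate of the toy velocity field (an assumption, not from the paper)


def instantaneous_velocity(z, t):
    """Toy field v(z, t) = A * z; trajectories are z_t = z_1 * exp(A * (t - 1))."""
    return A * z


def average_velocity(z_t, r, t):
    """Average of v along the trajectory through z_t over [r, t].

    Equals the displacement (z_t - z_r) / (t - r); here z_r is known in
    closed form. In a learned system a network would approximate this quantity.
    """
    if t == r:
        return instantaneous_velocity(z_t, t)
    z_r = z_t * math.exp(A * (r - t))
    return (z_t - z_r) / (t - r)


def euler_sample(z1, num_steps):
    """Multi-step sampling: integrate the instantaneous velocity from t=1 to t=0."""
    z = z1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        z = z - dt * instantaneous_velocity(z, t)
    return z


def one_step_sample(z1):
    """One-step sampling: z_0 = z_1 - (1 - 0) * u(z_1, r=0, t=1)."""
    return z1 - (1.0 - 0.0) * average_velocity(z1, 0.0, 1.0)


z1 = 2.0                      # starting point at t = 1 (plays the role of noise)
exact = z1 * math.exp(-A)     # true endpoint of the trajectory at t = 0

one_step_error = abs(one_step_sample(z1) - exact)
euler_errors = {n: abs(euler_sample(z1, n) - exact) for n in (1, 10, 100)}

print("one-step error:", one_step_error)
for n, err in euler_errors.items():
    print(f"euler error with {n} steps:", err)
```

In this toy setting the one-step error is at floating-point level, and the Euler error shrinks only as the step count grows, which mirrors the quality-speed trade-off the abstract describes for multi-step decoders.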

representative citing papers

One-Step Token-to-Waveform Generation with MeanFlow in Latent Space

eess.AS · 2026-06-16 · unverdicted · novelty 7.0

MeanFlow applied in latent space enables true one-step Token2Wav generation with up to 17x RTF improvement and negligible quality loss versus multi-step baselines.

