MeanVC 2 introduces future-receptive chunking and a universal timbre token encoder to achieve lower-latency and more robust streaming zero-shot voice conversion than the original MeanVC.
MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion
abstract
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has three limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, its conversion quality degrades under small-chunk settings, and its timbre encoder relies directly on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
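To make the idea of scheduling past and future receptive fields per decoder layer more concrete, the following is a minimal sketch of a chunk-level attention mask. It is an illustration under stated assumptions, not the MeanVC 2 implementation: the function names, the 4-layer schedule, and the chunk size in frames are all hypothetical, and the abstract does not specify the actual schedule.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    A frame may see its own chunk, up to `past_chunks` earlier chunks,
    and up to `future_chunks` later chunks (the bounded future context).
    """
    mask = []
    for i in range(num_frames):
        ci = i // chunk_size  # chunk index of the query frame
        row = []
        for j in range(num_frames):
            cj = j // chunk_size  # chunk index of the key frame
            row.append(ci - past_chunks <= cj <= ci + future_chunks)
        mask.append(row)
    return mask


def total_lookahead_chunks(future_per_layer):
    """Lookahead compounds across stacked layers, so the total is the sum."""
    return sum(future_per_layer)


# Hypothetical schedule (assumption): only the first two layers look ahead.
schedule = [(2, 1), (2, 1), (2, 0), (2, 0)]  # (past_chunks, future_chunks)
masks = [chunk_attention_mask(8, 2, p, f) for p, f in schedule]
lookahead = total_lookahead_chunks([f for _, f in schedule])
```

In this sketch, the sum of the per-layer future windows bounds how many chunks the decoder must wait for before emitting output, which is why confining future context to a few layers keeps the algorithmic latency bounded.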
fields: eess.AS (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)