X-VC: Zero-shot Streaming Voice Conversion in Codec Space
Pith reviewed 2026-05-10 14:16 UTC · model grok-4.3
The pith
X-VC converts speech to unseen voices in one step inside a neural codec's latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
X-VC achieves one-step conversion by using a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. Training employs generated paired data and a role-assignment strategy combining standard, reconstruction, and reversed modes to reduce the train-inference gap. Streaming inference uses a chunkwise scheme with overlap smoothing aligned with the codec's segment-based training paradigm. On Seed-TTS-Eval, this yields the best streaming word error rates in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines.
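The role-assignment strategy is described only at this level of abstraction; the sketch below shows one plausible shape for the mode switching during training, assuming generated paired data (the same content rendered by speakers A and B). The tensor names, mode probabilities, loss, and model interface are illustrative assumptions, not the authors' code.

    import random
    import torch.nn.functional as F

    def training_step(model, pair, mode_probs=(0.4, 0.3, 0.3)):
        """One hypothetical training step with role-assignment mode switching.

        `pair` holds codec latents for two renditions of the same content
        (speakers A and B, from generated paired data) plus a reference
        clip for each speaker. All names here are illustrative.
        """
        mode = random.choices(["standard", "reconstruction", "reversed"],
                              weights=mode_probs)[0]
        if mode == "standard":          # A -> B, matching the inference setup
            inp, ref, target = pair["lat_a"], pair["ref_b"], pair["lat_b"]
        elif mode == "reconstruction":  # A -> A, anchoring content preservation
            inp, ref, target = pair["lat_a"], pair["ref_a"], pair["lat_a"]
        else:                           # reversed: B -> A, reusing the pair
            inp, ref, target = pair["lat_b"], pair["ref_a"], pair["lat_a"]

        pred = model(inp, ref)           # dual-conditioned one-step conversion
        return F.mse_loss(pred, target)  # placeholder regression loss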
What carries the argument
A dual-conditioning acoustic converter that jointly processes source codec latents with frame-level target acoustic conditions, combined with adaptive normalization that injects utterance-level speaker identity.
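"Adaptive normalization" places this in the AdaIN/FiLM family of conditioning layers; a minimal sketch of how an utterance-level speaker embedding could modulate normalized hidden states follows. Dimensions and module names are assumptions, not taken from the paper.

    import torch
    import torch.nn as nn

    class AdaptiveLayerNorm(nn.Module):
        """FiLM/AdaIN-style conditioning: the speaker embedding predicts a
        scale and shift applied after parameter-free layer normalization."""

        def __init__(self, dim: int, spk_dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim, elementwise_affine=False)
            self.to_scale_shift = nn.Linear(spk_dim, 2 * dim)

        def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, dim) hidden states; spk: (batch, spk_dim)
            scale, shift = self.to_scale_shift(spk).chunk(2, dim=-1)
            return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

    # usage: modulate converter hidden states with an utterance-level embedding
    layer = AdaptiveLayerNorm(dim=512, spk_dim=256)
    out = layer(torch.randn(2, 100, 512), torch.randn(2, 256))  # (2, 100, 512)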
If this is right
- One-step codec latent conversion becomes viable for interactive zero-shot VC without separate vocoder stages.
- The role-assignment training reduces mismatch for unseen targets in streaming setups.
- Chunkwise inference aligned with codec training delivers low real-time factor while preserving quality.
- Cross-lingual speaker similarity holds without language-specific retraining.
Where Pith is reading between the lines
- Advances in general neural codecs could directly lift VC performance across more audio domains.
- The method might extend to other low-latency generative tasks such as real-time speech enhancement.
- Integration into end-to-end pipelines could eliminate the need for separate acoustic feature extractors.
Load-bearing premise
Training on generated paired data with the role-assignment strategy sufficiently prepares the model for truly unseen target speakers under real streaming conditions.
What would settle it
Run the system in true streaming mode on live recordings from real, unseen human target speakers providing short references, and measure whether streaming WER rises above the reported levels or speaker similarity falls sharply.
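In code, such a check could look like the loop below: a sketch assuming a hypothetical stream_convert interface, an ASR transcriber, and a speaker-embedding model that returns unit-norm vectors. None of these names come from the paper, and the evaluation stack is an assumption.

    import numpy as np
    from jiwer import wer  # pip install jiwer

    def evaluate_streaming(stream_convert, asr_transcribe, spk_embed, test_set):
        """Measure streaming WER and speaker similarity on unseen targets.

        stream_convert(src_wav, ref_wav) -> converted waveform (hypothetical);
        asr_transcribe(wav) -> text; spk_embed(wav) -> unit-norm embedding.
        """
        errors, sims = [], []
        for src_wav, ref_wav, ref_text in test_set:
            out = stream_convert(src_wav, ref_wav)
            errors.append(wer(ref_text, asr_transcribe(out)))
            # cosine similarity; embeddings are assumed unit-norm
            sims.append(float(np.dot(spk_embed(out), spk_embed(ref_wav))))
        return float(np.mean(errors)), float(np.mean(sims))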
Original abstract
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Our audio samples, code and checkpoints are released at https://github.com/Jerrister/X-VC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces X-VC, a zero-shot streaming voice conversion system performing one-step conversion in the latent space of a pretrained neural codec. It employs a dual-conditioning acoustic converter jointly modeling source codec latents and frame-level acoustic conditions from target reference speech, with utterance-level speaker information injected via adaptive normalization. Training uses generated paired data and a role-assignment strategy (standard, reconstruction, and reversed modes) to reduce train-inference mismatch, combined with chunkwise inference and overlap smoothing for streaming. On Seed-TTS-Eval, it claims the best streaming WER in English and Chinese, strong speaker similarity (including cross-lingual), and substantially lower offline RTF than baselines.
Significance. If the results hold under rigorous validation, this would be a notable contribution to practical zero-shot VC by showing that codec-space one-step conversion with targeted training can simultaneously deliver high fidelity, cross-lingual transfer, and low-latency streaming suitable for interactive use. The open release of code, checkpoints, and audio samples is a clear strength supporting reproducibility.
major comments (2)
- [Abstract and §4] The claim of achieving the 'best streaming WER in both English and Chinese' and 'strong speaker similarity' is load-bearing for the central experimental result, yet the manuscript provides no details on the exact baselines, data splits, number of evaluation samples, error bars, or statistical significance tests, preventing verification of the comparative superiority.
- [§3] Method (role-assignment and training): The dual-conditioning converter plus chunkwise inference is presented as enabling true zero-shot streaming on unseen targets, but the generated paired data with role assignment (standard/reconstruction/reversed) is neither validated against real acoustic distributions nor ablated for its effect on closing the train-inference gap, especially in cross-lingual settings and under low-latency chunked conditions; this assumption is central to the generalization claims.
minor comments (2)
- A dedicated table or figure summarizing exact WER, similarity scores, and RTF values with all baselines would improve clarity of the comparative results.
- The description of overlap smoothing in chunkwise inference could benefit from a short equation or pseudocode to make the alignment with codec segment training explicit (one plausible form is sketched below).
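For concreteness, one plausible form of that pseudocode: chunkwise conversion with a linear crossfade over the overlapping frames. The chunk and overlap sizes and the crossfade shape are assumptions; the paper only states that inference is chunkwise with overlap smoothing.

    import torch

    def chunkwise_convert(latents, convert_chunk, chunk=48, overlap=8):
        """Chunkwise streaming conversion with linear-crossfade smoothing.

        latents: (frames, dim) source codec latents; convert_chunk converts
        one chunk of latents. Parameter values are illustrative.
        """
        hop = chunk - overlap
        fade_in = torch.linspace(0.0, 1.0, overlap).unsqueeze(-1)
        out, prev_tail = [], None
        for start in range(0, latents.shape[0], hop):
            y = convert_chunk(latents[start:start + chunk])
            if prev_tail is not None and y.shape[0] >= overlap:
                # crossfade the new head against the previous chunk's tail
                y[:overlap] = fade_in * y[:overlap] + (1 - fade_in) * prev_tail
            prev_tail = y[hop:hop + overlap] if y.shape[0] > hop else None
            out.append(y[:hop])  # emit hop frames per chunk; total length preserved
        return torch.cat(out, dim=0)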
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments. We address each major point below and will revise the manuscript accordingly to improve transparency and validation of our claims.
Point-by-point responses
- Referee: [Abstract and §4] The claim of achieving the 'best streaming WER in both English and Chinese' and 'strong speaker similarity' is load-bearing for the central experimental result, yet the manuscript provides no details on the exact baselines, data splits, number of evaluation samples, error bars, or statistical significance tests, preventing verification of the comparative superiority.
Authors: We agree that the experimental section requires more precise documentation to support the reported superiority. In the revised manuscript, we will expand §4 with: the full list of baselines and their configurations; the exact data splits and number of evaluation samples from Seed-TTS-Eval (specifying counts for English and Chinese); standard deviations across multiple runs, presented as error bars; and results of statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests) on WER and speaker similarity metrics (a minimal sketch of such a test follows this list). These additions will enable direct verification of the claims. revision: yes
- Referee: [§3] Method (role-assignment and training): The dual-conditioning converter plus chunkwise inference is presented as enabling true zero-shot streaming on unseen targets, but the generated paired data with role assignment (standard/reconstruction/reversed) is neither validated against real acoustic distributions nor ablated for its effect on closing the train-inference gap, especially in cross-lingual settings and under low-latency chunked conditions; this assumption is central to the generalization claims.
Authors: We acknowledge the value of explicit validation for the role-assignment strategy. While zero-shot settings inherently lack real paired data for unseen speakers, making direct distributional comparisons difficult, the strategy is intended to mitigate train-inference mismatch through mode switching. In the revision, we will add an ablation study (in §4 or an appendix) quantifying the impact of each training mode on WER, speaker similarity, and cross-lingual performance, including results under different chunk sizes to address low-latency conditions. We will also include a brief analysis of how the generated pairs approximate real acoustic properties, based on reconstruction fidelity. revision: yes
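For the promised significance tests, a minimal sketch with SciPy; the per-utterance WER arrays below are placeholders for illustration, not reported numbers, and the two systems must be scored on the same evaluation items.

    import numpy as np
    from scipy import stats

    # placeholder per-utterance WERs for two systems on the SAME items;
    # real values would come from the evaluation runs promised above
    wer_xvc = np.array([0.021, 0.034, 0.018, 0.045, 0.027])
    wer_baseline = np.array([0.025, 0.039, 0.017, 0.052, 0.031])

    t_stat, p_t = stats.ttest_rel(wer_xvc, wer_baseline)  # paired t-test
    w_stat, p_w = stats.wilcoxon(wer_xvc, wer_baseline)   # Wilcoxon signed-rank
    print(f"paired t-test p={p_t:.3f}, Wilcoxon p={p_w:.3f}")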
Circularity Check
No circularity detected; claims rest on empirical validation
Full rationale
The paper presents X-VC as an architectural system (dual-conditioning converter in pretrained codec space, role-assignment training on generated pairs, chunkwise streaming inference) whose performance claims are supported by direct comparisons on Seed-TTS-Eval rather than any derivation that reduces to its own inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the described chain; the role-assignment strategy is a training heuristic whose effectiveness is measured externally, not assumed by construction. The work is self-contained against external benchmarks and pretrained components.
Axiom & Free-Parameter Ledger
free parameters (1)
- chunk size and overlap for streaming inference
axioms (1)
- Domain assumption: Pretrained neural codec latents preserve sufficient linguistic content for one-step conversion without explicit content modeling.