MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora

Chaoren Wang; Dekun Chen; Tao Feng; Xueyao Zhang; Xun Guan; Yuancheng Wang; Yuxiang Wang; Zhizheng Wu

arXiv: 2604.11552 · v2 · submitted 2026-04-13 · cs.SD · cs.CL

Pith reviewed 2026-05-10 14:52 UTC · model grok-4.3

keywords voice imitation · zero-shot · autoregressive modeling · speech synthesis · pseudo-parallel corpora · preference alignment · naturalness

The pith

MimicLM trains voice imitation models on synthetic sources paired with real recordings to reach higher naturalness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the shortage of real parallel speech data for voice imitation by flipping the usual training setup. Synthetic speech drives the input while real recordings serve as the desired output, so the model learns directly from real speech distributions instead of being limited by synthetic quality. Interleaved text-audio modeling keeps the linguistic content accurate, and preference alignment after training reduces mismatches from the synthetic inputs. A reader would care because voice imitation supports applications from dubbing to accessibility tools, and this method promises better results with less complex model designs. If the approach holds, it shows that high-quality zero-shot imitation becomes feasible without relying on rare real triplets or intricate disentanglement.

Core claim

MimicLM shows that an autoregressive model, when trained on pseudo-parallel data with synthetic speech as sources and real recordings as targets, can generate voice imitations that surpass prior methods in naturalness while holding competitive similarity in speaker identity, accent, and emotion.

What carries the argument

The central mechanism is the pseudo-parallel data construction that uses synthetic speech as sources and real recordings as targets, allowing direct learning from real distributions; this is augmented by interleaved text-audio modeling and preference alignment post-training.
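The data construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's four-stage pipeline: `Utterance`, `Triplet`, `synthesize`, and `build_triplet` are hypothetical names, and `synthesize` is a stub standing in for whatever external system produces the synthetic speech. The only property taken from the text is the direction of the pairing: the source is synthetic, while the reference and target are real recordings.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    speaker: str
    text: str
    audio: str  # placeholder for a waveform or file path
    is_synthetic: bool = False


@dataclass(frozen=True)
class Triplet:
    source: Utterance     # synthetic, same content as target
    reference: Utterance  # real, same speaker as target
    target: Utterance     # real recording used as the training target


def synthesize(text: str, voice: str) -> Utterance:
    """Hypothetical stand-in for an external synthesis system."""
    return Utterance(voice, text, f"synth({voice}): {text}", is_synthetic=True)


def build_triplet(corpus: dict[str, list[Utterance]], rng: random.Random) -> Triplet:
    # The target speaker needs two real utterances: one target, one reference.
    eligible = sorted(s for s, utts in corpus.items() if len(utts) >= 2)
    target_spk = rng.choice(eligible)
    # The synthetic source is rendered in a different speaker's voice.
    source_voice = rng.choice(sorted(s for s in corpus if s != target_spk))
    target, reference = rng.sample(corpus[target_spk], 2)
    source = synthesize(target.text, voice=source_voice)
    return Triplet(source, reference, target)


# Toy corpus with made-up speakers and transcripts.
corpus = {
    "spk_a": [Utterance("spk_a", "good morning", "a1.wav"),
              Utterance("spk_a", "see you soon", "a2.wav")],
    "spk_b": [Utterance("spk_b", "thank you", "b1.wav"),
              Utterance("spk_b", "how are you", "b2.wav")],
}
triplet = build_triplet(corpus, random.Random(0))
```

Because the target is always a real recording, the loss is computed against real speech even though the model's input is synthetic, which is the property the premise below depends on.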

Load-bearing premise

That training with synthetic speech as inputs but real speech as targets lets the model learn real speech distributions and surpass the quality limits of fully synthetic training.

What would settle it

A side-by-side human listening test on naturalness ratings for imitations of unseen speakers, where MimicLM must score higher than baselines using synthetic targets; failure to do so would undermine the central claim.
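One way such a side-by-side comparison might be scored is a paired bootstrap over per-item ratings. The sketch below is an assumption about analysis method, not something the paper is stated to use; the function name `paired_bootstrap` and the ratings are made up for illustration and are not results from the paper.

```python
import random


def paired_bootstrap(ratings_a, ratings_b, n_resamples=10_000, seed=0):
    """Compare two systems rated on the same items.

    Returns the mean paired difference (a - b) and the fraction of
    bootstrap resamples in which system A's total is not above B's.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("ratings must be paired item by item")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample items with replacement and sum the paired differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return sum(diffs) / n, not_better / n_resamples


# Toy 1-5 naturalness ratings on the same eight items (illustrative only).
system_a = [4, 5, 4, 4, 5, 4, 5, 4]
system_b = [3, 4, 3, 3, 4, 3, 4, 3]
mean_diff, frac_not_better = paired_bootstrap(system_a, system_b)
```

A small `frac_not_better` would support the claim that system A is rated as more natural; a value near 0.5 or above would not.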

Figures

Figures reproduced from arXiv: 2604.11552 by Chaoren Wang, Dekun Chen, Tao Feng, Xueyao Zhang, Xun Guan, Yuancheng Wang, Yuxiang Wang, Zhizheng Wu.

**Figure 2.** Four-stage pipeline for pseudo-parallel data construction. We randomly sample two speakers with their ...

**Figure 3.** Impact of training data scale on WER and ...

Original abstract

Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but target matches the reference's voice characteristics, yet such data is extremely scarce. Existing approaches either employ carefully designed disentanglement architectures to bypass this data scarcity or leverage external systems to synthesize pseudo-parallel training data. However, the former requires intricate model design, and the latter faces a quality ceiling when synthetic speech is used as training targets. To address these limitations, we propose MimicLM, which takes a novel approach by using synthetic speech as training sources while retaining real recordings as targets. This design enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling. Building on this data construction approach, we incorporate interleaved text-audio modeling to guide the generation of content-accurate speech and apply post-training with preference alignment to mitigate the inherent distributional mismatch when training on synthetic data. Experiments demonstrate that MimicLM achieves superior voice imitation quality with a simple yet effective architecture, significantly outperforming existing methods in naturalness while maintaining competitive similarity scores across speaker identity, accent, and emotion dimensions.
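The abstract does not spell out how text and audio are interleaved. One plausible reading, sketched here purely as an assumption, is that fixed-size chunks of text tokens alternate with chunks of audio tokens in a single autoregressive sequence, so each audio chunk is generated shortly after the text it should realize. The function name `interleave` and the chunk sizes are hypothetical.

```python
from itertools import islice


def interleave(text_tokens, audio_tokens, text_chunk=2, audio_chunk=3):
    """Alternate chunks of text and audio tokens in one sequence.

    Once one stream runs out, the remaining chunks of the other stream
    are emitted on their own. Each item is tagged with its modality.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    sequence = []
    while True:
        t = list(islice(text_it, text_chunk))
        a = list(islice(audio_it, audio_chunk))
        if not t and not a:
            break
        sequence.extend(("text", tok) for tok in t)
        sequence.extend(("audio", tok) for tok in a)
    return sequence


# Four text tokens and nine audio codec tokens (toy values).
seq = interleave(["hel", "lo", "wor", "ld"], list(range(9)))
kinds = [kind for kind, _ in seq]
```

With these toy inputs the sequence carries two text tokens, three audio tokens, two more text tokens, and then the remaining six audio tokens, with the order within each stream preserved.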

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MimicLM flips synthetic sources with real targets to chase better naturalness in zero-shot imitation, but the domain shift at inference remains the open question.

The central move here is training an autoregressive model on synthetic speech as the source and real recordings as the target, then using interleaved text-audio tokens plus post-training preference alignment to handle the mismatch. This reverses the usual pseudo-parallel setup, in which synthetic speech serves as the training target and imposes a quality ceiling. The architecture stays straightforward, with no heavy disentanglement modules, so that part is clean and easy to follow.
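The text above says only "preference alignment" and does not name an objective. As an illustration, Direct Preference Optimization (DPO) is one widely used choice; whether MimicLM uses it, and how its preference pairs are built, are assumptions here. The sketch computes the per-pair DPO loss from sequence log-probabilities under the trained policy and a frozen reference model; `dpo_loss` and `beta=0.1` are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(margin)).

    margin = beta * [(policy - reference) log-ratio of the chosen sample
                     minus the same log-ratio of the rejected sample]
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has shifted probability toward the chosen sample.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss starts at log 2 when the policy matches the reference and falls as the policy raises the chosen sample's likelihood relative to the rejected one.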

Referee Report

2 major / 1 minor

Summary. The paper proposes MimicLM, an autoregressive model for zero-shot voice imitation that constructs pseudo-parallel training data by pairing synthetic speech as sources with real recordings as targets. This is intended to allow the model to learn directly from real speech distributions and break the quality ceiling of synthetic targets. The method adds interleaved text-audio modeling for content preservation and post-training preference alignment to mitigate source-target mismatch. Experiments are reported to demonstrate superior naturalness with competitive similarity across speaker identity, accent, and emotion.

Significance. If the generalization from synthetic-to-real training to real-to-real inference holds and is properly validated, the approach could simplify voice-imitation architectures by avoiding complex disentanglement while improving output quality, with potential impact on zero-shot TTS and voice conversion applications.

major comments (2)

[Abstract] Abstract: the central claim that the synthetic-source/real-target construction 'enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling' is load-bearing, yet the abstract provides no quantitative metrics, baseline details, statistical tests, or ablation results to support the reported superiority in naturalness.
[Method (data construction and preference alignment)] The data-construction approach (synthetic sources paired with real targets) and the mitigation via preference alignment: the potential domain shift in acoustic fidelity, prosody variance, and artifact patterns between synthetic and real sources is acknowledged but not shown to be resolved for real-source inputs at inference; without an ablation isolating the effect of alignment on real sources, the naturalness gains cannot be attributed to the claimed mechanism rather than architecture or scale.

minor comments (1)

[Abstract] Clarify the exact form of the interleaved text-audio modeling and how it is interleaved during autoregressive generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that strengthening the abstract and providing clearer validation of the training approach will improve the manuscript. Below we respond point-by-point to the major comments and describe the revisions we will implement.

Referee: [Abstract] Abstract: the central claim that the synthetic-source/real-target construction 'enables the model to learn directly from real speech distributions, breaking the synthetic quality ceiling' is load-bearing, yet the abstract provides no quantitative metrics, baseline details, statistical tests, or ablation results to support the reported superiority in naturalness.

Authors: We agree that the abstract would be strengthened by including key quantitative support for the central claim. The full paper reports MOS naturalness scores, similarity metrics across identity/accent/emotion, and ablations in Section 4, with MimicLM showing statistically significant naturalness gains over baselines. In the revised manuscript we will update the abstract to concisely include representative metrics (e.g., naturalness MOS improvement and baseline comparisons) and note the presence of supporting ablations and statistical tests, while keeping the abstract within length limits. revision: yes
Referee: [Method (data construction and preference alignment)] The data-construction approach (synthetic sources paired with real targets) and the mitigation via preference alignment: the potential domain shift in acoustic fidelity, prosody variance, and artifact patterns between synthetic and real sources is acknowledged but not shown to be resolved for real-source inputs at inference; without an ablation isolating the effect of alignment on real sources, the naturalness gains cannot be attributed to the claimed mechanism rather than architecture or scale.

Authors: We acknowledge the importance of explicitly demonstrating that the synthetic-to-real training generalizes to real-source inference and isolating the contribution of preference alignment. Our current evaluation uses real sources at inference and reports superior naturalness, consistent with the claimed mechanism. To address the concern directly, we will add a targeted ablation in the revised paper comparing models trained with and without preference alignment, evaluated on real source inputs. We will also expand the discussion of domain-shift mitigation with additional analysis of acoustic and prosodic characteristics. These additions will allow clearer attribution of gains to the data-construction and alignment approach rather than scale or architecture alone. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a data-construction strategy (synthetic sources paired with real targets) plus interleaved text-audio modeling and preference alignment as mitigations. No equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled via prior work appear in the abstract or described method. Experimental superiority claims rest on external comparisons rather than reducing to definitional equivalence or self-referential fits. The derivation is therefore self-contained against the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new postulated entities are described in the abstract; the work relies on standard autoregressive language-model training and existing preference-alignment techniques.
