Voice "Cloning" is Style Transfer

Anna Pot; Federico Bianchi; James Zou; Kaitlyn Zhou; Martijn Bartelds; Yongchan Kwon

arXiv: 2605.16578 · v2 · pith:IG5772UInew · submitted 2026-05-15 · 💻 cs.SD · cs.AI · cs.HC · cs.LG

Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds, Anna Pot, Yongchan Kwon, James Zou

Pith reviewed 2026-05-21 08:16 UTC · model grok-4.3

keywords: voice cloning · style transfer · human perception · trust · homogenization · speech synthesis · audio embeddings

The pith

Voice cloning models apply style transfer rather than faithfully copying source voices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that what is called voice cloning actually involves a systematic style transfer that alters how voices are perceived. Human annotators consistently rate the cloned versions as more authoritative, warm, and customer-service oriented than the original recordings. These changes also lead listeners to report higher trust in the cloned voices and greater willingness to share sensitive information. The process further reduces diversity by making accents, speaking rates, and audio features more uniform across different speakers.

Core claim

Voice cloning does not faithfully clone an individual's voice. Instead, widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space.
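
One way to make the homogenization claim concrete is to compare how widely speaker embeddings spread before and after cloning. The sketch below is a minimal illustration, not the authors' analysis code. It assumes embeddings are plain lists of floats and uses mean squared distance to the centroid as the spread measure (an assumption; the paper's exact statistic is not given here). The vectors are made-up toy values, and the names `centroid` and `dispersion` are hypothetical.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def dispersion(vectors):
    """Mean squared Euclidean distance from each vector to the centroid.

    This equals the trace of the population covariance matrix, so a smaller
    value means the speakers sit closer together in embedding space.
    """
    center = centroid(vectors)
    squared_distances = [
        sum((x - m) ** 2 for x, m in zip(vector, center)) for vector in vectors
    ]
    return fmean(squared_distances)


# Toy 2-D "embeddings" (invented numbers, not data from the paper).
source = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
cloned = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0]]

ratio = dispersion(cloned) / dispersion(source)
print(f"dispersion ratio (cloned / source): {ratio:.2f}")
```

A ratio below 1 would be consistent with homogenization under this particular measure; a ratio at or above 1 would not.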

What carries the argument

Systematic style transfer by voice cloning models that shifts perceived traits like authority and warmth while reducing variance in speaker features.

If this is right

Applications such as language dubbing or preserving voices for those with speech loss may unintentionally change how the speaker is perceived by listeners.
Cloned voices could increase user trust and disclosure of personal data in customer service or interactive systems.
Widespread use of voice cloning would reduce the variety of accents and speaking styles in generated speech.
The technology introduces risks by making artificial voices seem more human-like and trustworthy than intended.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar unintended style shifts might occur in other AI generation tasks like text or image synthesis from personal data.
Testing cloning models on a wider range of source voices from different demographics could reveal if the style transfer is universal or context-specific.
Designers might need to add controls to preserve original traits if identity fidelity is the goal.
Over time, this could influence societal expectations of how normal voices sound if cloned ones dominate media.

Load-bearing premise

The selected voice cloning models and the human annotation setup accurately represent common deployed systems and typical listener perceptions of voice qualities.

What would settle it

Observing that cloned voices from a popular model receive the same ratings as source voices on authority, warmth, trust levels, and information disclosure willingness, or show increased rather than decreased variance in traits.
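
The test described above amounts to a paired comparison: each source clip is rated against its own clone. A minimal sketch under stated assumptions follows. The ratings are invented toy numbers, the percentile bootstrap is one common choice rather than the paper's reported method, and `paired_bootstrap_ci` is a hypothetical helper.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(source, cloned, n_boot=2000, seed=0):
    """Mean of (cloned - source) with a 95% percentile bootstrap interval."""
    if len(source) != len(cloned):
        raise ValueError("source and cloned must be paired, equal-length lists")
    diffs = [c - s for s, c in zip(source, cloned)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample the paired differences with replacement and sort the means.
    boot_means = sorted(
        fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return fmean(diffs), lower, upper


# Toy 1-5 trust ratings for six source/clone pairs (illustrative only).
source_trust = [3, 4, 2, 5, 3, 4]
cloned_trust = [4, 5, 3, 5, 4, 5]

mean_diff, lower, upper = paired_bootstrap_ci(source_trust, cloned_trust)
print(f"mean difference = {mean_diff:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that contains 0 would be in line with the "same ratings" outcome; an interval entirely above 0 would be in line with the paper's claim for that trait.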

Figures

Figures reproduced from arXiv: 2605.16578 by Anna Pot, Federico Bianchi, James Zou, Kaitlyn Zhou, Martijn Bartelds, Yongchan Kwon.

**Figure 1.** Study pipeline. We collect audio data from n=86 non-native English speakers, which we use as reference audio for voice cloning on three models (ELEVENLABS, COQUI-XTTS, and CHATTERBOX). Each source recording is paired with its cloned counterpart and presented in a randomized order to n=177 annotators, whose ratings we analyze to characterize listener perception and self-reported behavioral responses.

**Figure 2.** Illustration of cross-sentence voice cloning. (§3.2, Voice Cloning) We evaluate three widely used TTS models: two open-source (ChatterBox, Coqui-XTTS) and one state-of-the-art proprietary model (ElevenLabs V3). Open-source models were selected to reduce privacy risks by enabling greater control over speaker data, while ElevenLabs was included as a leading proprietary system that provides mechanisms for data rem…

**Figure 3.** Rating differences between cloned and source voices across all three models tested [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Shifts in classified accent after voice cloning. Sankey diagrams show how source accent … [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Changes to cloned audio across 50 rounds of repeated cloning with … [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 7.** Screenshot of speaker task. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png]

**Figure 8.** ElevenLabs Privacy Terms. Shared anonymously online via a public research dataset that cannot be used for commercial purposes (explicit guidelines below). Forbidden Uses of the public dataset include: • Generating, enabling, or promoting hate speech, harassment, discrimination, misinformation, or culturally offensive or harmful content • Beyond explicit research purposes, voice cloning, speaker impersonatio…

**Figure 9.** Comparison of cloning with long vs. short source clips (37 seconds versus 5 seconds). Long … [PITH_FULL_IMAGE:figures/full_fig_p017_9.png]

**Figure 10.** PCA projections of Chatterbox acoustic embeddings under different styles. Across … [PITH_FULL_IMAGE:figures/full_fig_p017_10.png]

**Figure 11.** Human annotations on ElevenLabs clones under "low expressiveness" … [PITH_FULL_IMAGE:figures/full_fig_p018_11.png]

**Figure 12.** Rating differences between cloned and source voices by model. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png]

**Figure 13.** Rating differences between cloned and source voices by speaker sex. [PITH_FULL_IMAGE:figures/full_fig_p020_13.png]

**Figure 14.** Screenshot of annotation task. [PITH_FULL_IMAGE:figures/full_fig_p021_14.png]

**Figure 15.** Change in entropy (in nats) for duration distribution between source and cloned … [PITH_FULL_IMAGE:figures/full_fig_p022_15.png]

**Figure 16.** Change in predicted emotion across the 50 iterative rounds of cloning, visualized with 95% … [PITH_FULL_IMAGE:figures/full_fig_p022_16.png]

**Figure 17.** Probability distribution on incorrect speakers for source (top) and cloned (bottom) … [PITH_FULL_IMAGE:figures/full_fig_p022_17.png]
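
The Figure 15 caption reports a change in entropy, in nats, for the duration distribution. As a rough illustration of what such a number measures, the sketch below bins clip durations and computes Shannon entropy with the natural logarithm. The bin width, the toy durations, and the helper name `entropy_nats` are all assumptions made for this example; the paper's own binning and data are not shown here.

```python
import math
from collections import Counter


def entropy_nats(values, bin_width):
    """Shannon entropy of binned values, using the natural log (units: nats)."""
    # Assign each value to a fixed-width bin and count bin occupancy.
    counts = Counter(math.floor(value / bin_width) for value in values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy clip durations in seconds (invented, not data from the paper).
source_durations = [2.1, 3.4, 4.2, 5.7, 2.6, 3.9, 4.8, 5.1]
cloned_durations = [3.2, 3.5, 4.1, 4.4, 3.8, 4.6, 3.1, 4.9]

change = entropy_nats(cloned_durations, 1.0) - entropy_nats(source_durations, 1.0)
print(f"change in duration entropy: {change:.3f} nats")
```

In this toy case the source durations fill four one-second bins evenly and the cloned durations fill only two, so the entropy drops by ln 2. A negative change of this kind is what a more uniform speaking rate would look like under this measure.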

Abstract

Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully "clone" an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Voice cloning shifts outputs toward more authoritative and trusted styles per human ratings, with homogenization in embeddings, but stimulus controls are the key open question.

The main thing here is that voice cloning models appear to push source voices toward a more polished style, with higher ratings for authority, warmth, and human-likeness, while also making different speakers sound more alike in accent, rate, and embedding space. Listeners report more trust and willingness to share personal info with the cloned versions. This is presented as a new observation rather than a restatement of existing cloning papers.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that voice cloning does not faithfully reproduce source voices but instead applies systematic style transfer. Human annotators rate cloned outputs as more authoritative, warm, customer-service-like, and human-like than sources, with higher reported trust and willingness to disclose personal information. The work also reports homogenization, shown by reduced variance in accent, speaking rate, and audio embedding space across cloned samples.

Significance. If the central empirical findings hold after addressing controls, the paper would usefully document unintended perceptual biases in deployed voice cloning systems and their potential effects on user trust and behavior. The combination of human ratings with embedding-based variance measurements supplies a concrete, falsifiable basis for the style-transfer interpretation.

major comments (2)

§3.1 (Stimulus Preparation): the manuscript does not describe normalization of audio level, background noise, or microphone characteristics between source recordings and cloned outputs. Without these controls, elevated ratings for authority, warmth, and trust (reported in §4.1) cannot be unambiguously attributed to model-driven style transfer rather than acoustic artifacts.
§4.2 (Human Annotation Protocol): no inter-rater reliability metric (e.g., Fleiss' kappa or ICC) or blinding procedure is reported. Because the central claim rests on systematic perceptual differences, the absence of these statistics leaves the statistical robustness of the trait shifts open to question.
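
For readers unfamiliar with the reliability statistic named in the second comment, Fleiss' kappa can be computed from a table of category counts per rated item. The sketch below is a plain-Python version of the standard formula, shown for illustration only. The toy table and the function name are assumptions, and the metric treats rating levels as unordered categories, which may or may not suit the paper's rating scales.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least 2)."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [sum(col) / (n_items * n_raters) for col in zip(*table)]
    p_expected = sum(p * p for p in shares)

    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: 2 items, 3 raters, 2 categories (invented counts).
print(fleiss_kappa([[3, 0], [0, 3]]))  # full agreement
print(fleiss_kappa([[2, 1], [1, 2]]))  # less agreement than chance predicts
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would predict.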

minor comments (2)

Table 1: the column headers for model variants are not fully aligned with the text description in §3.2, making it difficult to map which exact systems produced the reported homogenization statistics.
Figure 10: axis labels on the embedding PCA plot are too small for print readability; increasing font size would improve clarity without altering content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have prepared revisions to improve methodological transparency and statistical reporting.

Referee: §3.1 (Stimulus Preparation): the manuscript does not describe normalization of audio level, background noise, or microphone characteristics between source recordings and cloned outputs. Without these controls, elevated ratings for authority, warmth, and trust (reported in §4.1) cannot be unambiguously attributed to model-driven style transfer rather than acoustic artifacts.

Authors: We agree that explicit documentation of acoustic controls is necessary to support the attribution of perceptual differences to style transfer. The original manuscript omitted these details. In the revised version we will expand §3.1 to describe the stimulus preparation pipeline, including RMS-based level normalization applied to all clips, use of quiet recording environments for source audio, and consistent synthesis parameters for cloned outputs. We will also acknowledge any remaining limitations in microphone matching between sources and clones. These additions will allow readers to assess the controls directly.
Revision: yes

Referee: §4.2 (Human Annotation Protocol): no inter-rater reliability metric (e.g., Fleiss' kappa or ICC) or blinding procedure is reported. Because the central claim rests on systematic perceptual differences, the absence of these statistics leaves the statistical robustness of the trait shifts open to question.

Authors: We acknowledge that reporting inter-rater reliability and blinding procedures strengthens the credibility of the human evaluation results. Although the annotation interface presented samples without source/clone labels, these elements were not quantified in the submitted manuscript. In the revision we will add Fleiss' kappa for the trait ratings and ICC for the continuous scales to §4.2, together with an explicit statement of the blinding procedure used during data collection.
Revision: yes
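
The RMS-based level normalization mentioned in the first response can be sketched as follows. This is an illustrative version only, assuming audio is a list of float samples in the range -1.0 to 1.0; the target level of 0.1, the peak guard, and the helper names `rms` and `normalize_rms` are choices made for this example and are not taken from the manuscript.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1, peak_limit=1.0):
    """Scale samples so their RMS matches target_rms without clipping."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silent clip: nothing to scale
    gain = target_rms / current
    # Cap the gain so the loudest sample never exceeds the peak limit.
    peak = max(abs(s) for s in samples)
    gain = min(gain, peak_limit / peak)
    return [s * gain for s in samples]


# Toy clips at very different levels (invented values).
quiet_source = [0.01, -0.01, 0.01, -0.01]
loud_clone = [0.5, -0.5, 0.5, -0.5]

print(round(rms(normalize_rms(quiet_source)), 6))
print(round(rms(normalize_rms(loud_clone)), 6))
```

After normalization both toy clips sit at the same RMS level, so a level difference alone could not explain a rating difference between them. Matching level does not address background noise or microphone coloration, which the referee raises separately.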

Circularity Check

0 steps flagged

Empirical measurement study with no derivation chain or self-referential reductions

The paper reports results from applying publicly available voice cloning models to source audio and collecting human ratings on traits including authority, warmth, customer-service orientation, human-likeness, trust, and willingness to disclose information. It additionally measures homogenization via reduced variance in accent, speaking rate, and embedding space. These outcomes are obtained directly from the annotation protocol and standard embedding computations; no equations, fitted parameters, predictions, or self-citations are invoked to derive the central claims by construction. The work is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from human evaluations and audio embeddings of outputs from existing voice-cloning models; no free parameters or invented entities are introduced, and the single axiom is a domain assumption about the validity of human ratings.

axioms (1)

domain assumption: Human annotators' ratings of authority, warmth, and trust accurately reflect systematic differences introduced by the cloning process.
The perceptual claims depend on the validity of subjective human judgments as a measurement instrument.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like... reduced variance in accent, speaking rate, and the audio embedding space"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "iterative cloning... directional drift in audio embedding space... radii of the approximate bounding sphere going from 366 to 336"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page doi:10.1145/3715070.3749244 2025

[21] [21]

and McMahan, Ryan P

Do, Tiffany D. and McMahan, Ryan P. and Wisniewski, Pamela J. , title =. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , articleno =. 2022 , isbn =. doi:10.1145/3491102.3517564 , abstract =

work page doi:10.1145/3491102.3517564 2022

[22] [22]

Plos one , volume=

Warning: Humans cannot reliably detect speech deepfakes , author=. Plos one , volume=. 2023 , publisher=

work page 2023

[23] [23]

2024 , isbn =

El Ali, Abdallah and Venkatraj, Karthikeya Puttur and Morosoli, Sophie and Naudts, Laurens and Helberger, Natali and Cesar, Pablo , title =. 2024 , isbn =. doi:10.1145/3613905.3650750 , booktitle =

work page doi:10.1145/3613905.3650750 2024

[24] [24]

PLoS One , volume=

Voice clones sound realistic but not (yet) hyperrealistic , author=. PLoS One , volume=. 2025 , publisher=

work page 2025

[25] [25]

2025 , isbn =

R Chavan, Durwa and Moon, Prachi and Dixon, Emma , title =. 2025 , isbn =. doi:10.1145/3663547.3759720 , booktitle =

work page doi:10.1145/3663547.3759720 2025

[26] [26]

Trends in cognitive sciences , volume=

Universal dimensions of social cognition: Warmth and competence , author=. Trends in cognitive sciences , volume=. 2007 , publisher=

work page 2007

[27] [27]

Number of contact center employees in the United States from 2014 to 2024 , year =

work page 2014

[28] [28]

Gartner Reveals Three Technologies That Will Transform Customer Service and Support By 2028 , year =

work page 2028

[29] [29]

Global Call Centers Market to Reach \ 494.7 Billion by 2030 , year =

work page 2030

[30] [30]

Artificial Intelligence in Emergency Communications Centers , year =

work page

[31] [31]

The future of mobility: how Curb delivers the promised ride with help from Twilio , year =

work page

[32] [32]

Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages=

Labor, Power, and Belonging: The Work of Voice in the Age of AI Reproduction , author=. Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency , pages=

work page 2025

[33] [33]

Nature Machine Intelligence , volume=

AI-generated characters for supporting personalized learning and well-being , author=. Nature Machine Intelligence , volume=. 2021 , publisher=

work page 2021

[34] [34]

Nature , volume=

An instantaneous voice-synthesis neuroprosthesis , author=. Nature , volume=. 2025 , publisher=

work page 2025

[35] [35]

Interspeech , year=

Commonaccent: Exploring large acoustic pretrained models for accent classification based on common voice , author=. Interspeech , year=

work page

[36] [36]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023

[37] [37]

2025 , eprint=

Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars , author=. 2025 , eprint=

work page 2025

[38] [38]

The New York Times , year =

South Korea Uses AI to Help Seniors with Dementia , author =. The New York Times , year =

work page

[39] [39]

1987 , publisher =

The Social Construction of Technological Systems: New Directions in the Sociology and History of Technology , editor =. 1987 , publisher =

work page 1987

[40] [40]

Computer ethics , pages=

Do artifacts have politics? , author=. Computer ethics , pages=. 2017 , publisher=

work page 2017

[41] [41]

Proceedings of the conference on fairness, accountability, and transparency , pages=

Fairness and abstraction in sociotechnical systems , author=. Proceedings of the conference on fairness, accountability, and transparency , pages=

work page

[42] [42]

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages =

Hutiri, Wiebke and Papakyriakopoulos, Orestis and Xiang, Alice , title =. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages =. 2024 , isbn =. doi:10.1145/3630106.3658911 , abstract =

work page doi:10.1145/3630106.3658911 2024

[43] [43]

Singapore Journal of Legal Studies , year =

Vocal Identity Under Siege by AI Voice Cloning Technologies , author =. Singapore Journal of Legal Studies , year =

work page

[44] [44]

Philosophy & Technology , volume=

Look Who’s Talking: Voice cloning as tension point between identity and data , author=. Philosophy & Technology , volume=. 2025 , publisher=

work page 2025

[45] [45]

Philosophy & Technology , volume=

Simulating Voice and the Simulacra of Voice Clones , author=. Philosophy & Technology , volume=. 2026 , publisher=

work page 2026

[46] [46]

Philosophy & Technology , volume=

The role of the voice for identity and implications for voice cloning technology , author=. Philosophy & Technology , volume=. 2025 , publisher=

work page 2025

[47] [47]

AIES , year=

Sound check: Auditing audio datasets , author=. AIES , year=

work page

[48] [48]

Nature , volume=

AI models collapse when trained on recursively generated data , author=. Nature , volume=. 2024 , publisher=

work page 2024

[49] [49]

Self-Consuming Generative Models Go

Alemohammad, Sina and Casco-Rodriguez, Josue and Luzi, Lorenzo and Humayun, Ahmed Imtiaz and Babaei, Hossein and LeJeune, Daniel and Siahkoohi, Ali and Baraniuk, Richard , booktitle =. Self-Consuming Generative Models Go

work page

[50] [50]

Synthetic Data’s Transformative Role in Foundational Speech Models , year=

Generating data with text-to-speech and large-language models for conversational speech recognition , author=. Synthetic Data’s Transformative Role in Foundational Speech Models , year=

work page

[51] [51]

arXiv preprint arXiv:2412.01078 , year=

Advancing speech language models by scaling supervised fine-tuning with over 60,000 hours of synthetic speech dialogue data , author=. arXiv preprint arXiv:2412.01078 , year=

work page arXiv

[52] [52]

2022 international conference on decision aid sciences and applications (DASA) , pages=

An overview of automatic speech recognition preprocessing techniques , author=. 2022 international conference on decision aid sciences and applications (DASA) , pages=. 2022 , organization=

work page 2022

[53] [53]

International Journal of Signal Processing , volume=

On preprocessing of speech signals , author=. International Journal of Signal Processing , volume=

work page

[54] [54]

Computers in Human Behavior: Artificial Humans , volume=

Learning through AI-clones: Enhancing self-perception and presentation performance , author=. Computers in Human Behavior: Artificial Humans , volume=. 2025 , publisher=

work page 2025

[55] [55]

2026 , isbn =

Mogi, Yamato and Akahori, Wataru and Yamashita, Naomi , title =. 2026 , isbn =. doi:10.1145/3772318.3790546 , articleno =

work page doi:10.1145/3772318.3790546 2026

[56] [56]

2026 , isbn =

Park, Minju and Lee, Seunghyun and Ma, Juhwan and Yoon, Dongwook , title =. 2026 , isbn =. doi:10.1145/3772318.3790266 , booktitle =

work page doi:10.1145/3772318.3790266 2026

[57] [57]

The Journal of the Acoustical Society of America , volume=

Physiologic and acoustic differences between male and female voices , author=. The Journal of the Acoustical Society of America , volume=. 1989 , publisher=

work page 1989

[58] [58]

The Journal of the Acoustical Society of America , volume=

Discrimination of speaker sex and size when glottal-pulse rate and vocal-tract length are controlled , author=. The Journal of the Acoustical Society of America , volume=. 2007 , publisher=

work page 2007

[59] [59]

Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris , journal=

work page

[60] [60]

2015 , publisher=

Introducing global englishes , author=. 2015 , publisher=

work page 2015

[61] [61]

English in the world: Teaching and learning the language and literatures/Cambridge UP , year=

Standards, codification and sociolinguistic realism: The English language in the outer circle , author=. English in the world: Teaching and learning the language and literatures/Cambridge UP , year=

work page

[62] [62]

and McVicar, Matt and Battenberg, Eric and Nieto, Oriol , title =

McFee, Brian and Raffel, Colin and Liang, Dawen and Ellis, Daniel P.W. and McVicar, Matt and Battenberg, Eric and Nieto, Oriol , title =. SciPy 2015 , year =. doi:10.25080/Majora-7b98e3ed-003 , url =

work page doi:10.25080/majora-7b98e3ed-003 2015

[63] [63]

2025 , howpublished =

work page 2025

[64] [64]

doi:10.5281/zenodo.6334862 , license =

Eren, Gölge and. doi:10.5281/zenodo.6334862 , license =

work page doi:10.5281/zenodo.6334862