pith. machine review for the scientific record.

arxiv: 2604.24794 · v1 · submitted 2026-04-25 · 💻 cs.CR · cs.AI · cs.CY · cs.ET · cs.HC


V.O.I.C.E (Voice, Ownership, Identity, Control, Expression): Risk Taxonomy of Synthetic Voice Generation From Empirical Data


Pith reviewed 2026-05-08 07:42 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CY · cs.ET · cs.HC
keywords synthetic voice generation · risk taxonomy · voice synthesis risks · AI incident analysis · voice data privacy · threat modeling · empirical risk assessment

The pith

The V.O.I.C.E taxonomy classifies risks from synthetic voice generation by drawing on empirical incidents, reports, and discussions to show how harms emerge and interact with exposure, visibility, and legal protections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative voice models introduce privacy, security, and governance risks that uniform threat models do not capture well. This paper presents the V.O.I.C.E taxonomy, built from 569 incidents in major incident databases, 1,067 direct reports from diverse U.S. participants, and 2,221 Reddit discussions. It models how risks arise and differ according to a person's degree of exposure, social visibility, and available legal protections. This matters because unconsented reuse of voice data now affects voice actors, public figures, and ordinary people in distinct ways. The taxonomy supplies a grounded structure for anticipating and addressing these harms.

Core claim

The central claim is that a taxonomy named V.O.I.C.E (Voice, Ownership, Identity, Control, Expression), derived from multi-source real-world data, explicitly models the emergence of synthetic voice risks and their interactions with contextual factors such as degree of exposure, social visibility, and the availability of legal protections across affected groups, thereby filling gaps in existing uniform threat models.

What carries the argument

The V.O.I.C.E taxonomy, which stands for Voice, Ownership, Identity, Control, Expression, functions as the organizing framework that categorizes risks and traces their dependence on contextual factors like exposure and legal protections.
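As a concrete illustration of that framework shape, each coded item can be thought of as a risk category paired with the contextual factors it interacts with. The sketch below uses hypothetical field names, not the paper's actual coding schema:

```python
from dataclasses import dataclass
from collections import Counter

# The five V.O.I.C.E categories named in the paper.
CATEGORIES = ("Voice", "Ownership", "Identity", "Control", "Expression")

@dataclass(frozen=True)
class Incident:
    """One coded incident. Field names are illustrative only."""
    description: str
    category: str           # one of CATEGORIES
    exposure: str           # degree of voice-data exposure, e.g. "high"/"low"
    visibility: str         # social visibility of the affected person
    legal_protection: bool  # whether applicable legal protections exist

def tally_by_context(incidents):
    """Cross-tabulate risk categories against one contextual factor."""
    return Counter((i.category, i.legal_protection) for i in incidents)

incidents = [
    Incident("cloned voice used in a scam call", "Identity", "low", "low", False),
    Incident("voice actor's samples reused without consent", "Ownership", "high", "high", True),
    Incident("cloned voice used in a scam call", "Identity", "low", "low", False),
]
print(tally_by_context(incidents))
# → Counter({('Identity', False): 2, ('Ownership', True): 1})
```

Cross-tabulations like this are how a taxonomy's claimed interactions (e.g., legal protections differing between voice actors and the general public) would surface in the underlying data.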

Load-bearing premise

The collected incidents, reports, and discussions are representative enough to produce a comprehensive taxonomy that captures all major risk interactions with exposure, visibility, and legal protections.

What would settle it

Discovery of a major synthetic voice risk scenario that cannot be placed in any V.O.I.C.E category or that does not interact with exposure, visibility, or legal protections in the manner the taxonomy predicts.

Figures

Figures reproduced from arXiv: 2604.24794 by Ahmed Adnan, Anish Krishnagiri, Lili Dudas, Tanusree Sharma, Visar Berisha.

Figure 1
Figure 1. Three-tier taxonomy of Voice Generation & Synthesis risks, visualizing the hierarchy. (view at source ↗)
Original abstract

As generative voice models rapidly advance in both capability and public use, the unconsented collection, reuse, and synthesis of voice data are introducing new classes of privacy, security, and governance risk that are poorly captured by existing, largely uniform threat models. To fill the gap, we present V.O.I.C.E, a taxonomy of voice generation risk grounded in a multi-source threat modeling effort with 569 incidents from major AI incident databases, the FTC, and the Internet Crime Complaint Center (IC3); 1,067 direct incident reports from U.S.-based participants across diverse groups (including voice actors, internet personalities, political personnel, and the general public); and 2,221 Reddit discussions. Grounded in real-world data, our taxonomy explicitly models how risks emerge and interact with contextual factors such as degree of exposure, social visibility, and the availability of legal protections for various affected groups.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces V.O.I.C.E, a five-category taxonomy (Voice, Ownership, Identity, Control, Expression) for risks arising from synthetic voice generation. It is constructed from a multi-source empirical effort comprising 569 incidents drawn from AI incident databases, FTC, and IC3 records; 1,067 direct reports from U.S.-based participants across voice actors, internet personalities, political figures, and the general public; and 2,221 Reddit discussions. The central claim is that this taxonomy explicitly models how risks emerge and interact with contextual factors such as degree of exposure, social visibility, and availability of legal protections for different affected groups.

Significance. If the empirical construction and interaction modeling are substantiated with transparent methods, the taxonomy would represent a meaningful advance over existing uniform threat models in AI security and privacy. The scale and diversity of the data sources—particularly the inclusion of reports from voice professionals and public figures—provide a concrete empirical base that could inform policy, detection tools, and governance frameworks. This data-driven approach is a clear strength relative to purely conceptual taxonomies in the field.

major comments (2)
  1. [§3, §4] Data Collection and Taxonomy Derivation: The manuscript provides counts of incidents, reports, and discussions but contains no description of the thematic analysis or coding procedure used to map these sources onto the five V.O.I.C.E categories, nor any mention of inter-rater reliability, duplicate handling, or bias mitigation. This absence directly undermines the central claim that the taxonomy is 'grounded in real-world data' and 'explicitly models' risk emergence.
  2. [§5] Risk Interaction Analysis: The abstract asserts that the taxonomy models interactions between risks and contextual factors (exposure, visibility, legal protections), yet no concrete examples, tables, or diagrams illustrate these interactions for specific groups (e.g., how legal protections differentially affect voice actors versus the general public). Without such evidence, the interaction-modeling claim remains unsupported.
minor comments (2)
  1. [Abstract] The V.O.I.C.E acronym is expanded, but the paper does not provide even one-sentence definitions of each category; adding them would improve immediate readability.
  2. [Related Work] The discussion of prior voice-synthesis risk literature is brief; explicit comparison to existing taxonomies (e.g., those focused on deepfakes or biometric privacy) would clarify the claimed novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important opportunities to strengthen the transparency and evidentiary support in our manuscript. We address each major comment below and commit to revisions that will clarify our methods and substantiate our claims without altering the core contributions.

Point-by-point responses
  1. Referee: [§3, §4] Data Collection and Taxonomy Derivation: The manuscript provides counts of incidents, reports, and discussions but contains no description of the thematic analysis or coding procedure used to map these sources onto the five V.O.I.C.E categories, nor any mention of inter-rater reliability, duplicate handling, or bias mitigation. This absence directly undermines the central claim that the taxonomy is 'grounded in real-world data' and 'explicitly models' risk emergence.

    Authors: We agree that the absence of a methods description for the thematic analysis is a significant omission that weakens the presentation of our empirical grounding. In the revised manuscript, we will insert a dedicated subsection in §4 that details the iterative coding process used to map the 569 incidents, 1,067 reports, and 2,221 discussions onto the V.O.I.C.E categories. This will include the initial open coding approach, how duplicates were identified and resolved across sources, and steps taken to mitigate researcher bias through team review. Although formal inter-rater reliability statistics were not calculated in the original exploratory analysis, we will explicitly describe the collaborative validation process and acknowledge this as a limitation. These additions will directly support the claim that the taxonomy is grounded in real-world data. revision: yes
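Should agreement statistics be added in that revision, Cohen's kappa is the standard chance-corrected measure for two coders labeling the same items. A minimal sketch with made-up category labels (not the paper's data):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both coders labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent coding, from marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

coder1 = ["Identity", "Identity", "Ownership", "Control", "Identity"]
coder2 = ["Identity", "Identity", "Ownership", "Identity", "Identity"]
print(round(cohens_kappa(coder1, coder2), 2))  # → 0.58
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is why referees ask for the statistic rather than a qualitative description of team review.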

  2. Referee: [§5] Risk Interaction Analysis: The abstract asserts that the taxonomy models interactions between risks and contextual factors (exposure, visibility, legal protections), yet no concrete examples, tables, or diagrams illustrate these interactions for specific groups (e.g., how legal protections differentially affect voice actors versus the general public). Without such evidence, the interaction-modeling claim remains unsupported.

    Authors: We concur that §5 currently provides insufficient concrete illustrations of the risk interactions with contextual factors, leaving the abstract's claims under-supported. In the revision, we will expand §5 with specific examples drawn from the empirical sources, a new table that contrasts differential impacts (such as legal protections for voice actors versus the general public), and a diagram showing how exposure, visibility, and legal context modulate the V.O.I.C.E categories. These elements will be tied directly to the collected data to make the interaction modeling explicit and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical taxonomy from external data sources

full rationale

The paper constructs the V.O.I.C.E taxonomy directly from multi-source external data (569 incidents from public databases, 1,067 participant reports, and 2,221 Reddit discussions). No equations, fitted parameters, predictions, or self-referential derivations appear. The central claim is an inductive classification grounded in independent real-world sources rather than any reduction to the paper's own inputs or prior self-citations. This is a standard non-circular empirical effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the collected incidents and discussions sufficiently represent real-world voice synthesis risks and that the five categories plus contextual factors form a complete model.

axioms (1)
  • domain assumption Multi-source incident and report data is representative of the full range of synthetic voice risks and their interactions with exposure, visibility, and legal protections.
    Invoked to support the claim that the taxonomy is 'grounded in real-world data' and 'explicitly models how risk emerges'.
invented entities (1)
  • V.O.I.C.E taxonomy (Voice, Ownership, Identity, Control, Expression categories) no independent evidence
    purpose: To structure and categorize the identified risks.
    New framework defined by the authors from the empirical data.

pith-pipeline@v0.9.0 · 5482 in / 1326 out tokens · 75751 ms · 2026-05-08T07:42:42.395684+00:00 · methodology

discussion (0)

