
arxiv: 2605.28064 · v1 · pith:YM6UBXN4new · submitted 2026-05-27 · 📡 eess.AS · cs.AI · cs.HC

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

Pith reviewed 2026-06-29 10:14 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.HC
keywords synthetic speech detection · voice deepfakes · human perception · trust cues · localization task · perceptual quality · deepfake detection · socio-technical systems

The pith

Humans detected fully synthetic speech below chance levels while quality ratings revealed implicit discrimination by utterance type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how people actually spot synthetic speech in context rather than in isolated lab tests. It presents a task where participants locate suspected fake segments in real, fully fake, and mixed utterances while trust cues like instructions, priming, and labels are varied. Fully synthetic speech went undetected at rates worse than guessing, yet listeners' ratings of mechanicalness and other qualities still aligned with the true utterance classes. This suggests that overt detection fails even as some perceptual processing succeeds. The work matters because it shows that human judgment alone may not reliably counter voice deepfakes in everyday settings.

Core claim

In a localization task with 47 participants, utterance class determined both detection accuracy and perceptual quality ratings; manipulated trust cues affected motivation but produced no main effects on performance. Fully synthetic speech was identified at below-chance levels, while quality ratings for mechanicalness, expressiveness, and related dimensions tracked the authentic, fully synthetic, and partially synthetic categories, indicating implicit discrimination where explicit detection failed.

What carries the argument

The localization task in which participants mark suspected synthetic segments under three manipulated trust cues (instructional framing, affective priming, and provenance labeling).

If this is right

  • Human oversight cannot be relied upon to catch fully synthetic speech at usable rates.
  • Perceptual quality ratings may serve as indirect signals of synthetic content even when listeners cannot name the source.
  • Trust cues influence willingness to report suspicion but not the accuracy of that suspicion.
  • Design of detection systems should incorporate implicit perceptual measures rather than depending solely on explicit judgments.
  • Partially synthetic utterances may be easier to flag than fully synthetic ones under the same conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training listeners on quality-rating dimensions could improve downstream detection performance without changing overt task instructions.
  • Systems that combine human quality ratings with automatic classifiers might achieve better coverage than either alone.
  • Real-world deployment of voice interfaces may need provenance indicators that go beyond simple labels if trust cues show no effect in controlled settings.
  • The gap between implicit and explicit detection points to a need for experiments that measure reaction times or eye movements during listening.

Load-bearing premise

The controlled localization task with its specific manipulations accurately represents how humans encounter synthetic speech outside the experiment.

What would settle it

A field study in which participants encounter synthetic speech in ordinary conversations or media and show detection rates at or above chance.

Figures

Figures reproduced from arXiv: 2605.28064 by Lelia Erscoi (1), Tomi Kinnunen (1) ((1) Computational Speech Group, University of Eastern Finland).

Figure 1. How user markers translate to evaluation metrics. Each speech excerpt is divided into windows (length = 0.2 s) to extract the ground-truth label and to verify whether the user placed a marker (flag or segment) within that window. A ±200 ms temporal margin is applied to each marker to account for reaction-time delays.
Figure 2. Decision-making patterns across conditions. (A) Positive correlation between action count and trial duration across all utterance types; synthetic speech trials required significantly more time and annotations. (B) Mean trial duration. The positive instruction framing (I+) increased duration by 22 s, whereas positive valence (V+) and provenance labeling (P+) reduced it by 33 s and 32 s, respectively. (…
Figure 3. Crowd-level detection performance. (A) Raw discriminability. Distributions show the internal decision variable for real and fake windows, indicating near-chance discrimination. (B) Confidence calibration. Observed accuracy as a function of self-reported confidence indicates that participants were systematically overconfident. (C) Majority-vote accuracy by utterance type. Authentic and partially synthetic c…
Figure 4. Percentage of positive ratings (“Agree”, “Strongly Agree”). Authentic speech consistently received the highest subjective ratings across all dimensions, while fully synthetic speech received the lowest. Note that mechanicalness is presented as rated (higher = more mechanical), such that lower scores indicate higher perceived quality for this dimension.
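
The window-based scoring described in the Figure 1 caption can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names (`window_labels`, `confusion`), the use of integer milliseconds, the half-open window boundaries, and the choice to pad the marker (rather than the window) by the margin are all assumptions. Only the 0.2 s window length and the ±200 ms margin come from the caption, and the example region and flag are hypothetical.

```python
from typing import Dict, List, Tuple

WINDOW_MS = 200  # window length (0.2 s), from the Figure 1 caption
MARGIN_MS = 200  # +/-200 ms reaction-time margin, from the Figure 1 caption

Interval = Tuple[int, int]  # (start_ms, end_ms); a flag is a zero-length interval


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two intervals share a stretch of positive duration."""
    return a[0] < b[1] and b[0] < a[1]


def window_labels(duration_ms: int, regions: List[Interval], margin_ms: int = 0) -> List[bool]:
    """Label each window True if any region, padded by margin_ms, overlaps it."""
    n_windows = -(-duration_ms // WINDOW_MS)  # ceiling division
    labels = []
    for i in range(n_windows):
        window = (i * WINDOW_MS, (i + 1) * WINDOW_MS)
        labels.append(
            any(overlaps(window, (start - margin_ms, end + margin_ms)) for start, end in regions)
        )
    return labels


def confusion(truth: List[bool], marked: List[bool]) -> Dict[str, int]:
    """Count window-level outcomes of participant markers against ground truth."""
    counts = {"hits": 0, "misses": 0, "false_alarms": 0, "correct_rejections": 0}
    for is_fake, was_marked in zip(truth, marked):
        if is_fake and was_marked:
            counts["hits"] += 1
        elif is_fake:
            counts["misses"] += 1
        elif was_marked:
            counts["false_alarms"] += 1
        else:
            counts["correct_rejections"] += 1
    return counts


# Hypothetical 2 s excerpt: synthetic from 0.8 s to 1.4 s, one flag placed at 1.0 s.
truth = window_labels(2000, [(800, 1400)])
marked = window_labels(2000, [(1000, 1000)], MARGIN_MS)
result = confusion(truth, marked)
```

Padding the marker means a single flag can credit more than one window. Whether the study scores it that way, or pads the window instead, is not stated in the caption.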
Original abstract

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports an empirical user study with 47 participants performing a localization task to identify suspected synthetic segments in authentic, fully synthetic, and partially synthetic utterances. Three trust cues (instructional framing, affective priming, provenance labeling) were manipulated. Participants also provided perceptual quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and evaluation confidence. The central claims are that utterance class is the primary driver of both overt detection accuracy and quality ratings, trust cues produce no main effects on detection but do motivate behavior, fully synthetic speech is detected at below-chance levels, and quality ratings reveal implicit discrimination even when overt localization fails.

Significance. If the empirical results hold after proper statistical reporting and controls, the work contributes to socio-technical research on deepfake speech by documenting human limitations in explicit detection alongside evidence of implicit sensitivity via quality judgments. This could inform the design of warning systems, provenance interfaces, and training for human detectors, highlighting that contextual cues may matter more for behavior than for accuracy.

major comments (2)
  1. [Abstract / Methods] Abstract and Methods: The claims of main effects, below-chance detection of fully synthetic speech, and implicit discrimination via quality ratings are presented without any statistical tests, exact definition of the chance baseline, per-condition hit/false-alarm rates, participant demographics, or exclusion criteria. These omissions are load-bearing because they prevent evaluation of whether the data actually support the reported effects and the below-chance result.
  2. [Methods] Methods: The controlled localization task with manipulated instructional framing, affective priming, and provenance labeling is presented as modeling the socio-technical environment, yet no validation or discussion of ecological validity is supplied to justify that the artificial task and cue manipulations generalize to naturalistic encounters with synthetic speech.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the scoring rule used for 'chance' performance and the primary statistical approach.
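
To illustrate why the chance baseline matters for a "below-chance" claim, here is a minimal sketch of an exact one-sided binomial test. It is not the paper's analysis. The chance level `p0 = 0.5` is an assumption (the page does not say how chance was defined), the counts are hypothetical, and the test treats trials as independent, which repeated measurements from the same 47 participants would violate.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p0: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p0): one-sided p-value for 'below chance'."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


# Hypothetical counts, not data from the paper: 2 correct verdicts out of 4 trials.
p_value = binom_lower_tail(k=2, n=4)
```

A small p-value would indicate accuracy reliably below `p0`. A different definition of chance, for example one based on the proportion of synthetic windows, would change `p0` and therefore the conclusion.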

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments below, agreeing to incorporate additional statistical details and a discussion of ecological validity in the revised version.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: The claims of main effects, below-chance detection of fully synthetic speech, and implicit discrimination via quality ratings are presented without any statistical tests, exact definition of the chance baseline, per-condition hit/false-alarm rates, participant demographics, or exclusion criteria. These omissions are load-bearing because they prevent evaluation of whether the data actually support the reported effects and the below-chance result.

    Authors: We agree with the referee that the abstract and methods sections would be strengthened by including the requested statistical information. In the revision, we will add the results of the appropriate statistical tests supporting the main effects and the below-chance detection, provide an exact definition of the chance baseline for the localization task, report per-condition hit and false-alarm rates, include participant demographics, and specify exclusion criteria. These additions will clarify how the data support the claims regarding utterance class as the primary driver and implicit discrimination via quality ratings. revision: yes

  2. Referee: [Methods] Methods: The controlled localization task with manipulated instructional framing, affective priming, and provenance labeling is presented as modeling the socio-technical environment, yet no validation or discussion of ecological validity is supplied to justify that the artificial task and cue manipulations generalize to naturalistic encounters with synthetic speech.

    Authors: Regarding ecological validity, we recognize that the manuscript would benefit from explicit discussion of this issue. Although the task is designed as a controlled experiment to examine the effects of trust cues in a socio-technical context, we will add a dedicated paragraph in the Discussion section addressing the limitations of the laboratory setting and the extent to which the findings may generalize to naturalistic encounters with synthetic speech. We will also elaborate on how the cue manipulations are intended to model real-world trust signals. revision: yes
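
The per-condition hit and false-alarm rates requested by the referee could be summarized along the following lines. This is a hedged sketch: the sensitivity index d′ is a standard signal-detection summary that the page does not confirm the paper reports, the log-linear correction (adding 0.5 to each count) is one common convention among several, and the function names are hypothetical.

```python
from statistics import NormalDist
from typing import Tuple


def rates(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> Tuple[float, float]:
    """Hit and false-alarm rates with a log-linear correction to avoid 0 or 1."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return hit_rate, fa_rate


def d_prime(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> float:
    """Sensitivity index: z(hit rate) - z(false-alarm rate)."""
    hit_rate, fa_rate = rates(hits, misses, false_alarms, correct_rejections)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)
```

A d′ near zero means marked and unmarked windows are not being separated, and a negative d′ means synthetic windows are flagged less often than authentic ones. Reporting the two rates separately also shows whether a low accuracy comes from missed synthetic speech or from false alarms on authentic speech.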

Circularity Check

0 steps flagged

Empirical user study with no derivation chain present


The paper reports results from a controlled localization task with 47 human participants, including detection rates, quality ratings across utterance types, and effects of manipulated trust cues. No equations, parameters, derivations, or predictive models are described in the abstract or referenced full text. All claims reduce directly to observed experimental outcomes rather than any self-referential construction, fitted input, or self-citation chain. This matches the default case of a self-contained empirical report with no load-bearing mathematical steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical perceptual experiment. It rests on standard assumptions about human response data and experimental validity rather than free parameters or new entities.

axioms (1)
  • domain assumption: Participant responses in a localization task can be aggregated to measure detection accuracy and quality perception under controlled cue manipulations.
    The study design assumes that the chosen task and rating scales validly capture implicit versus overt discrimination.

pith-pipeline@v0.9.1-grok · 5682 in / 1278 out tokens · 34649 ms · 2026-06-29T10:14:42.752141+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

    Introduction Generative artificial intelligence (genAI), particularly conversational agents, extends a socio-cultural transition of epistemic authority and information sourcing, from traditional news outlets, to social media platforms and now to personalized chatbots [1, 2]. Within this shifting landscape, trust becomes a currency of the attention e...

  2. [2]

    Over time, its definition has expanded to encompass generative techniques across audio, text, images, videos, and multimodal media

    Socio-technical background The term “deepfake” was coined in the mid-2010s to describe non-consensual impersonating pornographic synthetic media produced with deep learning techniques [24]. Over time, its definition has expanded to encompass generative techniques across audio, text, images, videos, and multimodal media. Beneficial applications such as...

  3. [3]

    Dataset The LlamaPartialSpoof [38] is a recent dataset that contains both fully synthetic and partially spoofed speech

    Methods 3.1. Dataset The LlamaPartialSpoof [38] is a recent dataset that contains both fully synthetic and partially spoofed speech. It contains English utterances from 40 LibriTTS [39] speakers, with synthetic counterparts generated using five open-source models (LJ JETS, YourTTS, XTTS V2, GPT-SoVITS, CosyVoice) plus one commercial service (E...

  4. [4]

    Results Each utterance was processed by aligning participant marker placements to ground-truth synthetic regions via sliding window scoring (see Figure 1), producing window-level metrics from which an overall verdict per utterance was derived and inter-participant agreement calculated. The results reveal a critical tension in the socio-technical trust...

  5. [5]

    Discussion Utterance type emerged as the dominant determinant of detection, suggesting that the environmental dimension of trust may fall to a secondary role in real-time detection tasks. This pattern may reflect a mismatch in listener expectations: participants entered the task expecting robotic or background artifacts [6, 18], yet encountered realis...

  6. [6]

    Limitations and future directions Exploring context-driven human decision-making factors is a multidimensional problem. If automated detection studies test the strength of countermeasures [7, 16, 17, 22, 29], with human listener studies exploring themes of detection [6, 7, 18, 46], our study tackled the environment dimension by integrating variables tha...

  7. [7]

    Conclusion This study challenges the “one solution fits all” approach to synthetic speech detection by investigating (partial) voice deepfake detection as a socio-technical process shaped by trust. Although participants expressed confidence in their judgments, they were generally unable to detect synthetic speech, with detection performance improving o...

  8. [8]

    349605, project “SPEECHFAKES”)

    Acknowledgments This work was carried out as part of the VoCS (Voice in Communication Sciences) doctoral network, funded by the European Union’s Horizon Europe Framework programme under Grant Agreement No 101168998 and partially supported by the Academy of Finland (Decision No. 349605, project “SPEECHFAKES”). This study was submitted to and receiv...

  9. [9]

    Public intellectuals on new platforms: constructing critical authority in a digital media culture

    M. B. Johansen, “Public intellectuals on new platforms: constructing critical authority in a digital media culture,” in Rethinking cultural criticism: New voices in the digital age. Springer, 2020, pp. 17–42

  10. [10]

    Large language models are echo chambers

    J. Nehring, A. Gabryszak, P. Jürgens, A. Burchardt, S. Schaffer, M. Spielkamp, and B. Stark, “Large language models are echo chambers,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 10117–10123

  11. [11]

    Attention economy theory

    J. Myers, “Attention economy theory,” in Media Ecology for the 21st Century: Theories of Culture, Communications, and Consciousness. Springer, 2025, pp. 101–109

  12. [12]

    Synthetic media detection, the wheel, and the burden of proof

    K. R. Harris, “Synthetic media detection, the wheel, and the burden of proof,” Philosophy & Technology, vol. 37, no. 4, p. 131, 2024

  13. [13]

    Deepfakes and trust in technology

    O. Laas, “Deepfakes and trust in technology,” Synthese, vol. 202, no. 5, p. 132, 2023

  14. [14]

    “Better be computer or I’m dumb”: A large-scale evaluation of humans as audio deepfake detectors

    K. Warren, T. Tucker, A. Crowder, D. Olszewski, A. Lu, C. Fedele, M. Pasternak, S. Layton, K. Butler, C. Gates et al., “‘Better be computer or I’m dumb’: A large-scale evaluation of humans as audio deepfake detectors,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 2696–2710

  15. [15]

    Partial fake speech attacks in the real world using deepfake audio

    A. Alali and G. Theodorakopoulos, “Partial fake speech attacks in the real world using deepfake audio,” Journal of Cybersecurity and Privacy, vol. 5, no. 1, p. 6, 2025

  16. [16]

    Can you tell it’s AI? Human perception of synthetic voices in vishing scenarios

    Z. H. Bhatti, B. Ahtisham, S. Tausif, N. George, M. Javed et al., “Can you tell it’s AI? Human perception of synthetic voices in vishing scenarios,” arXiv preprint arXiv:2602.20061, 2026

  17. [17]

    When machines speak with feeling: Investigating emotional prosody, authenticity, and trust in AI vs. human voices

    G. Fan and D. Liu, “When machines speak with feeling: Investigating emotional prosody, authenticity, and trust in AI vs. human voices,” in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 47, 2025

  18. [18]

    Spoofing and countermeasures for speaker verification: A survey

    Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015

  19. [19]

    Why do people spread false information online? The effects of message and viewer characteristics on self-reported likelihood of sharing social media disinformation

    T. Buchanan, “Why do people spread false information online? The effects of message and viewer characteristics on self-reported likelihood of sharing social media disinformation,” PLOS ONE, vol. 15, no. 10, p. e0239666, 2020

  20. [20]

    AI or your lying eyes: Some shortcomings of artificially intelligent deepfake detectors

    K. R. Harris, “AI or your lying eyes: Some shortcomings of artificially intelligent deepfake detectors,” Philosophy & Technology, vol. 37, no. 1, p. 7, 2024

  21. [21]

    How we trust, perceive, and learn from virtual humans: The influence of voice quality

    E. K. Chiou, N. L. Schroeder, and S. D. Craig, “How we trust, perceive, and learn from virtual humans: The influence of voice quality,” Computers & Education, vol. 146, p. 103756, 2020

  22. [22]

    “Human, all too human”: NOAA weather radio and the emotional impact of synthetic voices

    K. M. Scott, S. Ashby, and J. Hanna, “‘Human, all too human’: NOAA weather radio and the emotional impact of synthetic voices,” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–9

  23. [23]

    YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone

    E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720

  24. [24]

    ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  25. [25]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024

  26. [26]

    Detecting the undetectable: Human judgments and the challenge of synthetic voices

    S. Amirkhani, G. Stevens, M. Shajalal, and A. Boden, “Detecting the undetectable: Human judgments and the challenge of synthetic voices,” in Proceedings of the 12th International Conference on Communities & Technologies (C&T 2025). European Society for Socially Embedded Technologies (EUSSET), 2025

  27. [27]

    Human trust in artificial intelligence: Review of empirical research

    E. Glikson and A. W. Woolley, “Human trust in artificial intelligence: Review of empirical research,” Academy of Management Annals, vol. 14, no. 2, pp. 627–660, 2020

  28. [28]

    A review of trust in artificial intelligence: Challenges, vulnerabilities and future directions

    S. Lockey, N. Gillespie, D. Holm, and I. A. Someh, “A review of trust in artificial intelligence: Challenges, vulnerabilities and future directions,” 2021

  29. [29]

    Deepfakes: Deceptions, mitigations, and opportunities

    M. Mustak, J. Salminen, M. Mäntymäki, A. Rahman, and Y. K. Dwivedi, “Deepfakes: Deceptions, mitigations, and opportunities,” Journal of Business Research, vol. 154, p. 113368, 2023

  30. [30]

    Where are we in audio deepfake detection? A systematic analysis over generative and detection models

    X. Li, P.-Y. Chen, and W. Wei, “Where are we in audio deepfake detection? A systematic analysis over generative and detection models,” ACM Transactions on Internet Technology, 2025

  31. [31]

    Human performance in deepfake detection: a systematic review

    K. Somoray, D. J. Miller, and M. Holmes, “Human performance in deepfake detection: a systematic review,” Human Behavior and Emerging Technologies, vol. 2025, no. 1, p. 1833228, 2025

  32. [32]

    Ctrl-alt-del: Gamergate as a precursor to the rise of the alt-right

    K. M. Bezio, “Ctrl-alt-del: Gamergate as a precursor to the rise of the alt-right,” Leadership, vol. 14, no. 5, pp. 556–566, 2018

  33. [33]

    Key challenges in using automatic dubbing to translate educational YouTube videos

    R. Baños, “Key challenges in using automatic dubbing to translate educational YouTube videos,” Linguistica Antverpiensia, New Series–Themes in Translation Studies, vol. 22, 2023

  34. [34]

    Grooming an ideal chatbot by training the algorithm: Exploring the exploitation of Replika users’ immaterial labor

    S. Pan, L. Fortunati, and A. Edwards, “Grooming an ideal chatbot by training the algorithm: Exploring the exploitation of Replika users’ immaterial labor,” New Media & Society, vol. 27, no. 10, pp. 5489–5507, 2025

  35. [35]

    Sycophantic AI decreases prosocial intentions and promotes dependence

    M. Cheng, C. Lee, P. Khadpe, S. Yu, D. Han, and D. Jurafsky, “Sycophantic AI decreases prosocial intentions and promotes dependence,” arXiv preprint arXiv:2510.01395, 2025

  36. [36]

    Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act)

    “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act),” 2024. [Online]. Available: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

  37. [37]

    Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures

    A. Khan, K. M. Malik, J. Ryan, and M. Saravanan, “Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures,” Artificial Intelligence Review, vol. 56, no. Suppl 1, pp. 513–566, 2023

  38. [38]

    It’s not just “press record”: a viewpoint for providing ethical voice banking

    L. Krikheli, S. El-Wahsh, and R. Cave, “It’s not just ‘press record’: a viewpoint for providing ethical voice banking,” Evidence-Based Communication Assessment and Intervention, pp. 1–10, 2026

  39. [39]

    Can emotion fool anti-spoofing?

    A. Mahapatra, I. R. Ulgen, A. R. Naini, C. Busso, and B. Sisman, “Can emotion fool anti-spoofing?” arXiv preprint arXiv:2505.23962, 2025

  40. [40]

    Every breath you don’t take: Deepfake speech detection using breath

    S. Layton, T. De Andrade, D. Olszewski, K. Warren, C. Gates, K. Butler, and P. Traynor, “Every breath you don’t take: Deepfake speech detection using breath,” Digital Threats: Research and Practice, vol. 6, no. 3, pp. 1–18, 2025

  41. [41]

    What is a digital persona?

    D. De Kerckhove and C. M. De Almeida, “What is a digital persona?” Technoetic Arts: A Journal of Speculative Research, vol. 11, no. 3, pp. 277–287, 2013

  42. [42]

    Rating naturalness in speech synthesis: The effect of style and expectation

    R. Dall, J. Yamagishi, and S. King, “Rating naturalness in speech synthesis: The effect of style and expectation,” in Speech Prosody 2014, 2014

  43. [43]

    The role of affect in decision making

    G. Loewenstein, J. S. Lerner et al., “The role of affect in decision making,” Handbook of affective science, vol. 619, no. 642, p. 3, 2003

  44. [44]

    Emotions don’t lie: An audio-visual deepfake detection method using affective cues

    T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha, “Emotions don’t lie: An audio-visual deepfake detection method using affective cues,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2823–2832

  45. [45]

    Labeling synthetic content: User perceptions of label designs for AI-generated content on social media

    D. Gamage, D. Sewwandi, M. Zhang, and A. K. Bandara, “Labeling synthetic content: User perceptions of label designs for AI-generated content on social media,” in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 2025, pp. 1–29

  46. [46]

    LlamaPartialSpoof: An LLM-driven fake speech dataset simulating disinformation generation

    H.-T. Luong, H. Li, L. Zhang, K. A. Lee, and E. S. Chng, “LlamaPartialSpoof: An LLM-driven fake speech dataset simulating disinformation generation,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  47. [47]

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019

  48. [48]

    Understanding fatigue and its impact in crowdsourcing

    Y. Zhang, X. Ding, and N. Gu, “Understanding fatigue and its impact in crowdsourcing,” in 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 2018, pp. 57–62

  49. [49]

    The international soundscape database: An integrated multimedia database of urban soundscape surveys–questionnaires with acoustical and contextual information

    A. Mitchell, T. Oberman, F. Aletta, M. Erfanian, M. Kachlicka, M. Lionello, and J. Kang, “The international soundscape database: An integrated multimedia database of urban soundscape surveys–questionnaires with acoustical and contextual information,” Online. URL: https://www.zenodo.org/record/5914715#.YnwwoGDP00Q, 2021

  50. [50]

    Presenting and processing information in background noise: A combined speaker–listener perspective

    A. Bockstael, L. Samyn, P. Corthals, and D. Botteldooren, “Presenting and processing information in background noise: A combined speaker–listener perspective,” The Journal of the Acoustical Society of America, vol. 143, no. 1, pp. 210–218, 2018

  51. [51]

    Introducing the open affective standardized image set (OASIS)

    B. Kurdi, S. Lozano, and M. R. Banaji, “Introducing the open affective standardized image set (OASIS),” Behavior Research Methods, vol. 49, no. 2, pp. 457–470, 2017

  52. [52]

    Decisional carryover effects in interval timing: Evidence of a generalized response bias

    J. J. Wehrman, J. Wearden, and P. Sowman, “Decisional carryover effects in interval timing: Evidence of a generalized response bias,” Attention, Perception, & Psychophysics, vol. 82, no. 4, pp. 2147–2164, 2020

  53. [53]

    Good Practices for Evaluation of Synthesized Speech

    E. Cooper, S. L. Maguer, E. Klabbers, and J. Yamagishi, “Good practices for evaluation of synthesized speech,” arXiv preprint arXiv:2503.03250, 2025

  54. [54]

    Human perception of audio deepfakes: the role of language and speaking style

    E. San Segundo, A. López-Jareño, X. Wang, and J. Yamagishi, “Human perception of audio deepfakes: the role of language and speaking style,” Available at SSRN 5954496, 2025

  55. [55]

    The intelligibility benefits of modern computer-synthesized speech for normal-hearing and hearing-impaired listeners in non-ideal listening conditions

    Y. Ma and Y. Tang, “The intelligibility benefits of modern computer-synthesized speech for normal-hearing and hearing-impaired listeners in non-ideal listening conditions,” Journal of Otorhinolaryngology, Hearing and Balance Medicine, vol. 5, no. 1, p. 5, 2024

  56. [56]

    Automatic speaker verification on compressed audio

    O. Sokol, H. Naumenko, V. Derkach, V. Kuznetsov, D. Progonov, and V. Husiev, “Automatic speaker verification on compressed audio,” in 2022 12th International Conference on Dependable Systems, Services and Technologies (DESSERT). IEEE, 2022, pp. 1–7

  57. [57]

    Trust in artificial voices: A “congruency effect” of first impressions and behavioural experience

    I. Torre, J. Goslin, L. White, and D. Zanatto, “Trust in artificial voices: A ‘congruency effect’ of first impressions and behavioural experience,” Apr. 2018

  58. [58]

    “Human, all too human”: NOAA weather radio and the emotional impact of synthetic voices

    K. Scott, S. Ashby, and J. Hanna, “‘Human, all too human’: NOAA weather radio and the emotional impact of synthetic voices,” Apr. 2020, pp. 1–9

  59. [59]

    Unmasking illusions: Understanding human perception of audiovisual deepfakes

    A. Hashmi, S. A. Shahzad, C.-W. Lin, Y. Tsao, and H.-M. Wang, “Unmasking illusions: Understanding human perception of audiovisual deepfakes,” arXiv preprint arXiv:2405.04097, 2024

  60. [60]

    The perils of automaticity

    J. Toner, B. G. Montero, and A. Moran, “The perils of automaticity,” Review of General Psychology, vol. 19, no. 4, pp. 431–442, 2015

  61. [61]

    The effects of task difficulty and multitasking on performance

    R. F. Adler and R. Benbunan-Fich, “The effects of task difficulty and multitasking on performance,” Interacting with Computers, vol. 27, no. 4, pp. 430–439, 2015

  62. [62]

    Quality control in crowdsourcing systems: Issues and directions

    M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, “Quality control in crowdsourcing systems: Issues and directions,” IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, 2013

  63. [63]

    Something AI should tell you–the case for labelling synthetic content

    S. A. Fisher, “Something AI should tell you–the case for labelling synthetic content,” Journal of Applied Philosophy, vol. 42, no. 1, pp. 272–286, 2025

  64. [64]

    Statsmodels: Econometric and statistical modeling with Python

    S. Seabold, J. Perktold et al., “Statsmodels: Econometric and statistical modeling with Python,” scipy, vol. 7, no. 1, pp. 92–96, 2010

  65. [65]

    Trust in artificial voices: A “congruency effect” of first impressions and behavioural experience

    I. Torre, J. Goslin, L. White, and D. Zanatto, “Trust in artificial voices: A ‘congruency effect’ of first impressions and behavioural experience,” in Proceedings of the Technology, Mind, and Society, 2018, pp. 1–6

  66. [66]

    How do voice acoustics affect the perceived trustworthiness of a speaker? A systematic review

    C. Maltezou-Papastylianou, R. Scherer, and S. Paulmann, “How do voice acoustics affect the perceived trustworthiness of a speaker? A systematic review,” Frontiers in Psychology, vol. 16, p. 1495456, 2025