As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli
Pith reviewed 2026-05-24 03:32 UTC · model grok-4.3
The pith
People detect AI-generated images, audio, video, and audiovisual media at rates close to the 50 percent expected from random guessing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study finds that participants' mean accuracy in distinguishing authentic from synthetic media sits near the 50 percent chance level. Accuracy is lower when any synthetic content is present, when media is single-modality, when images show human faces, when audiovisual items mix real and synthetic parts, and when foreign languages appear. Accuracy also declines with participant age but shows no significant relation to self-reported knowledge of synthetic media.
What carries the argument
The large-scale perceptual detection task that measures participant accuracy rates against a 50 percent chance baseline across four media types.
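A comparison against a 50 percent baseline can be made concrete with an exact binomial test. The sketch below is illustrative only: the counts are hypothetical, the function names are our own, and it treats every judgment as an independent trial, which ignores the clustering of responses within participants that the actual study would need to model.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def two_sided_p_value(correct, total, chance=0.5):
    """Exact two-sided binomial p-value against a fixed chance rate.

    Sums the probability of every outcome that is no more likely than
    the observed one (small tolerance added for floating-point ties).
    """
    pmf = [binomial_pmf(k, total, chance) for k in range(total + 1)]
    observed = pmf[correct]
    tail = sum(prob for prob in pmf if prob <= observed * (1 + 1e-7))
    return min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    print(two_sided_p_value(520, 1000))
```

A large p-value here would mean the observed accuracy is statistically compatible with guessing; it would not by itself show that accuracy equals 50 percent.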
If this is right
- Relying on human perception alone leaves individuals exposed to weaponized synthetic media.
- Single-modality media and items containing human faces require stronger external safeguards.
- Self-reported prior knowledge of synthetic media is not significantly associated with better detection rates.
- Age-related differences suggest older adults face elevated risk from synthetic content.
Where Pith is reading between the lines
- Detection training programs are unlikely to raise performance enough to restore human judgment as a reliable filter.
- Technical provenance systems or watermarking may become necessary because perceptual checks will not scale.
- Real-world accuracy could fall further as generative quality improves beyond the stimuli used here.
Load-bearing premise
The synthetic examples shown to participants match the quality and variety of AI media that ordinary people encounter outside the lab.
What would settle it
A replication using the latest generative models that produces average detection accuracy well above 60 percent would falsify the central claim.
Original abstract
One of the current principal defenses against weaponized synthetic media continues to be the ability of the targeted individual to visually or auditorily recognize AI-generated content when they encounter it. However, as the realism of synthetic media continues to rapidly improve, it is vital to have an accurate understanding of just how susceptible people currently are to potentially being misled by convincing but false AI generated content. We conducted a perceptual study with 1276 participants to assess how capable people were at distinguishing between authentic and synthetic images, audio, video, and audiovisual media. We find that on average, people struggled to distinguish between synthetic and authentic media, with the mean detection performance close to a chance level performance of 50%. We also find that accuracy rates worsen when the stimuli contain any degree of synthetic content, features foreign languages, and the media type is a single modality. People are also less accurate at identifying synthetic images when they feature human faces, and when audiovisual stimuli have heterogeneous authenticity. Finally, we find that higher degrees of prior knowledgeability about synthetic media does not significantly impact detection accuracy rates, but age does, with older individuals performing worse than their younger counterparts. Collectively, these results highlight that it is no longer feasible to rely on the perceptual capabilities of people to protect themselves against the growing threat of weaponized synthetic media, and that the need for alternative countermeasures is more critical than ever before.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a perceptual study with 1276 participants evaluating human ability to distinguish AI-generated from authentic images, videos, audio, and audiovisual stimuli. It claims mean detection accuracy is near chance (50%), with lower accuracy when stimuli contain synthetic content, for single-modality media, for foreign-language content, for images with human faces, and for audiovisual items with heterogeneous authenticity. Self-reported prior knowledge has no significant effect, but age does (older participants perform worse). The conclusion is that human perceptual detection is no longer a viable defense against synthetic media.
Significance. If the synthetic stimuli are representative of current generative models, the near-chance result would provide direct empirical support for shifting from human vigilance to technical countermeasures in HCI, security, and misinformation research. The large sample and multi-modality design strengthen generalizability within the tested conditions.
major comments (3)
- [Methods] Methods section: no stimulus-generation protocol, model names/versions, training data, resolution matching, post-processing steps, or selection/exclusion criteria for the synthetic set are supplied. This directly undermines evaluation of the central claim that detection performance near 50% generalizes to media an ordinary person might encounter.
- [Results] Results and Abstract: headline percentages and statistical claims are presented without confidence intervals, exact p-values, effect sizes, or full demographic breakdown, preventing assessment of whether the 'close to chance' finding is robust or driven by specific subgroups.
- [Discussion] Discussion: the claim that 'it is no longer feasible to rely on the perceptual capabilities of people' rests on the untested assumption that the tested stimuli match the current state of generative models; without that, the policy implication does not follow from the data.
minor comments (2)
- [Abstract] Abstract and Results: 'mean detection performance close to a chance level' should be accompanied by the exact mean and standard deviation or CI in the abstract itself.
- [Figures/Tables] Table/Figure captions: ensure all stimuli characteristics (e.g., resolution, duration) are reported so readers can judge representativeness.
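The confidence intervals requested in the report could be computed in several ways; one common choice for a proportion is the Wilson score interval. The sketch below is a minimal illustration under stated assumptions: the counts are hypothetical, `wilson_interval` is our own helper name, and the calculation assumes independent trials rather than the repeated-measures structure a full analysis would account for.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper.
    low, high = wilson_interval(520, 1000)
    print(f"accuracy 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that contains 0.5 would support the "close to chance" reading; an interval that excludes it, even narrowly, would call for the exact mean to be stated alongside the claim.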
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript's transparency and precision without altering its core findings.
- Referee: [Methods] Methods section: no stimulus-generation protocol, model names/versions, training data, resolution matching, post-processing steps, or selection/exclusion criteria for the synthetic set are supplied. This directly undermines evaluation of the central claim that detection performance near 50% generalizes to media an ordinary person might encounter.
  Authors: We agree that comprehensive details on stimulus generation are necessary to evaluate generalizability. The original submission referenced the models in supplementary materials but did not fully integrate them into the main Methods section. In revision we will add a dedicated subsection detailing the exact generative models and versions (e.g., Stable Diffusion variants for images, specific audio and video models), training data sources, resolution and format matching procedures, post-processing steps, and explicit inclusion/exclusion criteria for the synthetic stimuli. This will directly address the concern and allow readers to assess representativeness.
  Revision: yes
- Referee: [Results] Results and Abstract: headline percentages and statistical claims are presented without confidence intervals, exact p-values, effect sizes, or full demographic breakdown, preventing assessment of whether the 'close to chance' finding is robust or driven by specific subgroups.
  Authors: We accept this criticism. The revised manuscript will report 95% confidence intervals around all accuracy percentages, exact p-values for all statistical comparisons, effect sizes (Cohen's d or equivalent), and a complete demographic table (age, gender, education, prior exposure) for the full sample of 1276 participants. These additions will be placed in both the Results section and, where appropriate, the Abstract to enable proper evaluation of robustness.
  Revision: yes
- Referee: [Discussion] Discussion: the claim that 'it is no longer feasible to rely on the perceptual capabilities of people' rests on the untested assumption that the tested stimuli match the current state of generative models; without that, the policy implication does not follow from the data.
  Authors: We will revise the Discussion to qualify the policy claim more precisely. The revised text will explicitly state that the stimuli were produced with models that represented the state of the art at the time of data collection (late 2023–early 2024), note the specific models used, and acknowledge that subsequent generations may be even harder to detect. We will add a dedicated limitations paragraph discussing temporal specificity and the trajectory of generative improvement, thereby grounding the conclusion in the tested conditions while still highlighting the practical implications for countermeasures.
  Revision: partial
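The rebuttal names Cohen's d as an effect-size measure. As a rough sketch of what that involves for a two-group comparison such as younger versus older participants, the following computes d with a pooled standard deviation. The per-participant accuracy values are invented for illustration, and `cohens_d` is a hypothetical helper, not the authors' analysis code.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Invented per-participant accuracy scores, not data from the study.
    younger = [0.55, 0.60, 0.52, 0.58, 0.49]
    older = [0.48, 0.51, 0.45, 0.50, 0.47]
    print(f"Cohen's d (younger - older): {cohens_d(younger, older):.2f}")
```

Reporting d alongside the p-value lets a reader judge whether an age effect that is statistically significant in a sample of 1276 is also large enough to matter in practice.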
Circularity Check
Pure empirical measurement study with no derivation chain
The paper reports results from a human perceptual experiment (N=1276) measuring detection accuracy for synthetic vs authentic media. No equations, parameters, or derivations are present. The central claim (mean accuracy near 50%) is a direct statistical summary of participant responses, not a prediction derived from fitted inputs or self-citations. No load-bearing self-citation chains, uniqueness theorems, or ansatzes appear. The study is self-contained as an empirical measurement against external benchmarks (participant performance), warranting score 0 per the rules for papers without mathematical reductions.
Forward citations
Cited by 1 Pith paper
- Explainable Detection of Machine Generated Music and Early Systematic Evaluation
The authors provide the first systematic benchmark of traditional ML, DNN, Transformer, state-space, and multimodal models for machine-generated music detection, augmented with XAI analysis, and report ResNet18 as the...