As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli

Abigail Edwards; Di Cooke; Kathryn Kelly; Sophia Barkoff

arXiv: 2403.16760 · v5 · submitted 2024-03-25 · 💻 cs.HC · cs.AI · cs.SD · eess.AS

Pith reviewed 2026-05-24 03:32 UTC · model grok-4.3

keywords AI-generated media · human detection · synthetic media · perceptual study · deepfakes · multimodal detection · chance-level accuracy

The pith

People distinguish AI-generated images, audio, and video from authentic media at rates close to the 50 percent expected from random guessing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a study of 1276 participants who attempted to identify real versus AI-made images, videos, audio clips, and audiovisual combinations. Detection accuracy averaged near 50 percent across conditions, with performance dropping further when stimuli included synthetic elements, foreign languages, single modalities, human faces, or mismatched authenticity in combined media. Accuracy declined with participant age, but self-reported prior knowledge of synthetic media had no significant effect. The work tests the assumption that individuals can serve as their own first line of defense by recognizing fakes on sight.

Core claim

The study finds that participants' mean accuracy in distinguishing authentic from synthetic media hovers at chance level near 50 percent, with lower performance when any synthetic content is present, when media is single-modality, when images show human faces, when audiovisual items mix real and synthetic parts, and when foreign languages appear; accuracy also declines with participant age but shows no significant relation to self-reported knowledge of synthetic media.

What carries the argument

The large-scale perceptual detection task that measures participant accuracy rates against a 50 percent chance baseline across four media types.
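
The comparison against a 50 percent baseline can be illustrated with a short sketch. The counts below are hypothetical, not the paper's data, and the names `wilson_interval` and `excludes_chance` are introduced here for illustration only. The sketch also assumes all trials are independent, which a real analysis of repeated judgments per participant would likely need to relax.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def excludes_chance(correct: int, total: int, chance: float = 0.5) -> bool:
    """True when the interval around observed accuracy does not contain chance."""
    low, high = wilson_interval(correct, total)
    return not (low <= chance <= high)

# Hypothetical counts: 520 correct judgments out of 1000 trials.
low, high = wilson_interval(520, 1000)
print(f"accuracy 0.520, approx. 95% interval [{low:.3f}, {high:.3f}]")
print("distinguishable from chance:", excludes_chance(520, 1000))
```

An interval that contains 0.5 is what "near chance" means operationally in this sketch; an interval that sits entirely above 0.5 would indicate detection better than guessing, however small the margin.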

If this is right

Relying on human perception alone leaves individuals exposed to weaponized synthetic media.
Single-modality media and items containing human faces require stronger external safeguards.
Self-reported prior knowledge of synthetic media does not significantly improve detection rates.
Age-related differences suggest older adults face elevated risk from synthetic content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Detection training programs are unlikely to raise performance enough to restore human judgment as a reliable filter.
Technical provenance systems or watermarking may become necessary because perceptual checks will not scale.
Real-world accuracy could fall further as generative quality improves beyond the stimuli used here.

Load-bearing premise

The synthetic examples shown to participants match the quality and variety of AI media that ordinary people encounter outside the lab.

What would settle it

A replication using the latest generative models that produces average detection accuracy well above 60 percent would falsify the central claim.
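
One way to make this criterion checkable is a one-sided exact binomial test against the 60 percent threshold. The sketch below is an illustration under stated assumptions: the 0.05 significance level, the reading of "well above" as "significantly above", the pooled independent trials, and the function names `binom_upper_tail` and `exceeds_threshold` are all choices made here, not taken from the paper.

```python
import math

def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(1.0, total)

def exceeds_threshold(correct: int, total: int,
                      threshold: float = 0.60, alpha: float = 0.05) -> bool:
    """True when observed accuracy is significantly above the threshold."""
    return binom_upper_tail(correct, total, threshold) < alpha

# Hypothetical replication outcomes, not real data.
for correct in (610, 650):
    p_value = binom_upper_tail(correct, 1000, 0.60)
    print(correct, f"p = {p_value:.4f}", exceeds_threshold(correct, 1000))
```

Testing against 0.60 rather than 0.50 reflects the wording of the criterion: accuracy slightly above chance would not count against the claim, only accuracy clearly beyond the stated threshold.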

Figures

Figures reproduced from arXiv: 2403.16760 by Abigail Edwards, Di Cooke, Kathryn Kelly, Sophia Barkoff.

**Figure 3.** Mean detection accuracy by language familiarity between video-only clips and their audiovisual counterparts, with error bars representing 95% CI. As seen in …

Original abstract

One of the current principal defenses against weaponized synthetic media continues to be the ability of the targeted individual to visually or auditorily recognize AI-generated content when they encounter it. However, as the realism of synthetic media continues to rapidly improve, it is vital to have an accurate understanding of just how susceptible people currently are to potentially being misled by convincing but false AI generated content. We conducted a perceptual study with 1276 participants to assess how capable people were at distinguishing between authentic and synthetic images, audio, video, and audiovisual media. We find that on average, people struggled to distinguish between synthetic and authentic media, with the mean detection performance close to a chance level performance of 50%. We also find that accuracy rates worsen when the stimuli contain any degree of synthetic content, features foreign languages, and the media type is a single modality. People are also less accurate at identifying synthetic images when they feature human faces, and when audiovisual stimuli have heterogeneous authenticity. Finally, we find that higher degrees of prior knowledgeability about synthetic media does not significantly impact detection accuracy rates, but age does, with older individuals performing worse than their younger counterparts. Collectively, these results highlight that it is no longer feasible to rely on the perceptual capabilities of people to protect themselves against the growing threat of weaponized synthetic media, and that the need for alternative countermeasures is more critical than ever before.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Large multi-modal study finds near-chance detection of AI media, but the result hinges on whether the stimuli match current real-world generators.

The main thing to know is that this paper ran a 1276-participant study and reports detection accuracy near 50% for synthetic versus real images, video, audio, and audiovisual clips, with lower accuracy on faces, foreign languages, and mixed-authenticity clips. Age mattered but self-reported knowledge did not. That is the core empirical contribution. It is new in the sense that it tests four modalities together with several moderators in one sample, which earlier single-modality papers did not do at this scale. The data collection itself looks like standard perceptual-experiment practice and supplies a useful set of measured accuracies.

The soft spot is exactly the one flagged in the stress-test note. The headline claim that human perception is no longer a viable defense rests on the synthetic stimuli being representative of what an ordinary person would actually encounter. The abstract gives no generator details, no post-processing description, and no selection criteria, so it is impossible to judge whether the near-chance result would hold for today's best models or for unfiltered outputs. Without those methods the strong policy conclusion does not yet follow from the numbers.

This paper is for HCI and AI-safety readers who need quantitative baselines on detection performance. A serious referee should see it because the sample size and multi-modal scope make the raw data worth checking, even if the interpretation will need tightening on stimulus construction and statistical reporting.

Referee Report

3 major / 2 minor

Summary. The paper reports a perceptual study with 1276 participants evaluating human ability to detect AI-generated vs. authentic images, videos, audio, and audiovisual stimuli. It claims mean detection accuracy is near chance (50%), with lower accuracy for synthetic content, single-modality media, foreign languages, human faces in images, and heterogeneous audiovisual authenticity; prior knowledge has no effect but age does (older participants worse). The conclusion is that human perceptual detection is no longer a viable defense against synthetic media.

Significance. If the synthetic stimuli are representative of current generative models, the near-chance result would provide direct empirical support for shifting from human vigilance to technical countermeasures in HCI, security, and misinformation research. The large sample and multi-modality design strengthen generalizability within the tested conditions.

major comments (3)

[Methods] Methods section: no stimulus-generation protocol, model names/versions, training data, resolution matching, post-processing steps, or selection/exclusion criteria for the synthetic set are supplied. This directly undermines evaluation of the central claim that detection performance near 50% generalizes to media an ordinary person might encounter.
[Results] Results and Abstract: headline percentages and statistical claims are presented without confidence intervals, exact p-values, effect sizes, or full demographic breakdown, preventing assessment of whether the 'close to chance' finding is robust or driven by specific subgroups.
[Discussion] Discussion: the claim that 'it is no longer feasible to rely on the perceptual capabilities of people' rests on the untested assumption that the tested stimuli match the current state of generative models; without that, the policy implication does not follow from the data.

minor comments (2)

[Abstract] Abstract and Results: 'mean detection performance close to a chance level' should be accompanied by the exact mean and standard deviation or CI in the abstract itself.
[Figures/Tables] Table/Figure captions: ensure all stimuli characteristics (e.g., resolution, duration) are reported so readers can judge representativeness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript's transparency and precision without altering its core findings.

Referee: [Methods] Methods section: no stimulus-generation protocol, model names/versions, training data, resolution matching, post-processing steps, or selection/exclusion criteria for the synthetic set are supplied. This directly undermines evaluation of the central claim that detection performance near 50% generalizes to media an ordinary person might encounter.

Authors: We agree that comprehensive details on stimulus generation are necessary to evaluate generalizability. The original submission referenced the models in supplementary materials but did not fully integrate them into the main Methods section. In revision we will add a dedicated subsection detailing the exact generative models and versions (e.g., Stable Diffusion variants for images, specific audio and video models), training data sources, resolution and format matching procedures, post-processing steps, and explicit inclusion/exclusion criteria for the synthetic stimuli. This will directly address the concern and allow readers to assess representativeness.

Revision: yes

Referee: [Results] Results and Abstract: headline percentages and statistical claims are presented without confidence intervals, exact p-values, effect sizes, or full demographic breakdown, preventing assessment of whether the 'close to chance' finding is robust or driven by specific subgroups.

Authors: We accept this criticism. The revised manuscript will report 95% confidence intervals around all accuracy percentages, exact p-values for all statistical comparisons, effect sizes (Cohen's d or equivalent), and a complete demographic table (age, gender, education, prior exposure) for the full sample of 1276 participants. These additions will be placed in both the Results section and, where appropriate, the Abstract to enable proper evaluation of robustness.

Revision: yes

Referee: [Discussion] Discussion: the claim that 'it is no longer feasible to rely on the perceptual capabilities of people' rests on the untested assumption that the tested stimuli match the current state of generative models; without that, the policy implication does not follow from the data.

Authors: We will revise the Discussion to qualify the policy claim more precisely. The revised text will explicitly state that the stimuli were produced with models that represented the state of the art at the time of data collection (late 2023–early 2024), note the specific models used, and acknowledge that subsequent generations may be even harder to detect. We will add a dedicated limitations paragraph discussing temporal specificity and the trajectory of generative improvement, thereby grounding the conclusion in the tested conditions while still highlighting the practical implications for countermeasures.

Revision: partial
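
The second response above mentions effect sizes such as Cohen's d. As a rough illustration of what that statistic captures, the sketch below computes d with a pooled standard deviation for two groups of per-participant accuracy scores. The scores, the group labels, and the function name `cohens_d` are hypothetical, and nothing here is taken from the paper's analysis.

```python
import math
import statistics

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d for two independent groups, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    if pooled_sd == 0:
        raise ValueError("pooled standard deviation is zero")
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical per-participant accuracy scores for two age groups.
younger = [0.60, 0.50, 0.70]
older = [0.40, 0.50, 0.60]
print(f"Cohen's d = {cohens_d(younger, older):.2f}")
```

Reporting d alongside a p-value separates the size of a group difference from its statistical detectability, which matters in a sample of this size where small differences can reach significance.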

Circularity Check

0 steps flagged

Pure empirical measurement study with no derivation chain

The paper reports results from a human perceptual experiment (N=1276) measuring detection accuracy for synthetic vs authentic media. No equations, parameters, or derivations are present. The central claim (mean accuracy near 50%) is a direct statistical summary of participant responses, not a prediction derived from fitted inputs or self-citations. No load-bearing self-citation chains, uniqueness theorems, or ansatzes appear. The study is self-contained as an empirical measurement against external benchmarks (participant performance), warranting score 0 per the rules for papers without mathematical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical perceptual study; no mathematical free parameters, axioms, or invented entities underpin the central performance claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explainable Detection of Machine Generated Music and Early Systematic Evaluation
cs.SD · 2024-12 · unverdicted · novelty 5.0

The authors provide the first systematic benchmark of traditional ML, DNN, Transformer, state-space, and multimodal models for machine-generated music detection, augmented with XAI analysis, and report ResNet18 as the...
