
arxiv: 2604.03437 · v1 · submitted 2026-04-03 · 📊 stat.AP · cs.CY

Is it Cake or is it AI? A Systematic Review of Human Uncertainty in Distinguishing Generative Artificial Intelligence Content

Pith reviewed 2026-05-13 17:42 UTC · model grok-4.3

classification 📊 stat.AP cs.CY
keywords human detection · generative AI · AI-generated content · systematic review · text · images · voice · chance performance

The pith

Humans detect generative AI content at chance levels across text, images, and voice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This systematic review pulled together results from 30 studies that tested whether people can tell AI-generated material apart from human-created material. Detection accuracy in the studies clustered around 50 percent, which for a two-way human-or-AI judgment is no better than random guessing. A reader would care because this pattern calls into question the common assumption that we can reliably verify the origin of digital content we see or hear every day. The finding suggests that strategies for judging trustworthiness may have to move beyond trying to spot fakes by eye or ear.

Core claim

The review of 30 empirical studies shows that human detection accuracy for generative AI content varies widely but generally clusters around chance performance, indicating that people are unreliable detectors of such content across text, image, and voice modalities.

What carries the argument

Aggregation of measured detection accuracy rates from controlled studies covering text, image, and voice modalities.
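One way such an aggregation could be computed is sketched below. This is a minimal illustration, not the review's method: the paper reports a narrative synthesis, the study records are invented placeholders rather than values from the 30 included studies, and the names `StudyResult` and `weighted_accuracy_by_modality` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StudyResult:
    modality: str      # "text", "image", or "voice"
    n_trials: int      # number of human judgments in the study
    accuracy: float    # proportion of correct judgments, 0..1


def weighted_accuracy_by_modality(studies):
    """Return the trial-weighted mean accuracy for each modality."""
    correct = defaultdict(float)
    trials = defaultdict(int)
    for s in studies:
        correct[s.modality] += s.accuracy * s.n_trials
        trials[s.modality] += s.n_trials
    return {m: correct[m] / trials[m] for m in trials}


# Placeholder data for illustration only (NOT taken from the review).
studies = [
    StudyResult("text", 400, 0.50),
    StudyResult("text", 600, 0.55),
    StudyResult("image", 1000, 0.48),
    StudyResult("voice", 200, 0.60),
]
summary = weighted_accuracy_by_modality(studies)
```

Weighting by the number of judgments keeps a small study from counting as much as a large one; whether that is the right weighting depends on how each study defines a trial, which the review's abstract does not specify.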

If this is right

  • Trust in digital content would need to rest on signals other than human-detectable authenticity.
  • Media evaluation practices may shift away from individual verification of origin.
  • Misinformation countermeasures that rely on users spotting fakes become less effective.
  • Platform and policy approaches to synthetic media must account for widespread inability to detect it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training or education aimed at improving detection may face inherent limits if performance stays near chance.
  • Legal standards that assume people can identify synthetic media may require revision.
  • Hybrid detection systems combining human judgment with automated tools gain practical importance.

Load-bearing premise

The 30 included studies form an unbiased sample of human detection performance without major publication bias or inconsistent measurement methods across modalities.

What would settle it

A large new study that finds humans achieve consistent accuracy well above 50 percent when distinguishing AI-generated content from human content in the same modalities would challenge the central claim.
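As a rough sketch of what "well above 50 percent" could mean statistically, an exact binomial test compares an observed count of correct judgments against the chance level. The function name `binomial_two_sided_p` and the example counts are hypothetical, and the sketch assumes a balanced two-alternative task with independent judgments, which the included studies may not all satisfy.

```python
from math import comb


def binomial_two_sided_p(correct, n, p0=0.5):
    """Exact two-sided binomial p-value against chance level p0.

    Sums the probability of every outcome that is no more likely
    than the observed count under the null hypothesis. Intended for
    moderate n (up to about 1000); larger n would overflow the
    integer-to-float conversion used here.
    """
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[correct]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical counts, not results from any cited study.
p_near_chance = binomial_two_sided_p(104, 200)   # 52% correct
p_above_chance = binomial_two_sided_p(130, 200)  # 65% correct
```

Under these assumptions, 52 percent correct out of 200 judgments is compatible with guessing, while 65 percent out of 200 is not. A single significant study would still need replication across modalities to challenge the review's pattern.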

Original abstract

This systematic review synthesized empirical evidence on human ability to distinguish generative artificial intelligence content from human produced content across text, image, and voice modalities. A structured search of Scopus identified 22,541 records from 2025 to 2026, of which 1200 were screened and 30 studies were included. Across these studies, human detection accuracy varied widely but generally clustered around chance performance. Overall, the literature shows that humans are generally unreliable detectors of gen AI content, raising broader questions about whether the ability to tell should matter for how we evaluate or trust content.
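The screening funnel stated in the abstract can be put in proportion with simple arithmetic. The snippet below uses only the three counts given above; the variable names are illustrative, and the abstract does not say how the 22,541 identified records were narrowed to the 1,200 that were screened.

```python
# Counts as stated in the abstract.
records_identified = 22_541
records_screened = 1_200
studies_included = 30

screened_share = records_screened / records_identified
included_share_of_screened = studies_included / records_screened
included_share_of_identified = studies_included / records_identified

print(f"Screened: {screened_share:.1%} of identified records")
print(f"Included: {included_share_of_screened:.1%} of screened records")
print(f"Included: {included_share_of_identified:.2%} of identified records")
```

Roughly one in nineteen identified records was screened, and one in forty screened records was included, which is why the selection steps matter for judging how representative the final 30 studies are.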

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This systematic review synthesizes empirical evidence on human ability to distinguish generative AI content from human-produced content across text, image, and voice modalities. A Scopus search from 2025-2026 identified 22,541 records, with 1,200 screened and 30 studies included. The synthesis finds that human detection accuracy varies widely but generally clusters around chance levels, leading to the conclusion that humans are generally unreliable detectors of gen AI content and raising questions about the role of detection ability in content trust and evaluation.

Significance. If the synthesis is robust, the review consolidates cross-modal evidence on a timely issue in AI ethics and media studies, highlighting limitations of human oversight for generative content. It provides a broad overview that could inform policy, education, and development of automated detection tools, while explicitly noting the need to question reliance on human judgment for authenticity.

major comments (3)
  1. [Methods] Methods section: Inclusion criteria, screening process, and quality assessment are insufficiently detailed. No risk-of-bias tool is applied to the 30 studies, and no explicit list of excluded studies or reasons is provided, undermining confidence in the representativeness of the sample for the 'around chance' synthesis.
  2. [Results] Results section: The claim that accuracies 'generally clustered around chance performance' relies on qualitative description without meta-analytic pooling, heterogeneity statistics (e.g., I²), or subgroup analysis by modality. This leaves the central finding vulnerable to influence from heterogeneous methods (forced-choice vs. ratings; text vs. image vs. voice) and potential publication bias.
  3. [Discussion] Discussion section: Broader implications for content trust are drawn from the synthesis, but without addressing Scopus-only search limitations, gray literature exclusion, or cross-study measurement inconsistencies, the generalizability of the 'unreliable detectors' conclusion requires stronger justification.
minor comments (2)
  1. [Abstract] Abstract: The date range '2025 to 2026' appears anomalous given current publication timelines and should be clarified or corrected.
  2. Consider including a PRISMA flow diagram to document the record screening and inclusion process for greater transparency.
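To make the request in major comment 2 concrete, the sketch below computes a fixed-effect pooled accuracy, Cochran's Q, and the I² heterogeneity statistic. It is an illustration under stated assumptions, not an analysis from the paper: the (correct, trials) pairs are invented placeholders, the function name `pooled_accuracy_and_i2` is hypothetical, and raw proportions are pooled for simplicity where a real meta-analysis would usually use a transformation and a random-effects model.

```python
def pooled_accuracy_and_i2(studies):
    """Fixed-effect inverse-variance pooling of proportions plus I².

    `studies` is a list of (correct, n_trials) tuples. Each observed
    proportion must lie strictly between 0 and 1, otherwise its
    sampling variance is zero and the weight is undefined.
    """
    estimates, weights = [], []
    for correct, n in studies:
        p = correct / n
        variance = p * (1 - p) / n          # binomial sampling variance
        estimates.append(p)
        weights.append(1 / variance)
    total_w = sum(weights)
    pooled = sum(w * p for w, p in zip(weights, estimates)) / total_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (p - pooled) ** 2 for w, p in zip(weights, estimates))
    df = len(studies) - 1
    # I²: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Placeholder (correct, n_trials) pairs, NOT data from the review.
hypothetical = [(210, 400), (270, 600), (520, 1000), (130, 200)]
pooled, q, i2 = pooled_accuracy_and_i2(hypothetical)
```

With these placeholder values the pooled accuracy sits close to 50 percent while I² is high, which illustrates the referee's concern: an average near chance can coexist with studies that differ substantially from one another.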

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our systematic review. We have carefully considered each major comment and agree that several clarifications and expansions will strengthen the manuscript. Below we respond point by point, indicating the revisions we plan to implement.

Point-by-point responses
  1. Referee: [Methods] Methods section: Inclusion criteria, screening process, and quality assessment are insufficiently detailed. No risk-of-bias tool is applied to the 30 studies, and no explicit list of excluded studies or reasons is provided, undermining confidence in the representativeness of the sample for the 'around chance' synthesis.

    Authors: We agree that the methods section requires greater transparency. In the revised manuscript we will expand the inclusion/exclusion criteria with explicit operational definitions, provide a detailed account of the screening process (including number of independent reviewers and disagreement resolution), and describe our quality assessment approach. We will add a PRISMA flow diagram that reports reasons for exclusion at each stage. Although standard risk-of-bias tools are not ideally suited to the heterogeneous experimental designs in this literature, we will include a narrative quality appraisal of the 30 studies and note their limitations. These additions will directly address concerns about representativeness.

    Revision: yes

  2. Referee: [Results] Results section: The claim that accuracies 'generally clustered around chance performance' relies on qualitative description without meta-analytic pooling, heterogeneity statistics (e.g., I²), or subgroup analysis by modality. This leaves the central finding vulnerable to influence from heterogeneous methods (forced-choice vs. ratings; text vs. image vs. voice) and potential publication bias.

    Authors: We acknowledge that a quantitative synthesis would be desirable but maintain that the extreme heterogeneity in outcome measures, experimental paradigms, and modalities makes formal meta-analysis inappropriate and potentially misleading. In revision we will add (1) a table of individual study accuracies with modality and task-type annotations, (2) a narrative assessment of heterogeneity, (3) modality-specific subgroup summaries, and (4) explicit discussion of possible publication bias. These changes will support the qualitative claim with greater rigor while avoiding over-interpretation of pooled statistics.

    Revision: partial

  3. Referee: [Discussion] Discussion section: Broader implications for content trust are drawn from the synthesis, but without addressing Scopus-only search limitations, gray literature exclusion, or cross-study measurement inconsistencies, the generalizability of the 'unreliable detectors' conclusion requires stronger justification.

    Authors: We will revise the discussion to explicitly acknowledge these limitations. We will note the restriction to Scopus, discuss the potential impact of excluding gray literature, and elaborate on measurement inconsistencies across studies (e.g., forced-choice vs. continuous ratings). At the same time we will argue that the consistent pattern of near-chance performance across the included studies still supports the core conclusion, while qualifying the generalizability claims accordingly.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: systematic review synthesizes external studies

Full rationale

The paper is a systematic review that identifies 30 external studies via Scopus search and summarizes their reported human detection accuracies. The central claim (humans cluster around chance performance) is an aggregation of independent empirical results from those studies, not a derivation from the review's own fitted parameters, self-defined quantities, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's inputs by construction. This is the expected non-finding for a literature synthesis whose evidence base lies outside the review itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a single-database search captured a representative set of studies and that the included papers used comparable detection tasks; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The Scopus search from 2025-2026 plus the screening process identified all relevant empirical studies on human detection of generative AI content.
    Standard systematic-review assumption invoked to justify the final sample of 30 studies.

pith-pipeline@v0.9.0 · 5389 in / 1085 out tokens · 43311 ms · 2026-05-13T17:42:13.918085+00:00 · methodology

