Is it Cake or is it AI? A Systematic Review of Human Uncertainty in Distinguishing Generative Artificial Intelligence Content
Pith reviewed 2026-05-13 17:42 UTC · model grok-4.3
The pith
Humans detect generative AI content at chance levels across text, images, and voice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The review of 30 empirical studies shows that human detection accuracy for generative AI content varies widely but clusters around chance performance, indicating that people are generally unreliable detectors of such content across text, image, and voice modalities.
What carries the argument
Aggregation of measured detection accuracy rates from controlled studies covering text, image, and voice modalities.
If this is right
- Trust in digital content would need to rest on signals other than human-detectable authenticity.
- Media evaluation practices may shift away from individual verification of origin.
- Misinformation countermeasures that rely on users spotting fakes become less effective.
- Platform and policy approaches to synthetic media must account for widespread inability to detect it.
Where Pith is reading between the lines
- Training or education aimed at improving detection may face inherent limits if performance stays near chance.
- Legal standards that assume people can identify synthetic media may require revision.
- Hybrid detection systems combining human judgment with automated tools gain practical importance.
Load-bearing premise
The 30 included studies form an unbiased sample of human detection performance without major publication bias or inconsistent measurement methods across modalities.
What would settle it
A large new study finding that humans consistently achieve accuracy well above chance (50 percent in a balanced two-choice task) when distinguishing AI-generated content from human-produced content in the same modalities would challenge the central claim.
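As a rough illustration of what "above chance" could mean for a single study, the sketch below computes a one-sided exact binomial p-value. It assumes a balanced two-alternative task, so chance is 0.5. The function name `binomial_p_above_chance` and the counts are hypothetical and are not taken from the review or any study it includes.

```python
from math import comb


def binomial_p_above_chance(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) if true accuracy equals chance."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical counts for illustration only (assumption, not reported data).
near_chance = binomial_p_above_chance(correct=52, trials=100)
well_above = binomial_p_above_chance(correct=70, trials=100)

print(f"52/100 correct: p = {near_chance:.3f}")
print(f"70/100 correct: p = {well_above:.6f}")
```

A single significant result would not settle the question by itself; the claim concerns consistent performance across modalities, so replication and sample size matter as much as any one p-value.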
Original abstract
This systematic review synthesized empirical evidence on human ability to distinguish generative artificial intelligence content from human produced content across text, image, and voice modalities. A structured search of Scopus identified 22,541 records from 2025 to 2026, of which 1200 were screened and 30 studies were included. Across these studies, human detection accuracy varied widely but generally clustered around chance performance. Overall, the literature shows that humans are generally unreliable detectors of gen AI content, raising broader questions about whether the ability to tell should matter for how we evaluate or trust content.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This systematic review synthesizes empirical evidence on human ability to distinguish generative AI content from human-produced content across text, image, and voice modalities. A Scopus search from 2025-2026 identified 22,541 records, with 1,200 screened and 30 studies included. The synthesis finds that human detection accuracy varies widely but generally clusters around chance levels, leading to the conclusion that humans are generally unreliable detectors of gen AI content and raising questions about the role of detection ability in content trust and evaluation.
Significance. If the synthesis is robust, the review consolidates cross-modal evidence on a timely issue in AI ethics and media studies, highlighting limitations of human oversight for generative content. It provides a broad overview that could inform policy, education, and development of automated detection tools, while explicitly noting the need to question reliance on human judgment for authenticity.
major comments (3)
- [Methods] Methods section: Inclusion criteria, screening process, and quality assessment are insufficiently detailed. No risk-of-bias tool is applied to the 30 studies, and no explicit list of excluded studies or reasons is provided, undermining confidence in the representativeness of the sample for the 'around chance' synthesis.
- [Results] Results section: The claim that accuracies 'generally clustered around chance performance' relies on qualitative description without meta-analytic pooling, heterogeneity statistics (e.g., I²), or subgroup analysis by modality. This leaves the central finding vulnerable to influence from heterogeneous methods (forced-choice vs. ratings; text vs. image vs. voice) and potential publication bias.
- [Discussion] Discussion section: Broader implications for content trust are drawn from the synthesis, but without addressing Scopus-only search limitations, gray literature exclusion, or cross-study measurement inconsistencies, the generalizability of the 'unreliable detectors' conclusion requires stronger justification.
minor comments (2)
- [Abstract] Abstract: The date range '2025 to 2026' appears anomalous given current publication timelines and should be clarified or corrected.
- Consider including a PRISMA flow diagram to document the record screening and inclusion process for greater transparency.
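The heterogeneity statistic named in the Results comment (I²) can be sketched as follows. This is a minimal illustration under stated assumptions: each study reports a single accuracy strictly between 0 and 1 and a sample size, raw proportions are pooled with fixed-effect inverse-variance weights, and the function name `i_squared` and all input values are hypothetical. A real meta-analysis would more likely transform the proportions (for example with a logit) and use a dedicated package.

```python
def i_squared(accuracies, sample_sizes):
    """Return (pooled accuracy, Cochran's Q, I-squared in percent).

    Uses fixed-effect inverse-variance weights on raw proportions,
    which is a simplification chosen for readability.
    """
    if len(accuracies) != len(sample_sizes) or len(accuracies) < 2:
        raise ValueError("need at least two studies with matching lengths")
    # Sampling variance of a proportion is p(1 - p) / n, so the weight is n / (p(1 - p)).
    weights = [n / (p * (1 - p)) for p, n in zip(accuracies, sample_sizes)]
    pooled = sum(w * p for w, p in zip(weights, accuracies)) / sum(weights)
    q = sum(w * (p - pooled) ** 2 for w, p in zip(weights, accuracies))
    df = len(accuracies) - 1
    # I-squared is the share of total variation beyond what sampling error explains.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical study results (assumption, not data from the review).
pooled, q, i2 = i_squared([0.45, 0.50, 0.55, 0.75], [200, 200, 200, 200])
print(f"pooled accuracy = {pooled:.3f}, Q = {q:.1f}, I^2 = {i2:.0f}%")
```

A high I² in a calculation like this would support the concern that a single "around chance" summary hides real differences between tasks or modalities.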
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our systematic review. We have carefully considered each major comment and agree that several clarifications and expansions will strengthen the manuscript. Below we respond point by point, indicating the revisions we plan to implement.
Point-by-point responses
- Referee: [Methods] Methods section: Inclusion criteria, screening process, and quality assessment are insufficiently detailed. No risk-of-bias tool is applied to the 30 studies, and no explicit list of excluded studies or reasons is provided, undermining confidence in the representativeness of the sample for the 'around chance' synthesis.
  Authors: We agree that the methods section requires greater transparency. In the revised manuscript we will expand the inclusion/exclusion criteria with explicit operational definitions, provide a detailed account of the screening process (including number of independent reviewers and disagreement resolution), and describe our quality assessment approach. We will add a PRISMA flow diagram that reports reasons for exclusion at each stage. Although standard risk-of-bias tools are not ideally suited to the heterogeneous experimental designs in this literature, we will include a narrative quality appraisal of the 30 studies and note their limitations. These additions will directly address concerns about representativeness.
  Revision: yes
- Referee: [Results] Results section: The claim that accuracies 'generally clustered around chance performance' relies on qualitative description without meta-analytic pooling, heterogeneity statistics (e.g., I²), or subgroup analysis by modality. This leaves the central finding vulnerable to influence from heterogeneous methods (forced-choice vs. ratings; text vs. image vs. voice) and potential publication bias.
  Authors: We acknowledge that a quantitative synthesis would be desirable but maintain that the extreme heterogeneity in outcome measures, experimental paradigms, and modalities makes formal meta-analysis inappropriate and potentially misleading. In revision we will add (1) a table of individual study accuracies with modality and task-type annotations, (2) a narrative assessment of heterogeneity, (3) modality-specific subgroup summaries, and (4) explicit discussion of possible publication bias. These changes will support the qualitative claim with greater rigor while avoiding over-interpretation of pooled statistics.
  Revision: partial
- Referee: [Discussion] Discussion section: Broader implications for content trust are drawn from the synthesis, but without addressing Scopus-only search limitations, gray literature exclusion, or cross-study measurement inconsistencies, the generalizability of the 'unreliable detectors' conclusion requires stronger justification.
  Authors: We will revise the discussion to explicitly acknowledge these limitations. We will note the restriction to Scopus, discuss the potential impact of excluding gray literature, and elaborate on measurement inconsistencies across studies (e.g., forced-choice vs. continuous ratings). At the same time we will argue that the consistent pattern of near-chance performance across the included studies still supports the core conclusion, while qualifying the generalizability claims accordingly.
  Revision: yes
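The table of individual study accuracies with modality and task-type annotations, and the modality-specific summaries promised in the response to the Results comment, could be organized roughly as in the sketch below. The rows, study labels, and the helper `summarize_by_modality` are hypothetical placeholders, not values extracted from the 30 included studies.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (study label, modality, task type, accuracy). Placeholder values only.
studies = [
    ("Study A", "text", "forced-choice", 0.52),
    ("Study B", "text", "rating", 0.48),
    ("Study C", "image", "forced-choice", 0.61),
    ("Study D", "image", "forced-choice", 0.55),
    ("Study E", "voice", "rating", 0.58),
]


def summarize_by_modality(rows):
    """Group accuracies by modality and report count, median, and range."""
    grouped = defaultdict(list)
    for _label, modality, _task, accuracy in rows:
        grouped[modality].append(accuracy)
    return {
        modality: {
            "n_studies": len(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for modality, values in grouped.items()
    }


summary = summarize_by_modality(studies)
for modality, stats in sorted(summary.items()):
    print(
        f"{modality}: n={stats['n_studies']}, median={stats['median']:.2f}, "
        f"range={stats['min']:.2f} to {stats['max']:.2f}"
    )
```

Reporting a median and range per modality, rather than a pooled mean, is in keeping with the authors' stated wish to avoid over-interpreting pooled statistics.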
Circularity Check
No circularity: systematic review synthesizes external studies
Full rationale
The paper is a systematic review that identifies 30 external studies via Scopus search and summarizes their reported human detection accuracies. The central claim (humans cluster around chance performance) is an aggregation of independent empirical results from those studies, not a derivation from the review's own fitted parameters, self-defined quantities, or self-citation chain. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's inputs by construction. This is the expected non-finding for a literature synthesis whose evidence base lies outside the review itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Scopus search from 2025-2026 plus the screening process identified all relevant empirical studies on human detection of generative AI content.