arxiv: 2606.28510 · v1 · pith:OIYPBGPZnew · submitted 2026-06-26 · 💻 cs.HC · cs.AI · cs.CY

Generative AI Literacy Training Improves Intelligence Analysts' Discrimination of Real and AI-Generated Images

Pith reviewed 2026-06-30 01:05 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CY
keywords generative AI · image discrimination · training intervention · intelligence analysts · deepfake detection · visual misinformation · human judgment · AI literacy

The pith

A 30-minute training raises intelligence analysts' accuracy distinguishing real from AI-generated images by 9 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a brief structured training improves professional intelligence analysts' ability to separate authentic photographs from those produced by generative AI systems. Thirty-two analysts completed pre- and post-training judgments on matched pairs of real and synthetic images that varied in pose and scene. Overall accuracy rose from a 72 percent baseline to 81 percent, with the increase driven by a 14.2 percentage point improvement in correctly labeling real images as real. The counterbalanced within-subject design supports attributing the change to the training rather than repeated exposure or image selection.

Core claim

A 30-minute expert-led training that presents visual patterns from seven real and fifty AI-generated images increases intelligence analysts' discrimination accuracy by 9 percentage points from a 72 percent baseline, with the effect driven by a 14.2 percentage point gain in correctly identifying real images.

What carries the argument

The 30-minute training intervention that highlights patterns in real and AI-generated images, measured through pre-post image judgments in a counterbalanced within-subject design with matched image pairs.

If this is right

  • Accuracy gains concentrate on real images rather than AI-generated ones.
  • Training effects vary with participants' prior levels of digital forensics and generative AI experience.
  • The intervention shows differential effectiveness across image categories defined by pose complexity and scene context.
  • The results give organizations causal evidence that a brief training can help counter AI-generated visual misinformation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training format could be adapted and tested with other groups that routinely evaluate visual evidence.
  • Retention of accuracy gains could be measured by re-testing the same analysts after days or weeks.
  • Pairing the training with automated detection tools might produce combined performance above either method alone.

Load-bearing premise

The matched image pairs and counterbalanced within-subject design isolate the effect of the training content from practice effects, fatigue, or image-specific features.

What would settle it

A replication study with new matched image pairs that finds no pre-to-post accuracy increase would indicate the observed gains were not caused by the training content.

Figures

Figures reproduced from arXiv: 2606.28510 by Candice Rockell Gerstner, Jessica Hullman, Matthew Groh, Negar Kamali.

Figure 1: Our counterbalanced within-subject design. We split the stimulus set into two disjoint pools (Set A and Set B). Each participant evaluated 40 images from their assigned set in each phase. Assignment of images to Set A versus Set B was randomized, with the constraint that each half preserved the same scene-complexity composition as the full set (25% portrait, 25% full body, 25% posed group, 25…
Figure 2: Experiment interface for image evaluation. Partic…
Figure 3: Ranked stimulus- and pair-level training effects. (A) Real images, (B) AI-generated images, and (C) real–fake image…
Figure 4: Participant-level accuracy and training effects by expertise and tool use. Each point represents one participant. Pink points indicate group means, with error bars showing 95% bootstrapped confidence intervals within each group. Panels (a–b) show baseline (pre-training) accuracy grouped by self-reported digital forensics expertise and generative AI tool use. Panels (c–d) show participant-level training eff…
Figure 5: Artifact themes mentioned in analyst free-text…
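
The stratified random split described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the pool size of 80 stimuli, the image identifiers, and the function name `stratified_halves` are assumptions, and `fourth_category` is a placeholder because the caption is truncated before the fourth category is named. The sketch also ignores the matched real/AI pairing, which the paper would need to keep intact.

```python
import random
from collections import defaultdict

def stratified_halves(stimuli, seed=0):
    """Randomly split (image_id, category) tuples into two disjoint halves
    that each keep the same category composition as the full set."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for image_id, category in stimuli:
        by_category[category].append(image_id)

    set_a, set_b = [], []
    for category in sorted(by_category):
        ids = by_category[category][:]
        rng.shuffle(ids)              # randomize assignment within the category
        half = len(ids) // 2
        set_a.extend((i, category) for i in ids[:half])
        set_b.extend((i, category) for i in ids[half:])
    return set_a, set_b

# Hypothetical stimulus pool: 4 categories x 20 images (sizes are assumed).
categories = ["portrait", "full_body", "posed_group", "fourth_category"]
stimuli = [(f"img_{c}_{k}", c) for c in categories for k in range(20)]

set_a, set_b = stratified_halves(stimuli, seed=42)
print(len(set_a), len(set_b))
```

Splitting inside each category, rather than shuffling the whole pool, is what guarantees that both halves carry the same 25% share per category.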
Abstract

Across social and online platforms, people are increasingly exposed to AI-generated images. As a consequence, the task of distinguishing AI-generated from authentic images is becoming a central challenge for information ecosystems. While humans perform better than chance, accuracy falls short of many operational needs. Initial evidence shows that visually oriented training can improve deepfake detection but does not improve participants' ability to identify real images as real. Here, we investigate the efficacy of a brief training intervention for intelligence analysts employed by the United States government in 2024. We conducted a counterbalanced within-subject randomized experiment in which we showed participants real and AI-generated images varying in pose complexity and scene context and asked them whether each image was real or AI-generated, both before and after an expert delivered a 30-minute training that pointed out patterns in seven real and 50 AI-generated images. We collected 2,544 image-level judgments from 32 intelligence analysts. We find training increased overall accuracy by 9 percentage points (95% CI: [2.7, 15.4]) from a baseline of 72%. We find the improvement is driven by a 14.2 percentage point increase in accuracy for real images (95% CI: [0.7, 27.7]). Through a careful experimental setup that curated matched pairs of real and AI-generated images across pose complexity categories, we reveal how these trainings influence people with different levels of digital forensics and generative AI experience and identify the kind of image-based content where this training intervention appears to be most effective. Ultimately, these results provide causal evidence that a brief, structured training can improve human judgment across a diverse array of real and AI-generated images, informing organizational responses to AI-generated visual misinformation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper reports a counterbalanced within-subject experiment with 32 U.S. intelligence analysts who made 2,544 real/AI image judgments before and after a 30-minute expert training on visual patterns. It claims the training produced a 9 pp overall accuracy gain (95% CI [2.7, 15.4]) from a 72% baseline, driven by a 14.2 pp gain on real images (95% CI [0.7, 27.7]), and presents this as causal evidence that brief structured training improves discrimination across image types.

Significance. If the accuracy gains are attributable to the specific training content rather than repeated exposure, the result has applied value for government and organizational training programs addressing AI-generated visual misinformation. The use of domain-expert participants and matched real/AI image pairs across pose and context categories strengthens ecological relevance. The large number of judgments (2,544) and reporting of confidence intervals are positive features of the empirical design.

major comments (1)
  1. [Abstract] Abstract and experimental setup: The central causal claim that the 30-minute training produced the observed accuracy increases rests on the assumption that the pre-post within-subject design isolates the intervention from practice effects, task familiarization, or implicit learning across the two sessions. No parallel control arm (e.g., filler activity of comparable duration) is described, so general improvements from repeated testing cannot be separated from training-specific effects. This directly affects interpretation of both the 9 pp overall and 14.2 pp real-image gains.
minor comments (1)
  1. [Abstract] Abstract: The 95% CI for the real-image accuracy change ([0.7, 27.7]) is wide and includes values near zero; this should be discussed when interpreting the practical magnitude of the effect on real images.
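
A rough way to see how wide the real-image interval is relative to the overall one is to back out the standard error each interval implies. The sketch below assumes both intervals are symmetric normal-approximation intervals (estimate ± 1.96 × SE); the paper may have computed them differently, so the implied values are a back-of-envelope reading, not reported results. The function name `summarize_ci` is hypothetical.

```python
def summarize_ci(ci_low, ci_high, z_crit=1.96):
    """Return (midpoint, implied standard error, implied z) for a symmetric CI."""
    midpoint = (ci_low + ci_high) / 2
    std_err = (ci_high - ci_low) / 2 / z_crit   # half-width divided by critical value
    return midpoint, std_err, midpoint / std_err

# Intervals as reported in the abstract, in percentage points.
reported = {"overall accuracy": (2.7, 15.4), "real images": (0.7, 27.7)}
summary = {name: summarize_ci(lo, hi) for name, (lo, hi) in reported.items()}

for name, (mid, se, z) in summary.items():
    print(f"{name}: midpoint={mid:.2f} pp, implied SE={se:.2f} pp, implied z={z:.2f}")
```

The midpoints (9.05 and 14.2 pp) match the reported point estimates, which is consistent with the symmetry assumption. Under that assumption the real-image estimate carries roughly twice the standard error of the overall estimate, which is the imprecision the referee asks the authors to discuss.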

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for identifying a key limitation in our experimental design. We address this point directly below and agree that revisions are warranted to ensure accurate interpretation of our findings.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental setup: The central causal claim that the 30-minute training produced the observed accuracy increases rests on the assumption that the pre-post within-subject design isolates the intervention from practice effects, task familiarization, or implicit learning across the two sessions. No parallel control arm (e.g., filler activity of comparable duration) is described, so general improvements from repeated testing cannot be separated from training-specific effects. This directly affects interpretation of both the 9 pp overall and 14.2 pp real-image gains.

    Authors: We agree that this is a substantive limitation of the current design. The study used a pre-post within-subject approach with counterbalancing of image presentation order and categories, but did not include a parallel control condition (such as a filler task of equivalent duration). Consequently, we cannot fully isolate training-specific effects from general practice or familiarization effects. While the differential improvement on real-image identification (rather than uniform gains) is consistent with the training content, this does not rule out alternative explanations. We will revise the abstract, results, and discussion sections to qualify the causal language, replacing 'causal evidence' with phrasing that describes the observed association between the intervention and performance changes while explicitly noting the design limitation. We will also expand the limitations paragraph to recommend control-arm studies in future work. These changes will be incorporated in the revised manuscript.

    Revision: yes
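
The logic of the requested control arm can be made concrete with a difference-in-differences calculation. In the sketch below, the treated-arm figures are the 72% and 81% accuracies reported in this summary; the control-arm figures are entirely hypothetical, since the study had no control arm, and are chosen only to show how a practice effect would be subtracted out.

```python
def difference_in_differences(pre_t, post_t, pre_c, post_c):
    """Training-specific effect: treated gain minus control gain (percentage points)."""
    return (post_t - pre_t) - (post_c - pre_c)

# Treated arm: baseline and post-training accuracy from the summary above.
treated_pre, treated_post = 72.0, 81.0

# Control arm: HYPOTHETICAL values (no such arm exists in the study).
# They assume a 3 pp gain from repeated testing alone.
control_pre, control_post = 72.0, 75.0

did = difference_in_differences(treated_pre, treated_post, control_pre, control_post)
print(f"training-specific effect: {did:.1f} pp")  # prints "training-specific effect: 6.0 pp"
```

With these made-up control numbers, only 6 of the 9 percentage points would be attributed to the training content. Without a control arm, the size of the subtracted term is simply unknown.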

Circularity Check

0 steps flagged

No circularity: direct empirical measurement of pre-post accuracy changes

Full rationale

The paper reports measured accuracy improvements from a controlled within-subject experiment with 32 participants providing 2,544 judgments before and after a 30-minute training intervention. No equations, parameter fits, or derivations are present that reduce the reported 9 pp overall or 14.2 pp real-image gains to inputs by construction. The design description (counterbalanced matched pairs) and causal attribution rest on standard experimental logic rather than self-referential definitions or self-citation chains. This is a straightforward empirical report with no load-bearing steps that collapse to the paper's own fitted quantities or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical human-subjects experiment whose central claim rests on collected judgment data and the experimental protocol rather than new mathematical derivations or postulated entities.

axioms (1)
  • [standard math] Standard assumptions underlying calculation of 95% confidence intervals for paired accuracy differences (approximate normality or bootstrap validity).
    The reported CIs for the 9 pp and 14.2 pp improvements rely on these conventional statistical assumptions.
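
For readers unfamiliar with the bootstrap option named in this axiom, the following is a minimal sketch of a participant-level percentile bootstrap for a paired pre/post accuracy difference. It is not the authors' analysis: the per-participant accuracies are simulated, the noise levels are arbitrary, and the function name `bootstrap_ci_paired` is hypothetical. Only the sample size of 32 and the rough 72% baseline and 9 pp shift are taken from the paper summary.

```python
import random
import statistics

def bootstrap_ci_paired(pre, post, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (post - pre),
    resampling participants with replacement."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(statistics.fmean(sample))
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi

# SIMULATED per-participant accuracies for 32 analysts (not the study's data).
sim = random.Random(1)
pre = [min(1.0, max(0.0, sim.gauss(0.72, 0.10))) for _ in range(32)]
post = [min(1.0, max(0.0, p + sim.gauss(0.09, 0.15))) for p in pre]

mean_diff, ci_lo, ci_hi = bootstrap_ci_paired(pre, post)
print(f"mean change = {mean_diff:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Resampling whole participants, rather than individual judgments, respects the fact that each analyst's judgments are correlated; that is the "bootstrap validity" condition the axiom refers to.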


