How do people watch AI-generated videos of physical scenes?

Ayush Tewari; Danqing Shi; Katherine M. Collins; Lan Jiang; Miri Zilka; Shangzhe Wu

arxiv: 2602.03374 · v2 · submitted 2026-02-03 · 💻 cs.HC

Danqing Shi, Lan Jiang, Katherine M. Collins, Shangzhe Wu, Ayush Tewari, Miri Zilka

Pith reviewed 2026-05-16 07:50 UTC · model grok-4.3

keywords: AI-generated videos, eye tracking, gaze behavior, authenticity perception, video understanding, AI detection, human visual attention, physical scenes

The pith

Gaze when watching videos of physical scenes tracks perceived authenticity more than actual real or AI origin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tracks the eye movements of 40 participants while they watch a mix of real and AI-generated videos of physical scenes, first to understand the content and then to detect whether each clip is synthetic. It shows that with today's high realism, where low-level visual flaws are scarce, attention patterns follow what the viewer believes about the video's source rather than the source itself. If this holds, simply knowing a video might be AI-generated changes consumption from passive watching into deliberate hunting for anomalies. The work therefore links human visual attention directly to the growing problem of synthetic media eroding trust.

Core claim

The central claim is that participants' gaze behavior in both the video-understanding and AI-detection tasks was driven primarily by their inferred perception of authenticity rather than by whether the video was actually real or AI-generated. Because the selected AI videos were realistic enough to avoid obvious low-level artifacts, cognitive expectations about possible generation overrode objective differences in directing fixations and scan paths.

What carries the argument

Eye-tracking recordings of fixation locations and scan paths compared across real and AI-generated videos in two tasks, with perception of authenticity inferred from task performance accuracy.
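Fixation locations and scan paths have to be derived from raw gaze samples before they can be compared. The summary does not say which event-detection method the authors used, so the following is only a minimal sketch of one common option, a dispersion-threshold (I-DT style) detector. The sample format, the function names, and the thresholds (`max_dispersion_px`, `min_duration_ms`) are assumptions chosen for illustration, not values from the paper.

```python
from typing import List, Tuple

# Hypothetical sample format: (timestamp in ms, x in pixels, y in pixels).
Sample = Tuple[int, float, float]
# A fixation is (start_ms, end_ms, centroid_x, centroid_y).
Fixation = Tuple[int, int, float, float]


def dispersion(window: List[Sample]) -> float:
    """Spread of a window: (max x - min x) + (max y - min y)."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def detect_fixations(
    samples: List[Sample],
    max_dispersion_px: float = 30.0,  # assumed threshold
    min_duration_ms: int = 100,       # assumed threshold
) -> List[Fixation]:
    """Group consecutive, tightly clustered samples into fixations."""
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i < n:
        # Grow the initial window until it spans the minimum duration.
        j = i
        while j < n and samples[j][0] - samples[i][0] < min_duration_ms:
            j += 1
        if j >= n:
            break  # not enough data left for another fixation
        if dispersion(samples[i:j + 1]) <= max_dispersion_px:
            # Extend the window while the samples stay tightly clustered.
            while j + 1 < n and dispersion(samples[i:j + 2]) <= max_dispersion_px:
                j += 1
            window = samples[i:j + 1]
            cx = sum(s[1] for s in window) / len(window)
            cy = sum(s[2] for s in window) / len(window)
            fixations.append((window[0][0], window[-1][0], cx, cy))
            i = j + 1
        else:
            i += 1  # slide past a sample that belongs to a saccade
    return fixations
```

The ordered list of fixation centroids returned here is one simple representation of a scan path. A velocity-based detector would be an equally reasonable choice.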

If this is right

Viewers actively search for anomalies once they suspect a video may be AI-generated.
Gaze patterns could serve as an implicit behavioral signal for perceived authenticity in future detection systems.
Video-generation models must remove not only low-level artifacts but also higher-level perceptual cues that trigger suspicion.
Media platforms that label possible AI content may increase user scrutiny and reduce passive acceptance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platform warnings that a video might be synthetic could permanently heighten baseline vigilance even in later unlabeled viewing.
The same perception-driven attention shift might appear in still images or audio, suggesting a general mechanism across synthetic media.
Training people on AI detection could alter their default gaze behavior on all videos, not only during explicit detection tasks.

Load-bearing premise

Participants' beliefs about whether a video is AI-generated can be accurately inferred from how well they perform the understanding and detection tasks.

What would settle it

Eye-tracking data showing that gaze patterns differ systematically between actually real and actually AI-generated videos even when task performance indicates the viewer perceives both categories the same way.
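One way to picture this test is as a two-by-two breakdown of a per-trial gaze metric by actual source and by the viewer's judgment. The sketch below is a hedged illustration only: the trial keys (`actual`, `response`, `metric`), the function names, and the toy numbers are all hypothetical, and the detection response is used as a stand-in for perceived authenticity, which is itself the premise questioned above.

```python
from collections import defaultdict
from statistics import mean


def cell_means(trials):
    """Mean gaze metric for each (actual, response) cell.

    Each trial is a dict with hypothetical keys:
      'actual'   -> 'real' or 'ai' (ground truth)
      'response' -> 'real' or 'ai' (the viewer's detection answer)
      'metric'   -> any per-trial gaze summary, e.g. fixation count
    """
    cells = defaultdict(list)
    for t in trials:
        cells[(t["actual"], t["response"])].append(t["metric"])
    return {key: mean(values) for key, values in cells.items()}


def contrasts(means):
    """Average effect of each factor with the other factor held fixed.

    Raises KeyError if any of the four cells has no trials.
    """
    # Judged-AI minus judged-real, within each actual source.
    perceived = mean(means[(a, "ai")] - means[(a, "real")] for a in ("real", "ai"))
    # Actually-AI minus actually-real, within each judgment.
    actual = mean(means[("ai", p)] - means[("real", p)] for p in ("real", "ai"))
    return {"perceived": perceived, "actual": actual}


# Made-up numbers purely to exercise the functions; not data from the paper.
toy_trials = [
    {"actual": "real", "response": "real", "metric": 10},
    {"actual": "real", "response": "ai", "metric": 14},
    {"actual": "ai", "response": "real", "metric": 11},
    {"actual": "ai", "response": "ai", "metric": 15},
]
result = contrasts(cell_means(toy_trials))
```

Under this framing, the claim would be challenged by data in which the `actual` contrast stays clearly nonzero while the `perceived` contrast is near zero. The misjudged cells (`real` judged `ai`, `ai` judged `real`) are the ones that separate the two factors.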

Original abstract

The growing prevalence of realistic AI-generated videos on media platforms increasingly blurs the line between fact and fiction, eroding public trust. Understanding how people watch AI-generated videos offers a human-centered perspective for improving AI detection and guiding advancements in video generation. However, existing studies have not investigated human gaze behavior in response to AI-generated videos of physical scenes. Here, we collect and analyze the eye movements from 40 participants during video understanding and AI detection tasks involving a mix of real-world and AI-generated videos. We find that given the high realism of AI-generated videos, gaze behavior is driven less by the video's actual authenticity and more by the viewer's perception of its authenticity. Our results demonstrate that the mere awareness of potential AI generation may alter media consumption from passive viewing into an active search for anomalies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Eye-tracking shows gaze on realistic AI videos tracks perceived authenticity more than actual source, but thin methods leave the result hard to trust.

The central finding is that participants' gaze patterns during video viewing aligned more with whether they judged the clip as AI-generated than with its actual status. The authors collected eye movements from 40 people across understanding and detection tasks on mixed real and synthetic physical-scene videos, and the data pointed to awareness of possible AI generation shifting viewing toward active anomaly search rather than passive watching.

That application of gaze tracking to this specific stimulus set is the clearest new piece. Prior work has looked at detection accuracy or general media trust, but not at how attention moves when the content is this realistic. The two-task design is a reasonable way to separate comprehension from explicit scrutiny.

The soft spots sit in the execution details. No participant demographics, video selection criteria, statistical tests, or error bars appear in the description, so the size and reliability of the effect stay unclear. More importantly, the design ties perception measurement to the detection task itself; without a passive-viewing baseline or separate post-viewing realism ratings, any gaze difference could stem from the instruction to hunt for fakes rather than from spontaneous perception. Low-level generation artifacts could produce similar shifts without the perception mechanism doing the work.

The paper is for HCI and media-psychology readers who want empirical pointers on attention to synthetic video. Someone building detection systems or studying trust erosion could pull ideas from the setup, but they would need the full methods and controls before treating the result as solid.

Send it to peer review. The question is timely and the experiment is straightforward to run; with added statistics, baselines, and clearer separation of task effects it could be worth the referees' time.

Referee Report

3 major / 1 minor

Summary. The paper reports an eye-tracking experiment with 40 participants who viewed a mix of real-world and AI-generated videos of physical scenes while performing video-understanding and AI-detection tasks. The central claim is that, given the high realism of current AI videos, gaze patterns are driven primarily by viewers' perception of authenticity rather than by actual authenticity, and that mere awareness of possible AI generation shifts viewing from passive consumption to active anomaly search.

Significance. If the result holds after methodological clarification, the work supplies timely empirical evidence on human attention to realistic synthetic media, with direct relevance to media literacy, AI detection interfaces, and video-generation research. The use of objective gaze data rather than self-report alone is a positive feature.

major comments (3)

[Abstract] Abstract: the claim that gaze follows perceived rather than actual authenticity requires showing that gaze patterns align with subjective judgments even on trials where those judgments are incorrect; no such analysis or independent validation of perception (separate from the binary detection response) is described.
[Methods] Experimental design: the two-task protocol (understanding vs. detection) lacks a passive-viewing baseline without detection instructions or post-viewing realism ratings, so any gaze shift could be produced by the anomaly-search instruction itself rather than by spontaneous perception of authenticity.
[Results] Results: without reported statistical details, participant demographics, video-selection criteria, error bars, or the precise analysis linking gaze metrics to perceived vs. actual authenticity, it is impossible to assess whether the data support the central claim.

minor comments (1)

[Abstract] Abstract should include a brief statement of key statistical outcomes and sample characteristics to allow readers to evaluate the strength of the reported effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address each major comment below and will make revisions to clarify and strengthen the presentation of our findings.

Referee: [Abstract] Abstract: the claim that gaze follows perceived rather than actual authenticity requires showing that gaze patterns align with subjective judgments even on trials where those judgments are incorrect; no such analysis or independent validation of perception (separate from the binary detection response) is described.

Authors: We agree that explicitly demonstrating the alignment on incorrect trials would bolster the claim. Our results section does compare gaze behavior between accurate and inaccurate AI detections, showing differences consistent with perception-driven viewing. To address this directly, we will add a new analysis in the revised manuscript that examines gaze metrics specifically on trials where participants' judgments were incorrect, correlating them with their perceived authenticity.
revision: yes

Referee: [Methods] Experimental design: the two-task protocol (understanding vs. detection) lacks a passive-viewing baseline without detection instructions or post-viewing realism ratings, so any gaze shift could be produced by the anomaly-search instruction itself rather than by spontaneous perception of authenticity.

Authors: The video-understanding task was designed to serve as a baseline for natural viewing without explicit AI-detection instructions. However, we recognize the value of a purely passive condition and post-viewing realism ratings. We will revise the methods to include post-viewing realism ratings collected after each video and discuss the understanding task as approximating passive viewing. We will also add a limitations section addressing the absence of a no-task baseline.
revision: partial

Referee: [Results] Results: without reported statistical details, participant demographics, video-selection criteria, error bars, or the precise analysis linking gaze metrics to perceived vs. actual authenticity, it is impossible to assess whether the data support the central claim.

Authors: We apologize for the omission of these details in the initial submission. The revised manuscript will include: full statistical tests with p-values and effect sizes, participant demographics table, detailed video selection criteria (including how AI videos were generated and matched to real ones), error bars on all figures, and an expanded methods/results section describing the exact gaze metrics (e.g., fixation duration, saccade patterns) and how they were linked to perceived authenticity via regression or correlation analyses.
revision: yes
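To make the requested statistical reporting concrete, here is a hedged sketch of one simple test that could accompany such an analysis. It is a two-sample permutation test on a gaze metric, which is a swapped-in technique for illustration and not the regression or correlation analysis the rebuttal mentions. The function name, parameters, and grouping (trials judged AI versus trials judged real) are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    group_a, group_b: per-trial gaze metrics (e.g. mean fixation duration)
    for two hypothetical groups such as 'judged AI' and 'judged real'.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

This treats trials as exchangeable, which ignores that each participant contributes many trials. A real analysis would need to respect that nesting, for example by permuting within participants or by using a mixed-effects model.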

Circularity Check

0 steps flagged

No circularity: purely empirical behavioral study

The paper collects and analyzes eye-tracking data from 40 participants performing video understanding and AI detection tasks on real and AI-generated videos. No equations, parameters, derivations, or theoretical models are present that could reduce to self-definition or fitted inputs. Claims about gaze being driven by perceived authenticity rest on direct measurements of task performance and gaze patterns, not on any chain that equates outputs to inputs by construction. Self-citations, if present, are not load-bearing for any derivation since none exists.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical user study relying on standard behavioral data collection and analysis practices rather than new theoretical constructs.

axioms (1)

standard math: Standard assumptions of eye-tracking data analysis and statistical inference hold for the collected gaze data. Implicit in any eye-movement study; no custom axioms introduced.

