
arxiv: 2605.27348 · v2 · pith:6PPW2XG4new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Pith reviewed 2026-06-29 17:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords AI-generated image detection · social gaze consistency · semantic cues · generative models · vision-language models · diagnostic datasets · image forensics · gaze direction

The pith

Social gaze consistency between interacting people serves as a cue to detect AI-generated images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that recent generators have largely eliminated low-level artifacts like pixel fingerprints and frequency anomalies, especially in small person-centric edits surrounded by real content. It defines Social Gaze Consistency as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between people in an image, positioning this as an orthogonal high-level semantic detection axis. The work creates a controlled diagnostic dataset using region-specific gaze perturbations and strict pair-level grouping, paired with Block-Compositional Caption Supervision that keeps reasoning structure fixed across varied captions. This training raises balanced accuracy on interaction images for both vision-language and vision-only backbones while lifting recalls for real and fake classes alike. Gains transfer from one inpainter to multi-generator test sets due to shared diffusion-family weaknesses in periocular structure.

Core claim

Social Gaze Consistency constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. A controlled diagnostic dataset with region-specific perturbations and pair-level grouping, combined with Block-Compositional Caption Supervision holding a fixed 5-block reasoning skeleton across 1,250 captions, enables training that improves a vision-language backbone by 3.7 points on the COCOAI Interaction subset and 1.3 points on the Person subset, with parallel gains on a vision-only backbone. The same supervision produces simultaneous rises in real-class and fake-class recall and generalizes from one inpainter to multi-generator suites via paired-edit shortcut blocking…

What carries the argument

Social Gaze Consistency, defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals.
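
To make the gaze-direction part of this definition concrete, here is a minimal 2D sketch. It is an illustration only, not the paper's method: the function names, the flat 2D geometry, and the 15-degree tolerance are all assumptions, and head-eye alignment and pupil placement are not modeled.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two non-zero 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def mutual_gaze_consistent(pos_a, gaze_a, pos_b, gaze_b, tol_deg=15.0):
    """True if A looks toward B and B looks toward A, each within tol_deg.

    pos_* are (x, y) head positions, gaze_* are (dx, dy) gaze directions.
    tol_deg is a hypothetical tolerance chosen for illustration.
    """
    a_to_b = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    b_to_a = (-a_to_b[0], -a_to_b[1])
    return (angle_deg(gaze_a, a_to_b) <= tol_deg
            and angle_deg(gaze_b, b_to_a) <= tol_deg)

# Two people facing each other along the x-axis.
facing = mutual_gaze_consistent((0, 0), (1, 0), (10, 0), (-1, 0))
# Same layout, but B now looks straight up instead of at A.
averted = mutual_gaze_consistent((0, 0), (1, 0), (10, 0), (0, 1))
```

A real detector would not receive gaze vectors as input; the point is only that "mutual coherence" can be read as a constraint relating each person's gaze to the other person's position.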

If this is right

  • Detection accuracy rises simultaneously on real and fake classes rather than through a one-sided bias.
  • The same supervision improves both vision-language and vision-only backbones, showing backbone-agnostic utility.
  • Training on outputs from a single inpainter transfers to multi-generator test suites.
  • Hard-to-easy difficulty transfer occurs when the fixed reasoning skeleton is applied across surface-diverse captions.
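
The first implication above follows from how balanced accuracy is computed: it is the mean of the per-class recalls, so a detector that simply shifts toward predicting "fake" raises fake-class recall while lowering real-class recall. The sketch below uses toy labels (not data from the paper) to show that a predict-all-fake detector scores exactly 0.5.

```python
def per_class_recall(y_true, y_pred, cls):
    """Fraction of samples whose true label is cls that were predicted as cls."""
    total = sum(1 for t in y_true if t == cls)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / total

def balanced_accuracy(y_true, y_pred, classes=("real", "fake")):
    """Mean of per-class recalls."""
    return sum(per_class_recall(y_true, y_pred, c) for c in classes) / len(classes)

# Toy labels, illustrative only.
y_true = ["real"] * 4 + ["fake"] * 4
all_fake = ["fake"] * 8

real_recall = per_class_recall(y_true, all_fake, "real")  # 0.0
fake_recall = per_class_recall(y_true, all_fake, "fake")  # 1.0
bal_acc = balanced_accuracy(y_true, all_fake)             # 0.5
```

Because a one-sided bias trades one recall against the other, a simultaneous rise in both recalls cannot come from such a shift alone.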

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other high-level consistencies such as pose or expression coherence could be tested as additional orthogonal cues.
  • The approach could extend to video sequences where gaze dynamics provide temporal signals.
  • Combining the semantic cue with existing low-level detectors might yield hybrid systems robust to both artifact types.

Load-bearing premise

The controlled diagnostic dataset with region-specific perturbations and strict pair-level grouping prevents the model from learning generator-specific fingerprints rather than the intended gaze cue.
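
One way to picture strict pair-level grouping is a split that is decided per pair rather than per image, so the real image and its edited counterpart always land in the same partition. The sketch below is an assumption about how such a split could be implemented, not the authors' code: the `pair_id` field, the hashing scheme, and the 80/10/10 fractions are illustrative.

```python
import hashlib

def split_for_pair(pair_id, train=0.8, val=0.1):
    """Deterministically map a pair ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(str(pair_id).encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to [0, 1).
    u = int(digest[:8], 16) / 0x100000000
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"

def assign_splits(samples):
    """samples: iterable of (image_id, pair_id, label) tuples.

    The split depends only on pair_id, so both members of a pair
    share a partition regardless of their labels.
    """
    return {image_id: split_for_pair(pair_id) for image_id, pair_id, _ in samples}

# Hypothetical sample records: one real and one fake image per pair.
samples = [(f"{i}_{label}", i, label) for i in range(50) for label in ("real", "fake")]
splits = assign_splits(samples)
```

With this kind of split, a model cannot see one half of a pair in training and the other half at test time, which is the leakage the premise is concerned with.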

What would settle it

Train the supervised model on the gaze-perturbed dataset then test on new generators that render consistent periocular structure; if accuracy gains disappear while low-level cues remain blocked, the semantic-cue claim is falsified.

Figures

Figures reproduced from arXiv: 2605.27348 by Hyesong Choi, James Matthew Rehg, Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung.

Figure 1: Custom Gaze construction and Block-Compositional Caption Supervision. Only the …
Figure 2: Block-Compositional Caption schema.

Per-model effective sample counts. LMMs occasionally emit outputs that the regex parser (^This is a (real|fake) image\.) cannot match; such samples are excluded rather than coerced. Parsing-failure rates remain < 0.06% on every benchmark. Vision-only detectors emit scalar scores

Model            Custom Gaze   COCOAI Person   COCOAI Inter
FakeVLM origin   4,681 (−3)    15,720 (0)      198 (0)
Our…
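
The parsing rule quoted in the caption above can be sketched as follows. The regular expression is taken from the caption; everything else (function names, no whitespace stripping, no regex flags) is an assumption about how such a parser might be wired up.

```python
import re

# Pattern as quoted in the caption; it must match at the start of the output.
VERDICT_RE = re.compile(r"^This is a (real|fake) image\.")

def parse_verdict(output):
    """Return 'real' or 'fake', or None when the output does not match."""
    match = VERDICT_RE.match(output)
    return match.group(1) if match else None

def effective_samples(outputs):
    """Drop unparseable outputs instead of coercing them to a label.

    Returns (kept_verdicts, number_excluded).
    """
    parsed = [parse_verdict(o) for o in outputs]
    kept = [p for p in parsed if p is not None]
    return kept, len(parsed) - len(kept)

# Hypothetical model outputs.
outputs = [
    "This is a fake image. The two people do not look at each other.",
    "This is a real image. Gaze and head pose agree.",
    "I believe this image is probably real.",
]
kept, excluded = effective_samples(outputs)
```

Excluding rather than coercing keeps the metric honest, at the cost of slightly different sample counts per model.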

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
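
As a rough picture of what "a single reasoning skeleton held invariant across surface-diverse captions" could mean, the sketch below fixes the order of five blocks and varies only the wording inside each block. The block names, the example sentences, and the pool sizes are all invented for illustration; the paper's actual blocks and its 1,250 captions are not reproduced here.

```python
import random

# Hypothetical block names; the order is the fixed skeleton.
BLOCK_ORDER = ("verdict", "gaze_direction", "head_eye_alignment",
               "pupil_placement", "summary")

# Invented paraphrase pools, one per block (fake-class captions only).
VARIANTS = {
    "verdict": ["This is a fake image."],
    "gaze_direction": ["The two people do not look toward each other.",
                       "Their lines of sight fail to meet."],
    "head_eye_alignment": ["The eyes point away from where the head is turned.",
                           "Head pose and eye direction disagree."],
    "pupil_placement": ["The pupils sit off-center in an implausible way.",
                        "Pupil positions differ between the two eyes."],
    "summary": ["These cues together indicate manipulation.",
                "Taken together, the gaze pattern is inconsistent."],
}

def compose_caption(rng):
    """Pick one variant per block, always in BLOCK_ORDER."""
    return " ".join(rng.choice(VARIANTS[name]) for name in BLOCK_ORDER)

def count_captions():
    """Number of distinct captions the pools can produce."""
    total = 1
    for name in BLOCK_ORDER:
        total *= len(VARIANTS[name])
    return total

caption = compose_caption(random.Random(0))
n_captions = count_captions()
```

The surface text changes from caption to caption while the sequence of reasoning steps does not, which is the decoupling the abstract describes.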

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that Social Gaze Consistency—defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals—constitutes an orthogonal high-level semantic cue for detecting AI-generated images. It supports this via a controlled diagnostic dataset using region-specific perturbations of gaze-consistent imagery with strict pair-level grouping on a single inpainter (FLUX.1-Fill), Block-Compositional Caption Supervision holding a fixed 5-block reasoning skeleton, and cross-architecture experiments showing balanced-accuracy gains of +3.7 pp (67.8→71.5) on the COCOAI Interaction subset and +1.3 pp (83.0→84.3) on the Person subset for FakeVLM, with consistent gains on Effort; a four-step mechanistic account is offered to explain transfer to multi-generator suites.

Significance. If the central claim holds, the work supplies a backbone-agnostic semantic detection axis that remains useful in person-centric and partial-edit regimes where low-level fingerprints are diminished. The planned code release and simultaneous rise in real- and fake-class recalls are strengths that would aid reproducibility and rule out trivial artifacts.

major comments (3)
  1. [Abstract (controlled diagnostic dataset)] The orthogonality claim rests on the assertion that strict pair-level grouping 'forecloses generator-fingerprint memorization'; however, the construction uses only FLUX.1-Fill and the mechanistic account itself notes 'diffusion-family shared spectral weakness in periocular structure,' leaving open the possibility that the detector exploits FLUX-specific pupil-placement or eye-alignment traces correlated with the perturbation rather than the intended semantic cue. An ablation or verification that isolates the gaze cue from these traces is required.
  2. [Abstract (results)] The reported lifts (+3.7 pp and +1.3 pp) are given without error bars, statistical significance tests, or an ablation that holds caption structure fixed while varying only the gaze-consistency label; without these, it is unclear whether the gains are attributable to the proposed cue or to other factors in the supervision.
  3. [Abstract (four-step mechanistic account)] The account (paired-edit shortcut blocking, hard-to-easy transfer, CLIP prior preservation, diffusion-family spectral weakness) is presented without quantitative tests or controls for each step, rendering it post-hoc and insufficient to substantiate why training on one inpainter transfers.
minor comments (1)
  1. The abstract states that code will be released upon acceptance; the release should also include the exact region-specific perturbation protocol and pair-grouping code to allow independent verification of the isolation claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and indicating where we will revise the manuscript to strengthen the claims.

  1. Referee: [Abstract (controlled diagnostic dataset)] The orthogonality claim rests on the assertion that strict pair-level grouping 'forecloses generator-fingerprint memorization'; however, the construction uses only FLUX.1-Fill and the mechanistic account itself notes 'diffusion-family shared spectral weakness in periocular structure,' leaving open the possibility that the detector exploits FLUX-specific pupil-placement or eye-alignment traces correlated with the perturbation rather than the intended semantic cue. An ablation or verification that isolates the gaze cue from these traces is required.

    Authors: We agree that the single-inpainter construction leaves room for potential generator-specific correlations, even with pair-level grouping. While the grouping ensures that gaze-consistent and gaze-inconsistent pairs share the same generator (preventing direct fingerprint memorization), we will add a targeted ablation in the revised manuscript. This will compare models trained on gaze-perturbed pairs against those trained on non-gaze perturbations (e.g., background or lighting changes) under identical conditions to isolate the contribution of the social gaze cue. revision: yes

  2. Referee: [Abstract (results)] The reported lifts (+3.7 pp and +1.3 pp) are given without error bars, statistical significance tests, or an ablation that holds caption structure fixed while varying only the gaze-consistency label; without these, it is unclear whether the gains are attributable to the proposed cue or to other factors in the supervision.

    Authors: Error bars and significance tests appear in the full experimental results (Section 4), but we acknowledge their absence from the abstract. We will revise the abstract to report these details. We will also add an explicit ablation that holds the 5-block caption skeleton fixed while varying only the gaze-consistency label, directly addressing whether the gains derive from the proposed cue. revision: yes
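
    For readers wondering what an error bar on a balanced-accuracy lift might look like, one common option is a paired bootstrap over test images. The sketch below is a generic illustration and not the procedure used in the paper; the function names, 0/1 label encoding, and resample count are assumptions. Small test sets make such intervals wide: the Figure 2 caption lists 198 effective samples for the COCOAI Interaction subset.

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Mean recall over classes 0 (real) and 1 (fake)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / 2

def paired_bootstrap_ci(y_true, pred_base, pred_new, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for BA(new) - BA(base), resampling images with replacement.

    The same resampled indices are used for both models, which keeps the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # balanced accuracy is undefined if a class is missing
        new = balanced_accuracy(yt, [pred_new[i] for i in idx])
        base = balanced_accuracy(yt, [pred_base[i] for i in idx])
        deltas.append(new - base)
    deltas.sort()
    lo = deltas[int((alpha / 2) * len(deltas))]
    hi = deltas[int((1 - alpha / 2) * len(deltas)) - 1]
    return lo, hi

# Toy check: a perfect model versus one that always predicts class 0.
y_true = [0, 1] * 50
lo, hi = paired_bootstrap_ci(y_true, [0] * 100, list(y_true))
```

    An interval that excludes zero would support the claim that the lift is not a resampling artifact; it says nothing by itself about which part of the supervision caused it.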

  3. Referee: [Abstract (four-step mechanistic account)] The account (paired-edit shortcut blocking, hard-to-easy transfer, CLIP prior preservation, diffusion-family spectral weakness) is presented without quantitative tests or controls for each step, rendering it post-hoc and insufficient to substantiate why training on one inpainter transfers.

    Authors: The four-step account is offered as an explanatory synthesis of the observed transfer across generators and architectures rather than a fully quantified causal model. We will expand the relevant section with additional quantitative controls, including metrics for difficulty transfer and CLIP prior preservation, to provide stronger substantiation for the transfer mechanism. revision: partial

Circularity Check

0 steps flagged

No significant circularity; cue and validation remain independent of inputs


The paper defines Social Gaze Consistency explicitly as mutual coherence of gaze direction, head-eye alignment, and pupil placement between individuals. It constructs a diagnostic dataset via region-specific perturbations on gaze-consistent imagery plus strict pair-level grouping, then reports empirical gains on two backbones (FakeVLM +3.7 pp, Effort consistent) with simultaneous real/fake recall rise. No equations exist that equate the cue to any fitted parameter or training objective. No self-citations are invoked as load-bearing uniqueness theorems. The orthogonality claim rests on the dataset construction and cross-architecture transfer results rather than any definitional reduction or renaming of known patterns. This is the common case of an empirical contribution that does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit modeling assumptions or fitted constants are stated.

pith-pipeline@v0.9.1-grok · 5885 in / 1070 out tokens · 27607 ms · 2026-06-29T17:57:36.969999+00:00 · methodology

