When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Hyesong Choi; James Matthew Rehg; Jihyeon Kim; Sohee Kim; Soosan Lee; Souhwan Jung

arxiv: 2605.27348 · v2 · pith:6PPW2XG4new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 17:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords AI-generated image detection · social gaze consistency · semantic cues · generative models · vision-language models · diagnostic datasets · image forensics · gaze direction

The pith

Social gaze consistency between interacting people serves as a cue to detect AI-generated images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that recent generators have largely eliminated low-level artifacts like pixel fingerprints and frequency anomalies, especially in small person-centric edits surrounded by real content. It defines Social Gaze Consistency as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between people in an image, positioning this as an orthogonal high-level semantic detection axis. The work creates a controlled diagnostic dataset using region-specific gaze perturbations and strict pair-level grouping, paired with Block-Compositional Caption Supervision that keeps reasoning structure fixed across varied captions. This training raises balanced accuracy on interaction images for both vision-language and vision-only backbones while lifting recalls for real and fake classes alike. Gains transfer from one inpainter to multi-generator test sets due to shared diffusion-family weaknesses in periocular structure.

Core claim

Social Gaze Consistency constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. A controlled diagnostic dataset with region-specific perturbations and pair-level grouping, combined with Block-Compositional Caption Supervision that holds a fixed 5-block reasoning skeleton across 1,250 captions, enables training that improves the balanced accuracy of a vision-language backbone by 3.7 percentage points on the COCOAI Interaction subset and 1.3 percentage points on the Person subset, with parallel gains on a vision-only backbone. The same supervision produces simultaneous rises in real-class and fake-class recall and generalizes from one inpainter to multi-generator suites via paired-edit shortcut blocking…

What carries the argument

Social Gaze Consistency, defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals.

If this is right

Detection accuracy rises simultaneously on real and fake classes rather than through a one-sided bias.
The same supervision improves both vision-language and vision-only backbones, showing backbone-agnostic utility.
Training on outputs from a single inpainter transfers to multi-generator test suites.
Hard-to-easy difficulty transfer occurs when the fixed reasoning skeleton is applied across surface-diverse captions.
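
The first point above matters because balanced accuracy is the mean of the two per-class recalls, so a detector that labels everything fake cannot score above 50%. A minimal sketch of that arithmetic follows; the toy labels and the function names `recalls` and `balanced_accuracy` are invented for illustration and are not taken from the paper.

```python
def recalls(y_true, y_pred):
    """Return (real_recall, fake_recall) for labels in {"real", "fake"}."""
    out = {}
    for cls in ("real", "fake"):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        out[cls] = hits / total if total else 0.0
    return out["real"], out["fake"]


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy = mean of real-class and fake-class recall."""
    real_recall, fake_recall = recalls(y_true, y_pred)
    return (real_recall + fake_recall) / 2


# Toy ground truth (hypothetical): four real and four fake images.
y_true = ["real"] * 4 + ["fake"] * 4

# A degenerate "predict-all-fake" detector: perfect fake recall, zero real recall.
all_fake = ["fake"] * 8

# A detector that gets three of four right in each class.
mixed = ["real", "real", "real", "fake", "fake", "fake", "fake", "real"]

print(recalls(y_true, all_fake), balanced_accuracy(y_true, all_fake))
print(recalls(y_true, mixed), balanced_accuracy(y_true, mixed))
```

Under this metric a one-sided bias shows up as one recall rising while the other falls, which is why a simultaneous rise in both is read as evidence against a trivial artifact.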

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other high-level consistencies such as pose or expression coherence could be tested as additional orthogonal cues.
The approach could extend to video sequences where gaze dynamics provide temporal signals.
Combining the semantic cue with existing low-level detectors might yield hybrid systems robust to both artifact types.

Load-bearing premise

The controlled diagnostic dataset with region-specific perturbations and strict pair-level grouping prevents the model from learning generator-specific fingerprints rather than the intended gaze cue.
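
One way to picture pair-level grouping is a split that assigns whole real/fake pairs to a partition, so that an original image and its edited counterpart never land on opposite sides of the train/test boundary. The sketch below assumes this reading; the function `split_by_pair`, the integer pair IDs and the 80/10/10 fractions are illustrative choices, not the authors' released code.

```python
import random


def split_by_pair(pair_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole pairs (not individual images) to train/val/test."""
    ids = sorted(set(pair_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


# Hypothetical dataset: 10 pairs, each contributing one real and one fake image.
samples = [(pair_id, label) for pair_id in range(10) for label in ("real", "fake")]

splits = split_by_pair([pair_id for pair_id, _ in samples])
train = [s for s in samples if s[0] in splits["train"]]

# Every pair in the training split contributes both of its images.
print(len(splits["train"]), len(train))
```

Because both members of a pair share the same scene and differ only in the edited region, keeping them together is one way to stop a model from separating the classes on scene content alone.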

What would settle it

Train the supervised model on the gaze-perturbed dataset then test on new generators that render consistent periocular structure; if accuracy gains disappear while low-level cues remain blocked, the semantic-cue claim is falsified.

Figures

Figures reproduced from arXiv: 2605.27348 by Hyesong Choi, James Matthew Rehg, Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung.

**Figure 2.** Block-Compositional Caption schema.

Per-model effective sample counts. LMMs occasionally emit outputs that the regex parser (^This is a (real|fake) image\.) cannot match; such samples are excluded rather than coerced. Parsing-failure rates remain < 0.06% on every benchmark. Vision-only detectors emit scalar scores…

| Model | Custom Gaze | COCOAI Person | COCOAI Inter |
| FakeVLM origin | 4,681 (−3) | 15,720 (0) | 198 (0) |
| Our… | | | |
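
The exclusion rule described in this caption can be sketched as follows. Only the regular expression comes from the caption; the helper `parse_verdict` and the three sample outputs are hypothetical.

```python
import re

# Pattern quoted in the caption: the verdict must open the model output.
VERDICT = re.compile(r"^This is a (real|fake) image\.")


def parse_verdict(text):
    """Return "real" or "fake", or None when the output does not match."""
    match = VERDICT.match(text)
    return match.group(1) if match else None


# Hypothetical model outputs; the third one cannot be parsed.
outputs = [
    "This is a fake image. The two people do not look at each other.",
    "This is a real image.",
    "The image appears to be fake.",
]

parsed = [parse_verdict(o) for o in outputs]

# Unparseable samples are dropped rather than coerced to a class.
kept = [p for p in parsed if p is not None]
failure_rate = 1 - len(kept) / len(outputs)
print(parsed, failure_rate)
```

Dropping rather than coercing keeps parsing failures from being counted as errors for either class, at the cost of a slightly smaller effective sample count per model.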

Original abstract

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
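
The idea of holding one reasoning skeleton fixed while varying surface wording can be illustrated with a small combinatorial sketch. Everything specific here is an assumption: the block names, the placeholder phrases and the pool sizes (2, 5, 5, 5, 5) were chosen only so that the product equals 1,250, and the abstract does not state how the paper's 1,250 captions are actually decomposed.

```python
from itertools import product

# Hypothetical block names and pool sizes; 2 * 5 * 5 * 5 * 5 = 1,250.
BLOCKS = [
    ("verdict", 2),
    ("gaze_direction", 5),
    ("head_eye_alignment", 5),
    ("pupil_placement", 5),
    ("conclusion", 5),
]

# Placeholder phrasings stand in for real sentence variants.
pools = [[f"<{name}:{i}>" for i in range(n)] for name, n in BLOCKS]

# Each caption picks one variant per block, always in the same block order.
captions = [" ".join(parts) for parts in product(*pools)]

names = [name for name, _ in BLOCKS]
same_skeleton = all(
    [part.strip("<>").split(":")[0] for part in caption.split()] == names
    for caption in captions
)
print(len(captions), same_skeleton)
```

The surface form varies across all 1,250 strings while the order of reasoning steps never changes, which is the decoupling of reasoning consistency from surface diversity that the abstract describes.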

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Social gaze consistency is a reasonable new framing for detection but the experiments do not isolate it from generator-specific eye traces.

The main things to know are that the paper tries to treat mutual gaze coherence as an orthogonal semantic signal for spotting AI images in multi-person scenes, and that the controls they built do not fully close off the possibility the model is still using low-level FLUX artifacts around the eyes.

What the work does is define the cue explicitly around gaze direction, head-eye alignment, and pupil placement between people. They create a diagnostic set by applying region-specific edits with one inpainter on real gaze-consistent photos and use strict pair-level grouping to block simple generator memorization. The block-compositional caption approach holds the reasoning skeleton fixed across many surface variations, which is a clean way to push the model toward structure rather than text patterns. They show modest balanced-accuracy gains on both a vision-language backbone and a vision-only one, with real and fake recall both improving.

The soft spots are the missing pieces that would confirm the cue is actually being used. The abstract reports no error bars or significance tests on the lifts. No ablation keeps the caption setup but drops the gaze perturbation to show the effect comes from the intended signal. The four-step account offered after the results appeals to a shared diffusion-family spectral weakness in periocular structure, which is itself a low-level property and so undercuts the orthogonality claim. Pair grouping alone does not rule out FLUX-specific pupil or alignment traces, since all perturbations come from the same model.

This is for people building detectors that need cues beyond pixel fingerprints. A reader focused on high-level semantic signals will find the framing and caption method worth looking at. It does not yet have the controls to stand as strong evidence.

I would send it to peer review. The question is worth asking and the current version is structured enough for referees to give targeted feedback on the isolation issue.

Referee Report

3 major / 1 minor

Summary. The paper claims that Social Gaze Consistency—defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals—constitutes an orthogonal high-level semantic cue for detecting AI-generated images. It supports this via a controlled diagnostic dataset using region-specific perturbations of gaze-consistent imagery with strict pair-level grouping on a single inpainter (FLUX.1-Fill), Block-Compositional Caption Supervision holding a fixed 5-block reasoning skeleton, and cross-architecture experiments showing balanced-accuracy gains of +3.7 pp (67.8→71.5) on the COCOAI Interaction subset and +1.3 pp (83.0→84.3) on the Person subset for FakeVLM, with consistent gains on Effort; a four-step mechanistic account is offered to explain transfer to multi-generator suites.

Significance. If the central claim holds, the work supplies a backbone-agnostic semantic detection axis that remains useful in person-centric and partial-edit regimes where low-level fingerprints are diminished. The planned code release and simultaneous rise in real- and fake-class recalls are strengths that would aid reproducibility and rule out trivial artifacts.

major comments (3)

[Abstract (controlled diagnostic dataset)] The orthogonality claim rests on the assertion that strict pair-level grouping 'forecloses generator-fingerprint memorization'; however, the construction uses only FLUX.1-Fill and the mechanistic account itself notes 'diffusion-family shared spectral weakness in periocular structure,' leaving open the possibility that the detector exploits FLUX-specific pupil-placement or eye-alignment traces correlated with the perturbation rather than the intended semantic cue. An ablation or verification that isolates the gaze cue from these traces is required.
[Abstract (results)] The reported lifts (+3.7 pp and +1.3 pp) are given without error bars, statistical significance tests, or an ablation that holds caption structure fixed while varying only the gaze-consistency label; without these, it is unclear whether the gains are attributable to the proposed cue or to other factors in the supervision.
[Abstract (four-step mechanistic account)] The account (paired-edit shortcut blocking, hard-to-easy transfer, CLIP prior preservation, diffusion-family spectral weakness) is presented without quantitative tests or controls for each step, rendering it post-hoc and insufficient to substantiate why training on one inpainter transfers.
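
The error bars requested in the second comment could be obtained, for example, with a paired bootstrap over test images. The sketch below is one possible procedure, not something the paper reports; the function names, the integer labels and the default of 2,000 resamples are assumptions.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    per_class = []
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        per_class.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


def paired_bootstrap_delta(y_true, pred_base, pred_new,
                           n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for BA(pred_new) - BA(pred_base).

    Both models are scored on the same resampled images in every draw,
    so the interval reflects the paired difference, not two separate runs.
    """
    rng = random.Random(seed)
    n = len(y_true)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        new = balanced_accuracy(yt, [pred_new[i] for i in idx])
        base = balanced_accuracy(yt, [pred_base[i] for i in idx])
        deltas.append(new - base)
    deltas.sort()
    lower = deltas[int(alpha / 2 * n_boot)]
    upper = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical labels (0 = real, 1 = fake); identical predictions give a zero lift.
y = [0, 1] * 10
print(paired_bootstrap_delta(y, y, y))
```

A reported lift would be considered distinguishable from noise under this procedure when the interval excludes zero; on a subset as small as a few hundred images the interval can be wide.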

minor comments (1)

The abstract states that code will be released upon acceptance; the release should also include the exact region-specific perturbation protocol and pair-grouping code to allow independent verification of the isolation claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications and indicating where we will revise the manuscript to strengthen the claims.

Referee: [Abstract (controlled diagnostic dataset)] The orthogonality claim rests on the assertion that strict pair-level grouping 'forecloses generator-fingerprint memorization'; however, the construction uses only FLUX.1-Fill and the mechanistic account itself notes 'diffusion-family shared spectral weakness in periocular structure,' leaving open the possibility that the detector exploits FLUX-specific pupil-placement or eye-alignment traces correlated with the perturbation rather than the intended semantic cue. An ablation or verification that isolates the gaze cue from these traces is required.

Authors: We agree that the single-inpainter construction leaves room for potential generator-specific correlations, even with pair-level grouping. While the grouping ensures that gaze-consistent and gaze-inconsistent pairs share the same generator (preventing direct fingerprint memorization), we will add a targeted ablation in the revised manuscript. This will compare models trained on gaze-perturbed pairs against those trained on non-gaze perturbations (e.g., background or lighting changes) under identical conditions to isolate the contribution of the social gaze cue.

revision: yes

Referee: [Abstract (results)] The reported lifts (+3.7 pp and +1.3 pp) are given without error bars, statistical significance tests, or an ablation that holds caption structure fixed while varying only the gaze-consistency label; without these, it is unclear whether the gains are attributable to the proposed cue or to other factors in the supervision.

Authors: Error bars and significance tests appear in the full experimental results (Section 4), but we acknowledge their absence from the abstract. We will revise the abstract to report these details. We will also add an explicit ablation that holds the 5-block caption skeleton fixed while varying only the gaze-consistency label, directly addressing whether the gains derive from the proposed cue.

revision: yes

Referee: [Abstract (four-step mechanistic account)] The account (paired-edit shortcut blocking, hard-to-easy transfer, CLIP prior preservation, diffusion-family spectral weakness) is presented without quantitative tests or controls for each step, rendering it post-hoc and insufficient to substantiate why training on one inpainter transfers.

Authors: The four-step account is offered as an explanatory synthesis of the observed transfer across generators and architectures rather than a fully quantified causal model. We will expand the relevant section with additional quantitative controls, including metrics for difficulty transfer and CLIP prior preservation, to provide stronger substantiation for the transfer mechanism.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; cue and validation remain independent of inputs

The paper defines Social Gaze Consistency explicitly as mutual coherence of gaze direction, head-eye alignment, and pupil placement between individuals. It constructs a diagnostic dataset via region-specific perturbations on gaze-consistent imagery plus strict pair-level grouping, then reports empirical gains on two backbones (FakeVLM +3.7 pp, Effort consistent) with simultaneous real/fake recall rise. No equations exist that equate the cue to any fitted parameter or training objective. No self-citations are invoked as load-bearing uniqueness theorems. The orthogonality claim rests on the dataset construction and cross-architecture transfer results rather than any definitional reduction or renaming of known patterns. This is the common case of an empirical contribution that does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate free parameters, axioms, or invented entities; no explicit modeling assumptions or fitted constants are stated.

pith-pipeline@v0.9.1-grok · 5885 in / 1070 out tokens · 27607 ms · 2026-06-29T17:57:36.969999+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 14 canonical work pages · 7 internal anchors

[1] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Technical report, OpenAI, 2023.
[2] Black Forest Labs. Flux.1 fill [dev]. Technical report, Black Forest Labs, 2024.
[3] K. H. Brodersen et al. The balanced accuracy and its posterior distribution. In International Conference on Pattern Recognition, 2010.
[4] X. Cao et al. Socialgesture: Delving into multi-person gesture understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. arXiv:2504.02244.
[5] Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg. Detecting attended visual targets in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[6] T. Dettmers et al. Qlora: Efficient finetuning of quantized llms. In Advances in Neural Information Processing Systems, 2023.
[7] B. Doosti et al. Boosting image-based mutual gaze detection using pseudo 3d gaze. In AAAI Conference on Artificial Intelligence, 2021. arXiv:2010.07811.
[8] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024.
[10] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning, 2020.
[11] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. Iii, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
[12] H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu. Eyes tell all: Irregular pupil shapes reveal GAN-generated faces. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[13] E. J. Hu et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. arXiv:2106.09685.
[14] Shu Hu, Yuezun Li, and Siwei Lyu. Exposing GAN-generated faces using inconsistent corneal specular highlights. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2500–2504, 2021.
[15] Z. Huang et al. Sida: Social media image deepfake detection, localization and explanation with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[16] P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[17] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspectives. arXiv preprint arXiv:2408.06741, 2024.
[18] Y. Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[19] H. Liu et al. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023. Oral, arXiv:2304.08485.
[20] H. Liu et al. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. arXiv:2310.03744.
[21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[22] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[23] M. J. Marín-Jiménez et al. Laeo-net: Revisiting people looking at each other in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[24] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2022.
[25] U. Ojha et al. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. arXiv:2302.10174.
[26] A. Radford et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. arXiv:2103.00020.
[27] S. Rajbhandari et al. Zero: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. arXiv:1910.02054.
[28] Adrià Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. Where are they looking? In Advances in Neural Information Processing Systems, 2015.
[29] R. Roy et al. A comprehensive dataset for human vs. ai generated image detection. arXiv preprint arXiv:2601.00553, 2026.
[30] C. Tan et al. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. arXiv:2312.10461.
[31] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[32] J. Wen et al. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. In Advances in Neural Information Processing Systems, 2025.
[33] Z. Yan et al. Orthogonal subspace decomposition for generalizable ai-generated image detection. In International Conference on Machine Learning, 2025. Oral, arXiv:2411.15633.
[34] Z. Yan et al. A sanity check for ai-generated image detection. In International Conference on Learning Representations, 2025. arXiv:2406.19435.
[35] Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. DeepfakeBench: A comprehensive benchmark of deepfake detection. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[36] J. Ye et al. Loki: A comprehensive synthetic data detection benchmark using large multimodal models. In International Conference on Learning Representations, 2025. Spotlight, arXiv:2410.09732.
[37] X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the European Conference on Computer Vision, 2020.

A Datasets: Custom Gaze and COCOAI. This appendix provides datasheets for both datasets used in the paper: the proposed Cust...

[38] Pair-level grouping is essential for the shortcut-blocking mechanism: with identity-grouped splits,
Table 9: Custom Gaze splits. Pair count equals real count equals fake count by construction (each pair contributes exactly one real and one fake image); Total = Real + Fake = 2 × Pairs.
| Split | Pairs | Real | Fake | Total |
| Train (80%) | 18,732 | 18,732 | 18,732 | 37,464 |
| Val (10%) | 2,... |

[1] [1]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Technical report, OpenAI, 2023

2023

[2] [2]

Flux.1 fill [dev]

Black Forest Labs. Flux.1 fill [dev]. Technical report, Black Forest Labs, 2024

2024

[3] [3]

K. H. Brodersen et al. The balanced accuracy and its posterior distribution. InInternational Conference on Pattern Recognition, 2010

2010

[4] [4]

Cao et al

X. Cao et al. Socialgesture: Delving into multi-person gesture understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. arXiv:2504.02244

work page arXiv 2025

[5] [5]

Eunji Chong, Yongxin Wang, Nataniel Ruiz, and James M. Rehg. Detecting attended visual targets in video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2020

[6] [6]

Dettmers et al

T. Dettmers et al. Qlora: Efficient finetuning of quantized llms. InAdvances in Neural Information Processing Systems, 2023

2023

[7] [7]

Doosti et al

B. Doosti et al. Boosting image-based mutual gaze detection using pseudo 3d gaze. InAAAI Conference on Artificial Intelligence, 2021. arXiv:2010.07811

work page arXiv 2021

[8] [8]

Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions

Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2020

[9] [9]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning, 2024

2024

[10] [10]

Leveraging frequency analysis for deep fake image recognition

Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. InInternational Conference on Machine Learning, 2020

2020

[11] [11]

Gebru, J

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. Iii, and K. Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021

[12] [12]

H. Guo, S. Hu, X. Wang, M.-C. Chang, and S. Lyu. Eyes tell all: Irregular pupil shapes reveal GAN- generated faces. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

2022

[13] [13]

E. J. Hu et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. arXiv:2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Exposing GAN-generated faces using inconsistent corneal specular highlights

Shu Hu, Yuezun Li, and Siwei Lyu. Exposing GAN-generated faces using inconsistent corneal specular highlights. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2500–2504, 2021. 10

2021

[15] [15]

Huang et al

Z. Huang et al. Sida: Social media image deepfake detection, localization and explanation with large mul- timodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

[16] [16]

Kellnhofer, A

P. Kellnhofer, A. Recasens, S. Stent, W. Matusik, and A. Torralba. Gaze360: Physically unconstrained gaze estimation in the wild. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2019

2019

[17] [17]

Improving synthetic image detection towards generalization: An image transformation perspectives.arXiv preprint arXiv:2408.06741, 2024

Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspectives.arXiv preprint arXiv:2408.06741, 2024

work page arXiv 2024

[18] [18]

Li, M.-C

Y . Li, M.-C. Chang, and S. Lyu. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking. InIEEE International Workshop on Information Forensics and Security (WIFS), 2018

2018

[19] [19]

Visual Instruction Tuning

H. Liu et al. Visual instruction tuning. InAdvances in Neural Information Processing Systems, 2023. Oral, arXiv:2304.08485

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Improved Baselines with Visual Instruction Tuning

H. Liu et al. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. arXiv:2310.03744

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

2019

[22]

Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[23]

M. J. Marín-Jiménez et al. LAEO-Net: Revisiting people looking at each other in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

[24]

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, 2022.

[25]

U. Ojha et al. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. arXiv:2302.10174.

[26]

A. Radford et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. arXiv:2103.00020.

[27]

S. Rajbhandari et al. ZeRO: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis, 2020. arXiv:1910.02054.

[28]

Adrià Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. Where are they looking? In Advances in Neural Information Processing Systems, 2015.

[29]

R. Roy et al. A comprehensive dataset for human vs. AI generated image detection. arXiv preprint arXiv:2601.00553, 2026.

[30]

C. Tan et al. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. arXiv:2312.10461.

[31]

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[32]

J. Wen et al. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. In Advances in Neural Information Processing Systems, 2025.

[33]

Z. Yan et al. Orthogonal subspace decomposition for generalizable AI-generated image detection. In International Conference on Machine Learning, 2025. Oral, arXiv:2411.15633.

[34]

Z. Yan et al. A sanity check for AI-generated image detection. In International Conference on Learning Representations, 2025. arXiv:2406.19435.

[35]

Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. DeepfakeBench: A comprehensive benchmark of deepfake detection. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

[36]

J. Ye et al. LOKI: A comprehensive synthetic data detection benchmark using large multimodal models. In International Conference on Learning Representations, 2025. Spotlight, arXiv:2410.09732.

[37]

X. Zhang, S. Park, T. Beeler, D. Bradley, S. Tang, and O. Hilliges. ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In Proceedings of the European Conference on Computer Vision, 2020.

A Datasets: Custom Gaze and COCOAI

This appendix provides datasheets for both datasets used in the paper: the proposed Cust...


Pair-level grouping is essential for the shortcut-blocking mechanism: with identity-grouped splits,

Table 9: Custom Gaze splits. Pair count equals real count equals fake count by construction (each pair contributes exactly one real and one fake image); Total = Real + Fake = 2 × Pairs.

| Split | Pairs | Real | Fake | Total |
| Train (80%) | 18,732 | 18,732 | 18,732 | 37,464 |
| Val (10%) | 2,...
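The pair-level grouping idea can be illustrated with a minimal sketch, assuming each sample carries a pair identifier shared by a real image and its edited counterpart. The tuple schema `(pair_id, label, path)`, the function name `pair_level_split`, and the default 80/10/10 fractions are hypothetical choices made for illustration; the paper's actual splitting code is not shown in this excerpt. The point of the sketch is only that shuffling and cutting happen over pair IDs rather than over individual images, so a real image and its fake twin can never land in different splits.

```python
import random
from collections import defaultdict


def pair_level_split(samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split samples so that every pair stays inside a single split.

    samples: iterable of (pair_id, label, path) tuples (hypothetical schema).
    fractions: train/val/test shares of the PAIRS, not of the images.
    """
    # Group all images that belong to the same real/fake pair.
    by_pair = defaultdict(list)
    for sample in samples:
        by_pair[sample[0]].append(sample)

    # Shuffle pair IDs deterministically, then cut the ID list.
    pair_ids = sorted(by_pair)
    random.Random(seed).shuffle(pair_ids)
    n_train = int(fractions[0] * len(pair_ids))
    n_val = int(fractions[1] * len(pair_ids))
    cuts = {
        "train": pair_ids[:n_train],
        "val": pair_ids[n_train:n_train + n_val],
        "test": pair_ids[n_train + n_val:],
    }

    # Expand each group of pair IDs back into its images.
    return {
        name: [s for pid in ids for s in by_pair[pid]]
        for name, ids in cuts.items()
    }
```

Because each pair contributes exactly one real and one fake image, every split produced this way satisfies Real = Fake = Pairs and Total = 2 × Pairs, matching the invariant stated in the Table 9 caption.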