TruEye: Fine-Grained Detection of AI-Generated Human Subjects in Images

Dan Lin; Jay Barot

arxiv: 2606.27505 · v1 · pith:OPLHNSUMnew · submitted 2026-06-25 · 💻 cs.CV


Pith reviewed 2026-06-29 01:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords AI image detection · fine-grained classification · dual-stream transformer · image manipulation · human subject detection · compositional categories · synthetic content


The pith

TruEye distinguishes five compositional categories of AI-generated human content in images using a dual-stream transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TruEye for fine-grained detection of AI-manipulated humans and scenes in images. It moves beyond binary classification to identify five specific categories, including the difficult case of a real person inserted into a real background. The approach relies on separating human and scene processing streams with attention mechanisms to ensure coherence. This matters because it offers faster, more interpretable detection that generalizes better to new generators without depending on large language models.

Core claim

TruEye is the first to distinguish among five compositional categories of synthetic content, including the most challenging case in which a real human is composited into a real scene where they were never physically present. At its core is a mask-conditioned dual-stream transformer that separates human and scene tokens while preserving patch-level spatial correspondence, with region-gated cross attention and token-level supervision yielding robust predictions over 100 times faster than LLM-based methods.

What carries the argument

The mask-conditioned dual-stream transformer with region-gated cross attention and token-level supervision, which separates human and scene tokens and enforces semantic coherence between subject and background.
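The mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`split_tokens`, `attend`, `gated_cross_attention`), the per-token scalar gate, and the residual update are all assumptions made for illustration, since the source describes the components only in prose. The sketch shows a patch-level mask splitting tokens into two streams while keeping their original patch indices, attention restricted to each stream, and a gated step in which human tokens query scene tokens.

```python
import math

def split_tokens(human_mask):
    """Split patch indices into human and scene streams (indices are kept,
    so patch-level spatial correspondence is preserved)."""
    human_idx = [i for i, m in enumerate(human_mask) if m]
    scene_idx = [i for i, m in enumerate(human_mask) if not m]
    return human_idx, scene_idx

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def gated_cross_attention(human, scene, gates):
    """Hypothetical gating: each human token adds scene context scaled by
    its own gate in [0, 1]. The paper's actual gate is not specified here."""
    context = attend(human, scene, scene)
    return [[h + g * c for h, c in zip(h_tok, c_tok)]
            for h_tok, c_tok, g in zip(human, context, gates)]

# Toy input: 4 patches with 2-dimensional embeddings; patches 1 and 2 are human.
human_mask = [0, 1, 1, 0]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

human_idx, scene_idx = split_tokens(human_mask)
human_in = [tokens[i] for i in human_idx]
scene_in = [tokens[i] for i in scene_idx]

# Intra-stream attention: each stream only attends to its own tokens.
human = attend(human_in, human_in, human_in)
scene = attend(scene_in, scene_in, scene_in)

# Cross-stream step: the second human token has its gate closed.
fused = gated_cross_attention(human, scene, gates=[1.0, 0.0])
```

With a gate of 0 the human token passes through unchanged, which is one simple way a model could suppress scene context where it is uninformative. A learned gate would replace the fixed values used here.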

If this is right

Surpasses state-of-the-art detectors in accuracy on six datasets and the new FineSyn dataset.
Generalizes to unseen AI-generated or manipulated images.
Runs over 100 times faster than LLM-based competitors.
Provides interpretable predictions without invoking an LLM.
Localizes and classifies challenging real-human composite cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of streams could be adapted for detecting other manipulated objects beyond humans.
Token-level supervision might allow for pixel-precise localization in future extensions.
Speed improvements enable deployment on resource-limited devices for real-time checks.
Five-category classification could inform policies on image authenticity in social media.

Load-bearing premise

The mask-conditioned dual-stream transformer will reliably separate human and scene tokens and enforce semantic coherence across unseen generators and real-world compositions.

What would settle it

Evaluation on a new dataset of composited real humans in real scenes generated by a previously unseen method, where the model misclassifies the category at rates similar to or higher than binary detectors.
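A minimal sketch of how such a check might be scored, assuming per-image labels are available. The category names below are placeholders (the source does not name the five categories), and the toy labels are invented purely to exercise the functions. The idea is to compare the five-way model's recall on the composite category against how often a binary detector flags the same images.

```python
from collections import Counter

# Placeholder names: the source does not list the five categories by name.
CATEGORIES = ["cat_0", "cat_1", "cat_2", "cat_3", "real_composite"]

def per_category_recall(y_true, y_pred, categories):
    """Recall for each category; None when the category has no samples."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: (correct[c] / total[c] if total[c] else None)
            for c in categories}

def binary_flag_rate(y_true, flagged, category):
    """Fraction of `category` samples a binary detector marks as manipulated."""
    flags = [f for t, f in zip(y_true, flagged) if t == category]
    return sum(flags) / len(flags) if flags else None

# Toy labels for illustration only; these are not results from the paper.
y_true = ["real_composite"] * 4 + ["cat_0"] * 2
y_pred = ["real_composite", "real_composite", "cat_0",
          "real_composite", "cat_0", "cat_1"]
binary_flags = [1, 0, 0, 1, 0, 0]

recall = per_category_recall(y_true, y_pred, CATEGORIES)
baseline = binary_flag_rate(y_true, binary_flags, "real_composite")
```

If the five-way recall on the composite category were no better than the binary flag rate on a held-out generator, that would count against the generalization claim.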

Figures

Figures reproduced from arXiv: 2606.27505 by Dan Lin, Jay Barot.

**Figure 1.** Samples from the FineSyn Dataset and TruEye detection results (labeled). TruEye’s key innovation is a mask-conditioned dual-stream transformer that decomposes each image into human tokens and scene tokens while preserving patch-level spatial correspondence. Each stream performs domain-specific reasoning, fortified by a magnification module that amplifies subtle artifacts, while a region-gated cross-attent…

**Figure 2.** An Overview of TruEye Architecture. (The depicted input image, predicted authenticity map, and patch numbering are included solely for demonstration purposes.) The mask generator is implemented as one transformer layer. The design of this component serves two main purposes. First, it decomposes the artificial artifact detection task into smaller subtasks, enabling the subsequent dual-stream cla…

**Figure 3.** Inference Speed of Models. Axis label: Detection Accuracy.

Original abstract

AI-generated images are proliferating across the Internet. While some are used for entertainment, others are weaponized for fraud and social engineering attacks on social media users. Existing detectors overfit to generators seen during training, treat detection as opaque binary classification, or rely on costly Large Language Models (LLMs) to explain their outputs. In this paper, we present TruEye, a novel model for fine-grained detection and localization of AI-manipulated or AI-generated humans and scenes. Unlike conventional detectors that assign a single authenticity label, TruEye is the first to distinguish among five compositional categories of synthetic content, including the most challenging case in which a real human is composited into a real scene where they were never physically present. At its core is a mask-conditioned dual-stream transformer that separates human and scene tokens while preserving patch-level spatial correspondence. Specialized reasoning within each stream and region-gated cross attention enforce semantic coherence between subject and background, while token-level supervision and global compositional classification yield robust, interpretable predictions without invoking an LLM. By restricting intra-stream attention to semantically coherent tokens, TruEye also runs over $100\times$ faster than LLM-based competitors. Experiments on 6 datasets and our newly curated FineSyn dataset show that TruEye surpasses state-of-the-art detectors with higher accuracy, faster inference, and stronger generalization to unseen AI-generated or manipulated images.
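The abstract's point about restricting intra-stream attention can be made concrete with a short worked count. The symbols $N$, $N_h$, and $N_s$ (total, human, and scene token counts) and the 14 × 14 patch grid are assumptions chosen for illustration and are not taken from the paper.

```latex
% Attention score pairs with N = N_h + N_s tokens
\text{full self-attention:} \quad N^2 = (N_h + N_s)^2
\qquad
\text{per-stream attention:} \quad N_h^2 + N_s^2 = N^2 - 2 N_h N_s

% Example: a 14 x 14 patch grid, one quarter of the patches on the human
N = 196,\; N_h = 49,\; N_s = 147:
\qquad
\frac{N_h^2 + N_s^2}{N^2} = \frac{2401 + 21609}{38416} = \frac{24010}{38416} = 0.625
```

In this example the restriction removes 37.5% of the score pairs. That saving applies within a single attention layer, so it would not by itself account for the reported $100\times$ gap, which is measured against LLM-based methods.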

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TruEye claims a five-category taxonomy and fast dual-stream detector for AI-generated humans but supplies no numbers and the FineSyn labels for the hardest category look unverifiable.


TruEye's main contribution is a five-way taxonomy for synthetic human content in images, including the case of real people composited into scenes they were never in, paired with a mask-conditioned dual-stream transformer that uses region-gated cross attention and token-level supervision.

The architecture is new in this combination and aims to separate human and scene tokens while enforcing coherence without calling an LLM.

It does a reasonable job identifying the overfitting problem in existing binary detectors and the speed cost of LLM-based alternatives.

The soft spots stand out clearly. The abstract asserts better accuracy and generalization across six datasets plus FineSyn but includes zero quantitative results, ablations, or dataset statistics, so those claims cannot be checked. The stress-test concern holds: the 'real human never present' category requires verifiable ground truth on non-presence, and simple pasting without metadata or provenance checks would collapse it into ordinary real-human detection.

The paper engages the literature on detector limitations in a straightforward way.

This is for computer vision researchers working on image forensics or deepfake detection. Someone looking for architecture ideas in fine-grained classification might pick up the dual-stream design, but the missing evidence limits what they can take away.

I would not send this to peer review until the full paper provides the performance numbers and addresses the FineSyn labeling process.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TruEye, a mask-conditioned dual-stream transformer for fine-grained detection and localization of AI-generated or manipulated humans in images. It claims to be the first method to distinguish among five compositional categories of synthetic content—including the hardest case of a real human composited into a real scene where they were never physically present—while achieving higher accuracy, stronger generalization across six datasets plus the new FineSyn dataset, and over 100× faster inference than LLM-based competitors, all without relying on LLMs.

Significance. If the performance claims and FineSyn label reliability hold, the work would meaningfully advance beyond binary or LLM-dependent detection by introducing compositional taxonomy and an efficient dual-stream architecture with region-gated cross-attention and token-level supervision. This could have practical impact in applications requiring interpretable, real-time synthetic-content detection.

major comments (2)

[FineSyn Dataset Curation] The central claim of distinguishing the five compositional categories, especially the 'real human never present' case, rests on FineSyn providing verifiable ground truth for non-presence. The manuscript provides no details on curation (e.g., timestamps, metadata, or scene provenance) that would confirm a human was never in the original scene; without this, the category reduces to generic real-human detection and the five-way taxonomy cannot be meaningfully evaluated.
[Experiments and Results] The abstract asserts superior accuracy, generalization, and 100× speed on six datasets plus FineSyn, yet the available description supplies no quantitative results, error bars, ablation tables, or dataset statistics. This prevents verification of the performance claims that are load-bearing for the contribution.

minor comments (2)

[Inference Efficiency] The speed comparison to LLM-based methods should include concrete wall-clock timings, hardware specifications, and batch sizes in a dedicated table or section to support the 100× claim.
[Method] Notation for the region-gated cross-attention and token-level supervision losses should be introduced with explicit equations rather than descriptive text only.
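The kind of measurement the first minor comment asks for can be sketched as a small timing harness. This is an illustrative sketch only: `benchmark`, `speedup`, and `stand_in_model` are hypothetical names, and the stand-in function takes the place of a real detector. It records warm-up runs, the median wall-clock time, the batch size, and a platform string, so that a speed ratio can be reported together with the conditions it was measured under.

```python
import platform
import statistics
import time

def benchmark(fn, batch, warmup=3, repeats=10):
    """Time fn(batch) and report median wall-clock seconds with context."""
    for _ in range(warmup):          # warm-up runs are discarded
        fn(batch)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return {
        "batch_size": len(batch),
        "repeats": repeats,
        "median_s": median,
        "per_image_s": median / len(batch),
        "platform": platform.platform(),  # coarse hardware/OS context
    }

def speedup(baseline, candidate):
    """Ratio of per-image latency: baseline over candidate."""
    return baseline["per_image_s"] / candidate["per_image_s"]

# Stand-in for a real detector; it only burns a little CPU time per item.
def stand_in_model(batch):
    return [sum(range(1000)) + x for x in batch]

report = benchmark(stand_in_model, list(range(8)))
```

On a GPU, kernels typically run asynchronously, so a real harness would need to synchronize the device before reading the clock. Both models in a comparison should also be timed on the same hardware and at the same batch size.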

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that directly respond to the concerns raised.


Referee: [FineSyn Dataset Curation] The central claim of distinguishing the five compositional categories, especially the 'real human never present' case, rests on FineSyn providing verifiable ground truth for non-presence. The manuscript provides no details on curation (e.g., timestamps, metadata, or scene provenance) that would confirm a human was never in the original scene; without this, the category reduces to generic real-human detection and the five-way taxonomy cannot be meaningfully evaluated.

Authors: We agree that the manuscript lacks sufficient detail on FineSyn curation to allow independent verification of the non-presence labels. In the revised version we will add a dedicated subsection describing the dataset construction process, including scene sources, metadata inspection where available, manual verification protocols used to confirm that inserted humans were never physically present in the original scenes, and any timestamp or provenance records collected during curation. revision: yes

Referee: [Experiments and Results] The abstract asserts superior accuracy, generalization, and 100× speed on six datasets plus FineSyn, yet the available description supplies no quantitative results, error bars, ablation tables, or dataset statistics. This prevents verification of the performance claims that are load-bearing for the contribution.

Authors: The referee correctly observes that the abstract contains only qualitative claims. While the full manuscript includes quantitative tables, speed benchmarks, generalization results, ablations, and dataset statistics in the experimental section, we acknowledge that these should be more explicitly summarized for readers. In revision we will insert key numerical results (with error bars where computed) into the abstract and ensure all tables are clearly referenced and complete. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with external validation


The paper presents TruEye as an empirical neural architecture (mask-conditioned dual-stream transformer with region-gated cross attention and token-level supervision) whose performance claims rest on experimental results across six datasets plus the newly curated FineSyn set. No equations, derivations, or first-principles predictions appear in the provided text; the five-category taxonomy and generalization claims are evaluated via held-out test data rather than being defined into existence by the model itself or by self-citation chains. The architecture choices are presented as design decisions, not as outputs forced by prior self-referential results. This is the standard non-circular case for an applied ML detector paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the proposed architecture components produce the reported gains and that the six datasets plus FineSyn adequately represent real-world AI-generated content distributions; no free parameters beyond standard model hyperparameters are explicitly introduced in the abstract.

axioms (1)

[standard math] Standard supervised learning assumptions hold for the token-level and global classification objectives.
Implicit in any transformer-based classification model trained on labeled image data.

invented entities (1)

[no independent evidence] FineSyn dataset
purpose: New benchmark for five-category compositional detection of AI-generated humans.
Introduced as newly curated; no independent evidence of its construction or release is provided in the abstract.



[1] [1]

In: Proceedings of the IEEE Symposium on Security and Privacy (IEEE S&P) (2024)

Abdullah, S.M., Cheruvu, A., Kanchi, S., Chung, T., Gao, P., Jadliwala, M., Viswanath, B.: An analysis of recent advances in deepfake image detection in an evolving threat landscape. In: Proceedings of the IEEE Symposium on Security and Privacy (IEEE S&P) (2024)

2024

[2] [2]

In: CVPR

Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR. pp. 3686–3693 (2014)

2014

[3] [3]

In: ICLR (2018)

Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. In: ICLR (2018)

2018

[4] [4]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

Chen, J., Tan, S., Wang, J., Li, W.: Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14183–14191 (2021)

2021

[5] [5]

Chen, Y., Huang, X., Zhang, Q., Li, W., Zhu, M., Yan, Q., Li, S., Chen, H., Hu, H., Yang, J., Liu, W., Hu, J.: Gim: a million-scale benchmark for generative im- age manipulation detection and localization. In: Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on In- novative Applications of Artificial...

work page doi:10.1609/aaai.v39i2.32231 2025

[6] [6]

In: CVPR

Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation. In: CVPR. pp. 8789–8797 (2018)

2018

[7] [7]

arXiv preprint arXiv:2408.16892 (2024),https://arxiv.org/abs/2408.16892, accessed: Nov 2025

Dagar, D., Vishwakarma, D.K.: Tex-vit: A generalizable, robust, texture-based dual-branch cross-attention deepfake detector. arXiv preprint arXiv:2408.16892 (2024),https://arxiv.org/abs/2408.16892, accessed: Nov 2025

work page arXiv 2024

[8] [8]

In: CVPR

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)

2009

[9] [9]

real or not?

for Good Lab, M.A.: How good are humans at detecting ai-generated images? results from the “real or not?” quiz. arXiv preprint arXiv:2507.18640 (2025), https://arxiv.org/abs/2507.18640, accessed: Nov 2025

work page arXiv 2025

[10] [10]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Guillaro, M., Verdoliva, L., Cozzolino, D.: Trufor: Leveraging trustworthiness and uncertainty for generalized image forgery detection and localization. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 20606–20615 (2023)

2023

[11] [11]

Applied Intelligence 55, 1–15 (2025)

Gupta, A., Sharma, N.: Freqfacenet: An enhanced transformer architecture with dual-order frequency attention for ai-generated face detection. Applied Intelligence 55, 1–15 (2025)

2025

[12] [12]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) TruEye: Detecting AI-Generated Human Subjects 17

Huang, T., Zhang, Q., Li, X., Zhao, H., Chen, P.Y.: Sida: Social media image ai- generation detection, localization and explanation with large multimodal models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) TruEye: Detecting AI-Generated Human Subjects 17

2025

[13] [13]

Huh, M., Liu, A., Owens, A., Efros, A.A.: Fighting fake news: Image splice detec- tion via learned self-consistency. Tech. Rep. UCB/EECS-2018-67, EECS Depart- ment, University of California, Berkeley (2018),https://www2.eecs.berkeley. edu/Pubs/TechRpts/2018/EECS-2018-67.pdf

2018

[14] [14]

Kang, H., Wen, S., Wen, Z., Ye, J., Li, W., Feng, P., Zhou, B., Wang, B., Lin, D., Zhang, L., He, C.: Legion: Learning to ground and explain for synthetic image detection (2025),https://arxiv.org/abs/2503.15264

work page arXiv 2025

[15] [15]

In: ICLR (2018)

Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for im- proved quality, stability, and variation. In: ICLR (2018)

2018

[16] [16]

In: CVPR

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR. pp. 4401–4410 (2019)

2019

[17] [17]

In: CVPR

Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: CVPR. pp. 4401–4410 (2019), introduced FFHQ dataset

2019

[18] [18]

In: CVPR

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: CVPR. pp. 8110–8119 (2020)

2020

[19] [19]

In: ECCV (2022)

Ke, Z., Sun, C., Zhu, L., Xu, K., Lau, R.W.H.: Harmonizer: Learning to perform white-box image and video harmonization. In: ECCV (2022)

2022

[20] [20]

In: Proceedings of the ACM International Conference on Content-Based Multimedia Indexing (CBMI)

Khan, R., Yu, X., Liu, Y.: Hybrid transformer network for ai-generated image detection. In: Proceedings of the ACM International Conference on Content-Based Multimedia Indexing (CBMI). pp. 1–6 (2022)

2022

[21] [21]

arXiv preprint arXiv:2303.00917 (2023), https://arxiv.org/abs/2303.00917, accessed: Nov 2025

Kong, C., Li, H., Wang, S.: Enhancing general face forgery detection via vision transformer with low-rank adaptation. arXiv preprint arXiv:2303.00917 (2023), https://arxiv.org/abs/2303.00917, accessed: Nov 2025

work page arXiv 2023

[22] Le, T.N., Nguyen, H.H., Yamagishi, J., Echizen, I.: Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10117–10127 (2021). https://doi.org/10.1109/ICCV48922.2021.00997

[23] Li, L., Bao, J., Yang, H., Chen, D., Wen, F.: Face x-ray for more general face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5001–5010 (2020)

[24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755 (2014)

[25] Liu, J., Wang, T., Zhang, S., Liu, F.: Frequency-modulated vision transformers for generalized deepfake detection. arXiv preprint arXiv:2306.02412 (2023), https://arxiv.org/abs/2306.02412, accessed: Nov 2025

[26] Liu, Y., He, B., Zhang, W., Guo, Y.: Forgery-aware adaptive transformer for generalizable synthetic image detection. arXiv preprint arXiv:2312.13015 (2023)

[27] News, A.: How south-east asia’s pig butchering scammers are using ai face-swapping and other tools (May 15 2024), https://www.abc.net.au/news/2024-05-16/pig-butchering-scams-artificial-intelligence-ai-face-swapping-/103804830

[28] Nguyen, T., Patel, V.M.: Fakeformer: Efficient vulnerability-driven transformers for generalisable deepfake detection. arXiv preprint arXiv:2411.01256 (2024)

[29] Ojha, U., Li, Y., Lee, Y.J.: Towards universal synthetic image detectors that generalize across generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24480–24489 (2023)

[30] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR. pp. 2337–2346 (2019)

[31] Qu, C., Zhong, Y., Liu, C., Xu, G., Peng, D., Guo, F., Jin, L.: Towards modern image manipulation localization: A large-scale dataset and novel methods. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10781–10790 (June 2024)

[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)

[33] Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Nießner, M.: Faceforensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1–11 (2019)

[34] Tan, H., Zhao, Y., Liu, F.: Frequency-aware vision transformers for detecting ai-generated visual content. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

[35] Wang, M., Li, P., Chen, L.: Frequency domain filtered residual network for ai-generated image detection. Mathematics 11(12), 2782 (2023)

[36] Wang, S.Y., Zhang, O., Owens, A., Efros, A.A.: Cnn-generated images are surprisingly easy to spot... for now. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8695–8704 (2020)

[37] Wang, Y., Huang, Z., Hong, X.: Opensdi: Spotting diffusion-generated images in the open world (2025), https://arxiv.org/abs/2503.19653

[38] Wired: Meta finally breaks its silence on pig butchering (Nov 21 2024), https://www.wired.com/story/meta-pig-butchering-report-2024/

[39] Wu, Y., AbdAlmageed, W., Natarajan, P.: Mantra-net: Manipulation tracing network for detection and localization of image forgeries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9543–9552 (2019)

[40] Zhang, Z., Li, M., Chang, M.C.: A new benchmark and model for challenging image manipulation detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 12759–12768 (2024), https://dl.acm.org/doi/10.1609/aaai.v38i7.28571

[41] Zhao, H., Dong, J., Chen, M.: Multi-attentional deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2185–2194 (2021)

[42] Zhou, X., Williams, J., Groves, M.: As good as a coin toss: Human detection of ai-generated content. Communications of the ACM (2025), https://cacm.acm.org/research/as-good-as-a-coin-toss-human-detection-of-ai-generated-content/, early access, accessed: Nov 2025

[43] Zhou, Y., Fang, X., Chen, P.Y., Wang, S.: Sig: Spatially interpretable grounded deepfake detection via vision-language alignment. In: CVPR. pp. 11854–11863 (2024)

[44] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV. pp. 2223–2232 (2017)