pith. machine review for the scientific record.

arxiv: 2510.04225 · v2 · submitted 2025-10-05 · 💻 cs.CV · cs.AI · cs.CL

Locate-Then-Examine: Grounded Region Reasoning Improves Detection of AI-Generated Images

Pith reviewed 2026-05-18 10:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords AI-generated image detection · vision-language models · region localization · forensic analysis · TRACE dataset · grounded reasoning · two-stage framework

The pith

A two-stage locate-then-examine process improves detection of AI-generated images by grounding verdicts in specific regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Locate-Then-Examine (LTE), a two-stage framework that uses vision-language models to detect AI-generated images. The first stage identifies suspicious regions in the image, and the second stage re-examines those cropped areas together with the full image to produce a refined real-versus-AI verdict and an explanation tied directly to the localized evidence. This setup targets the weakness of standard one-pass classifiers that often miss subtle artifacts in high-quality synthetic images. The authors also release the TRACE dataset of 20,000 images with region-level annotations and forensic explanations to enable training and testing of such methods. A reader would care because the approach supplies not only a classification score but also human-readable, pixel-linked reasons that support forensic use.

Core claim

Locate-Then-Examine (LTE) is a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real versus AI-generated verdict and its explanation, explicitly linking each decision to localized visual evidence through region proposals and region-aware reasoning.

What carries the argument

Locate-Then-Examine (LTE) two-stage process, in which an initial localization stage proposes suspicious regions and a second examination stage jointly reasons over the selected crops and the full image to improve classification accuracy and produce grounded explanations.
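The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `locate_then_examine`, `Verdict`, `stub_locate`, and `stub_examine` are hypothetical, the image is a toy nested list rather than a real image tensor, and the two stage functions are placeholders standing in for VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class Verdict:
    label: str         # "real" or "ai"
    explanation: str   # free-text rationale from the examination stage
    boxes: List[Box]   # regions the rationale is grounded in


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle given by box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def locate_then_examine(
    image: Image,
    locate: Callable[[Image], List[Box]],
    examine: Callable[[Image, Sequence[Image]], Tuple[str, str]],
) -> Verdict:
    """Stage 1 proposes regions; stage 2 sees the crops plus the full image."""
    boxes = locate(image)
    crops = [crop(image, b) for b in boxes]
    label, explanation = examine(image, crops)
    return Verdict(label=label, explanation=explanation, boxes=boxes)


# Hypothetical stand-ins for the two VLM stages, for illustration only.
def stub_locate(image: Image) -> List[Box]:
    return [(0, 0, 2, 2)]


def stub_examine(image: Image, crops: Sequence[Image]) -> Tuple[str, str]:
    bright = any(max(max(row) for row in c) > 200 for c in crops)
    if bright:
        return "ai", "bright patch found inside a proposed region"
    return "real", "no proposed region looked unusual"
```

Passing the stages in as callables keeps the sketch independent of any particular model, and returning the boxes alongside the label is what lets an explanation be traced back to specific regions.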

Load-bearing premise

The region proposals from the first stage must be accurate and informative enough that feeding the selected crops into the second stage produces a meaningfully better real-versus-AI verdict than a single full-image pass.

What would settle it

Replace the learned region proposals with random crops on the same test sets and check whether the accuracy and robustness gains over single-stage baselines disappear.
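One way to build such a control is to keep the number and size of the boxes fixed and randomize only their position, so that any accuracy difference cannot be blamed on crop count or scale. The sketch below assumes boxes are `(left, top, right, bottom)` pixel tuples; the function name `random_boxes_like` is hypothetical and does not come from the paper.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def random_boxes_like(
    proposals: List[Box], width: int, height: int, rng: random.Random
) -> List[Box]:
    """Return one randomly placed box per proposal, with the same size.

    Each box is clipped to the image size first, so the random placement
    always stays inside a width x height image.
    """
    controls = []
    for left, top, right, bottom in proposals:
        w = min(right - left, width)
        h = min(bottom - top, height)
        # randint is inclusive on both ends, so the box can touch the border.
        new_left = rng.randint(0, width - w)
        new_top = rng.randint(0, height - h)
        controls.append((new_left, new_top, new_left + w, new_top + h))
    return controls
```

Seeding the `random.Random` instance makes the control crops reproducible across runs, which matters if the comparison is repeated on several test sets.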

Figures

Figures reproduced from arXiv: 2510.04225 by Bowen Deng, Huijia Zhu, Jianfu Zhang, Jun Lan, Liqing Zhang, Weiqiang Wang, Yan Hong, Yikun Ji.

Figure 1: (a) Without revisiting specific details, VLMs may overlook critical cues and produce false reasoning
Figure 2: The proposed data annotation pipeline. We ask the forensics expert VLM in Query 1 “Explanation
Figure 3: Examples from the test set of MagniFake, captions are summarized from the Query 2 response,
Figure 5: The relation of accuracy with regard to the
Figure 6: An example of where ZoomIn-32B corrects its initial mistake upon
Figure 7: A statistical analysis of keywords in MagniFake explanations.
Figure 8: A collection of images from MagniFake with rendered bounding boxes. The first row shows
Figure 9: A visualization of the attention mechanisms of the VLM in Query 1.
Figure 10: Number of samples grouped by the bounding boxes on OoD datasets. Residual plot labels: x-axis "# Bounding Boxes", y-axis "Accuracy", models ZoomIn-32B and ZoomIn-7B.
Original abstract

The rapid growth of AI-generated imagery has blurred the boundary between real and synthetic content, raising practical concerns for digital integrity. Vision-language models (VLMs) can provide natural language explanations, but standard one-pass classifiers often miss subtle artifacts in high-quality synthetic images and offer limited grounding in the pixels. We propose Locate-Then-Examine (LTE), a two-stage VLM-based forensic framework that first localizes suspicious regions and then re-examines these crops together with the full image to refine the real vs. AI-generated verdict and its explanation. LTE explicitly links each decision to localized visual evidence through region proposals and region-aware reasoning. To support training and evaluation, we introduce TRACE, a dataset of 20,000 real and high-quality synthetic images with region-level annotations and automatically generated forensic explanations, constructed by a VLM-based pipeline with additional consistency checks and quality control. Across TRACE and multiple external benchmarks, LTE achieves competitive accuracy and improved robustness while providing human-understandable, region-grounded explanations suitable for forensic deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Locate-Then-Examine (LTE), a two-stage VLM framework for AI-generated image detection: stage 1 localizes suspicious regions, and stage 2 re-examines the selected crops together with the full image to produce a refined real-vs-AI verdict and region-grounded explanation. The authors introduce TRACE, a 20,000-image dataset of real and high-quality synthetic images with region annotations and VLM-generated explanations constructed via a pipeline with consistency checks. Empirical results claim competitive accuracy and improved robustness on TRACE plus external benchmarks, with outputs suitable for forensic use.

Significance. If the performance gains can be attributed specifically to the grounded localization mechanism rather than dataset artifacts or multi-pass VLM effects, the work would advance explainable forensic methods by addressing the limited pixel grounding of standard one-pass classifiers on high-quality synthetics.

major comments (2)
  1. [Experiments] Experiments section: no ablation isolates the contribution of the stage-1 localized crops by comparing LTE against variants that feed random crops or the full image (without localization) into stage 2. This control is required to substantiate the central claim that grounded region reasoning, rather than additional VLM passes, drives the reported accuracy and robustness gains on TRACE and external benchmarks.
  2. [§3] §3 (TRACE Dataset Construction): the dataset is generated by a VLM pipeline with consistency checks that is architecturally similar to LTE itself; this introduces a risk that measured improvements on TRACE partly reflect self-consistency with the generation process rather than independent generalization of the two-stage method.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'improved robustness' should specify the exact metrics, baselines, and perturbation types used to support this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: no ablation isolates the contribution of the stage-1 localized crops by comparing LTE against variants that feed random crops or the full image (without localization) into stage 2. This control is required to substantiate the central claim that grounded region reasoning, rather than additional VLM passes, drives the reported accuracy and robustness gains on TRACE and external benchmarks.

    Authors: We agree that an explicit ablation isolating the role of localized crops versus random crops or full-image input is needed to strengthen the central claim. In the revised manuscript we will add these controls to the Experiments section, reporting accuracy and robustness metrics for (i) random crops fed to stage 2 and (ii) full-image input without localization. These results will clarify whether the observed gains derive specifically from grounded region reasoning rather than additional VLM passes alone. revision: yes

  2. Referee: [§3] §3 (TRACE Dataset Construction): the dataset is generated by a VLM pipeline with consistency checks that is architecturally similar to LTE itself; this introduces a risk that measured improvements on TRACE partly reflect self-consistency with the generation process rather than independent generalization of the two-stage method.

    Authors: We acknowledge the potential for self-consistency bias. The TRACE pipeline uses different VLM prompts, cross-model consistency verification, and separate quality-control stages from the LTE inference procedure. In addition, LTE shows comparable gains on multiple external benchmarks constructed independently of our pipeline. In the revision we will expand §3 with a paragraph explicitly contrasting the dataset-generation process with the LTE inference procedure, and we will add further analysis emphasizing external-benchmark results to mitigate concerns about circularity. revision: partial
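The ablation promised in the first response amounts to scoring several stage-2 input variants against the same labels and reporting each one relative to the full method. A minimal sketch of that bookkeeping follows; the function names and the variant keys used with it are hypothetical, and no numbers from the paper are implied.

```python
from typing import Dict, List


def accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if not labels or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def ablation_table(
    preds_by_variant: Dict[str, List[str]],
    labels: List[str],
    reference: str,
) -> Dict[str, Dict[str, float]]:
    """Accuracy per variant, plus its difference from the reference variant."""
    ref_acc = accuracy(preds_by_variant[reference], labels)
    table = {}
    for name, preds in preds_by_variant.items():
        acc = accuracy(preds, labels)
        table[name] = {"accuracy": acc, "delta_vs_reference": acc - ref_acc}
    return table
```

Under this layout, the referee's concern would be addressed if the variants fed random crops or the full image alone showed a clearly negative delta against the localized-crop variant on the same test sets; a delta near zero would suggest the gains come from the extra VLM pass rather than from localization.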

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper proposes an empirical two-stage VLM framework (Locate-Then-Examine) whose performance is measured on a newly introduced dataset (TRACE) plus external benchmarks. No equations, predictions, or first-principles results are shown to reduce by construction to fitted inputs or self-citations. Dataset construction via VLM pipeline with consistency checks introduces some methodological overlap with the proposed method, but this does not create a definitional or statistical tautology; external benchmarks supply independent grounding. The central claims rest on reported accuracy and robustness numbers rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework assumes that VLM region proposals capture forensic artifacts and that joint re-examination of crops plus full image yields better decisions than single-pass classification. No new physical entities or ad-hoc constants are introduced; the main added elements are the dataset construction pipeline and the two-stage inference procedure.

axioms (2)
  • domain assumption: Vision-language models can produce reliable region proposals for forensic artifacts in both real and synthetic images.
    Invoked in the description of the first stage of LTE.
  • domain assumption: Re-examining localized crops together with the full image improves verdict accuracy and explanation quality.
    Central premise of the Locate-Then-Examine design.

pith-pipeline@v0.9.0 · 5733 in / 1420 out tokens · 26849 ms · 2026-05-18T10:09:59.060358+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Venus-DeFakerOne: Unified Fake Image Detection & Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

  2. HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild

    cs.CV 2026-04 unverdicted novelty 4.0

    HEDGE is a heterogeneous ensemble using progressive DINOv3 training, multi-scale features, and MetaCLIP2 diversity with dual-gating fusion to achieve robust AI-generated image detection and 4th place in the NTIRE 2026...
