pith. sign in

arxiv: 2606.19259 · v1 · pith:G6KBJRVNnew · submitted 2026-06-17 · 💻 cs.CV · cs.AI

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords AI-generated imagestext-rich imagesimage detection benchmarkGPT-Image-2domain dependenceJPEG compression robustnessmultimodal detectionlayout-aware detection
0
0 comments X

The pith

Existing detectors for AI-generated text-rich images perform inconsistently across domains and degrade sharply under JPEG compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark of 8602 images generated by GPT-Image-2 across six categories that emphasize text and layout: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Five representative detectors are tested in a zero-shot setting on the full set, broken down by category, and after JPEG compression. Results indicate that accuracy depends heavily on the specific domain, with strong performance in one category often paired with failure in others. The strongest conventional detector still shows severe drops after compression. An exploratory test with a vision-language model is included to probe structured formats.

Core claim

The paper establishes that detector performance on GPT-Image-2 text-rich images is highly domain-dependent, with methods succeeding in some categories but failing in others, and that even leading conventional detectors lose effectiveness under JPEG compression.

What carries the argument

The multi-domain benchmark of 8602 images spanning commercial posters, infographics, academic posters, receipts, tables, and UI screenshots.

If this is right

  • No single detector can be relied upon across all text-rich image categories.
  • JPEG compression causes large accuracy drops even for the best conventional methods.
  • Multimodal vision-language models show mixed results when applied to structured text formats.
  • Detection methods need to incorporate awareness of textual semantics and layout organization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment in privacy-sensitive settings such as receipt or transaction verification would face particular risks from current detector weaknesses.
  • The benchmark could be used to test whether text-recognition pipelines combined with forensic signals close the identified gaps.
  • Extending the evaluation to additional post-processing operations beyond JPEG would clarify the practical robustness limits.
  • Similar domain-dependence patterns may appear when testing detectors on text-rich images from generators other than GPT-Image-2.

Load-bearing premise

The 8602 generated images across the six categories and the five evaluated detectors are representative of real-world text-rich AI-generated content and available detection methods.

What would settle it

Applying the same five detectors to a fresh collection of text-rich images produced by a different generator and checking whether the patterns of domain dependence and JPEG sensitivity persist or disappear.

Figures

Figures reproduced from arXiv: 2606.19259 by Shuyi Wang, Wenhan Zhang, Yijin Wang, Yuqi Ouyang.

Figure 1
Figure 1. Figure 1: Representative AI-generated samples across six text-rich image categories. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of representative text-rich datasets and our benchmark. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the data-guided prompt construction pipeline. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average F1 score across different text￾rich image categories. Detector TP TN Accuracy Precision Recall F1 Dima806 [5] 4263 861 0.5957 0.6673 0.7591 0.7103 prithivMLmods [18] 2756 1417 0.4851 0.6372 0.4907 0.5545 HPAI [10, 1] 5411 208 0.6532 0.6608 0.9635 0.7839 CNNSpot [23] 163 2970 0.3642 0.9106 0.0290 0.0563 NPR [22] 4008 2912 0.8045 0.9819 0.7137 0.8266 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance of NPR under different post-processing conditions. Model Min F1 Min Category Max F1 Max Category Dima806 [5] 0.36 Receipt 93.77 Commercial Poster prithivMLmods [18] 3.32 [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Category-wise Accuracy of GPT-5.5 on the Proposed Benchmark To further interpret its behavior, we manually reviewed the explanations generated by GPT-5.5 and identified recurring reasoning patterns across different categories. The explanations suggest that GPT-5.5 frequently relies on textual, semantic, and structural consistency cues in addition to visual appearance. This may partially explain its strong … view at source ↗
Figure 9
Figure 9. Figure 9: Representative anomalies observed in GPT-Image-2-generated text-rich images across six [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
read the original abstract

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a benchmark of 8602 text-rich images generated by GPT-Image-2 across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots). It evaluates five detectors zero-shot, reports domain-dependent performance and severe JPEG sensitivity even for the strongest detector, and includes an exploratory VLM evaluation to argue for text- and layout-aware detection methods.

Significance. A well-scoped benchmark on text-rich AI images would be useful given the privacy and authenticity stakes, and the empirical finding of domain dependence plus compression fragility is a concrete observation worth documenting. The single-generator design, however, caps the result at model-specific insights rather than a general statement about AI-generated text-rich images.

major comments (2)
  1. [Abstract] Abstract and title: the work is framed as addressing detection of 'AI-generated text-rich images' broadly and the need for text/layout-aware methods for 'modern AI-generated images,' yet the entire benchmark and all evaluations use only GPT-Image-2 generations. Without cross-generator controls, the reported domain dependence and JPEG sensitivity could be artifacts of GPT-Image-2-specific text rendering rather than intrinsic properties of the domain.
  2. [Abstract] The central empirical claim (domain-dependent detector performance) rests on 8602 images from one generator whose text, font, and layout artifacts may differ systematically from those of Stable Diffusion, Midjourney, or other models on which the five evaluated detectors were presumably trained. This generator mismatch is load-bearing for the generalizability conclusion.
minor comments (1)
  1. [Abstract] The dataset release URL is listed as 'XXX'; replace with the actual link before publication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of precise scoping. Our work is explicitly focused on GPT-Image-2 as indicated by the title; we will revise the abstract and add a limitations discussion to eliminate any broader framing that could imply generalizability across generators.

read point-by-point responses
  1. Referee: [Abstract] Abstract and title: the work is framed as addressing detection of 'AI-generated text-rich images' broadly and the need for text/layout-aware methods for 'modern AI-generated images,' yet the entire benchmark and all evaluations use only GPT-Image-2 generations. Without cross-generator controls, the reported domain dependence and JPEG sensitivity could be artifacts of GPT-Image-2-specific text rendering rather than intrinsic properties of the domain.

    Authors: The title explicitly names GPT-Image-2, and the benchmark construction and all experiments are confined to this generator. The abstract's closing reference to 'modern AI-generated images' is motivational rather than a claim of tested generality. We will revise the abstract to state that findings on domain dependence and JPEG sensitivity apply to GPT-Image-2 generations and will add an explicit limitations paragraph noting that cross-generator validation remains future work. revision: partial

  2. Referee: [Abstract] The central empirical claim (domain-dependent detector performance) rests on 8602 images from one generator whose text, font, and layout artifacts may differ systematically from those of Stable Diffusion, Midjourney, or other models on which the five evaluated detectors were presumably trained. This generator mismatch is load-bearing for the generalizability conclusion.

    Authors: The central claim is domain dependence within the GPT-Image-2 benchmark, not a universal property of all AI text rendering. Because the detectors were evaluated zero-shot, any training mismatch is already implicit; we do not assert that the same patterns would hold for other generators. We will strengthen the limitations section to state that results are model-specific and that extending the benchmark to additional generators would be required for broader conclusions. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or predictions

full rationale

The paper creates and evaluates a benchmark dataset of 8602 GPT-Image-2 images across six categories, testing five detectors in zero-shot mode. No equations, fitted parameters, predictions, or derivation chains appear. All reported results (domain-dependent performance, JPEG sensitivity) are direct empirical measurements on the released images. No self-citations are load-bearing for any central claim, and the work is self-contained as an empirical contribution without reducing any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and evaluation paper with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5766 in / 1021 out tokens · 21797 ms · 2026-06-26T21:30:29.828585+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 7 canonical work pages

  1. [1]

    Present and future generalization of synthetic image detectors

    Pablo Bernabeu-Pérez, Enrique Lopez-Cuena, and Dario Garcia-Gasulla. Present and future generalization of synthetic image detectors. InJoint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 3–20. Springer, 2025

  2. [2]

    Textdiffuser: Diffusion models as text painters

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 9353–9387. Curran Associates, Inc., 2023

  3. [3]

    Raising the bar of ai-generated image detection with clip, 2024

    Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip, 2024. URL https://arxiv.org/ abs/2312.00195

  4. [4]

    Practical analyses of how common social media platforms and photo storage services handle uploaded images

    Duc-Tien Dang-Nguyen, Vegard Velle Sjøen, Dinh-Hai Le, Thien-Phu Dao, Anh-Duy Tran, and Minh-Triet Tran. Practical analyses of how common social media platforms and photo storage services handle uploaded images. InInternational conference on multimedia modeling, pages 164–176. Springer, 2023

  5. [5]

    ai_vs_real_image_detection

    dima806. ai_vs_real_image_detection. Hugging Face model repository, 2024. Available at: https://huggingface.co/dima806/ai_vs_real_image_detection

  6. [6]

    Deep learning based visually rich document content understanding: A survey.Artificial Intelligence Review, 2026

    Yihao Ding, Soyeon Caren Han, Jean Lee, and Eduard Hovy. Deep learning based visually rich document content understanding: A survey.Artificial Intelligence Review, 2026. 11

  7. [7]

    Pei Fu, Tongkun Guan, Zining Wang, Zhentao Guo, Chen Duan, Hao Sun, Boming Chen, Qianyi Jiang, Jiayao Ma, Kai Zhou, et al. Multimodal large language models for text-rich image understanding: A comprehensive review.Findings of the Association for Computational Linguistics: ACL 2025, pages 19941–19958, 2025

  8. [8]

    Using facebook for image steganography

    Jason Hiney, Tejas Dakve, Krzysztof Szczypiorski, and Kris Gaj. Using facebook for image steganography. In2015 10th international conference on availability, reliability and security, pages 442–447. IEEE, 2015

  9. [9]

    Wildfake: A large-scale challenging dataset for ai-generated images detection, 2024

    Yan Hong and Jianfu Zhang. Wildfake: A large-scale challenging dataset for ai-generated images detection, 2024. URLhttps://arxiv.org/abs/2402.11843

  10. [10]

    SuSy: Spatial-Based Synthetic Image Detection and Recognition Model

    HPAI-BSC. SuSy: Spatial-Based Synthetic Image Detection and Recognition Model. Hugging Face model repository, 2024. Available at:https://huggingface.co/HPAI-BSC/SuSy

  11. [11]

    Scifigdetect: A benchmark for ai-generated scientific figure detection, 2026

    You Hu, Chenzhuo Zhao, Changfa Mo, Haotian Liu, and Xiaobai Li. Scifigdetect: A benchmark for ai-generated scientific figure detection, 2026. URL https://arxiv.org/abs/2604. 08211

  12. [12]

    Icdar2019 competition on scanned receipt ocr and information extraction

    Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 competition on scanned receipt ocr and information extraction. In2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE, 2019

  13. [13]

    Enrico: A dataset for topic modeling of mobile ui designs

    Luis A Leiva, Asutosh Hota, and Antti Oulasvirta. Enrico: A dataset for topic modeling of mobile ui designs. In22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, pages 1–4, 2020

  14. [14]

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V . Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1697–1706, January 2022

  15. [15]

    org/abs/2302.10174

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models, 2024. URLhttps://arxiv.org/abs/2302.10174

  16. [16]

    Introducing ChatGPT Images 2.0, 2026

    OpenAI. Introducing ChatGPT Images 2.0, 2026. URL https://openai.com/index/ introducing-chatgpt-images-2-0/. Accessed: 2026-05-25

  17. [17]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors, International Conference on Learning Representations, volume 2024, pages 1862–1874, 2024

  18. [18]

    deepfake-detector-model-v1

    PrithivMLmods. deepfake-detector-model-v1. Hugging Face model repository, 2025. Available at:https://huggingface.co/prithivMLmods/deepfake-detector-model-v1

  19. [19]

    Revisiting tampered scene text detection in the era of generative ai

    Chenfan Qu, Yiwu Zhong, Fengjun Guo, and Lianwen Jin. Revisiting tampered scene text detection in the era of generative ai. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 694–702, 2025

  20. [20]

    Mitigating dataset bias in image captioning through clip confounder-free captioning network

    Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, and Shaikh Anowarul Fattah. Artifact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection. In2023 IEEE International Conference on Image Processing (ICIP), pages 2200–2204, 2023. doi: 10.1109/ICIP49359.2023.10222083

  21. [21]

    Postersum: A multimodal benchmark for scientific poster summarization

    Rohit Saxena, Pasquale Minervini, and Frank Keller. Postersum: A multimodal benchmark for scientific poster summarization. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 1828–1844, 2025

  22. [22]

    Rethink- ing the up-sampling operations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethink- ing the up-sampling operations in cnn-based generative network for generalizable deepfake detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 28130–28139, 2024. 12

  23. [23]

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn- generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

  24. [24]

    Detecting tampered scene text in the wild

    Yuxin Wang, Hongtao Xie, Mengting Xing, Jing Wang, Shenggao Zhu, and Yongdong Zhang. Detecting tampered scene text in the wild. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 215–232, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19815-1

  25. [25]

    Dire for diffusion-generated image detection, 2023

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection, 2023. URL https://arxiv.org/ abs/2303.09295

  26. [26]

    Aiforge-doc: A benchmark for detecting ai-forged tampering in financial and form documents, 2026

    Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, and Jingheng Huan. Aiforge-doc: A benchmark for detecting ai-forged tampering in financial and form documents, 2026. URLhttps://arxiv.org/abs/2602.20569

  27. [27]

    Gpt4o-receipt: A dataset and human study for ai-generated document forensics, 2026

    Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng, Alex Shen, Jiayu Xue, Yuxin Zhang, and Evelyn Marotta. Gpt4o-receipt: A dataset and human study for ai-generated document forensics, 2026. URLhttps://arxiv.org/abs/2603.11442

  28. [28]

    Image-based table recognition: data, model, and evaluation

    Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. Image-based table recognition: data, model, and evaluation. InEuropean conference on computer vision, pages 564–580. Springer, 2020

  29. [29]

    Genimage: A million-scale benchmark for detecting ai-generated image

    Mingjian Zhu, Hanting Chen, Qiangyu YAN, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 77771–77782. Curra...