arxiv: 2604.25213 · v1 · submitted 2026-04-28 · 💻 cs.CV

When the Forger Is the Judge: GPT-Image-2 Cannot Recognize Its Own Faked Documents

Pith reviewed 2026-05-07 16:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords document forgery · AI inpainting · forensic detection · image tampering · human perception · AI self-detection · receipt editing

The pith

GPT-Image-2 inpainting produces document forgeries that humans cannot identify above chance and that automated detectors, including the model itself, identify only marginally above chance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that GPT-Image-2 can replace elements such as numbers on receipts through inpainting in a way that leaves no reliable visual cue separating real documents from edited ones. A new paired dataset of 3,066 such forgeries is used to test human viewers in side-by-side comparisons and three automated judges. Humans reach only chance accuracy (0.501), while forensic tools that score well on traditional tampering lose 0.27 to 0.36 AUC when switched to the AI edits. The same GPT-Image-2 model, asked whether any region of an image was AI-generated or edited, stays near chance, with AUC never above 0.59 across five prompt strategies. Together these results isolate a detection gap specific to the model's inpainting process.

Core claim

GPT-Image-2 inpainting erases the visual boundary between authentic and AI-edited document images. Human inspectors achieve only chance-level accuracy in paired comparisons, standard forensic detectors that reach high accuracy on cross-camera splicing or OCR-token splicing fall sharply when applied to the AI edits, and GPT-Image-2 itself, asked zero-shot to flag any AI-generated or edited region, remains near chance across prompt strategies.

What carries the argument

The side-by-side performance comparison of detectors on GPT-Image-2 inpainted documents versus the same detectors calibrated on traditional tampering of the identical source documents.

If this is right

  • Document verification pipelines that rely on current forensic detectors lose reliability when facing GPT-Image-2 edits.
  • Human review of documents cannot serve as a standalone safeguard against this form of AI tampering.
  • Prompting a generative model to detect its own outputs does not yield a functional self-supervision signal.
  • New paired datasets with pixel masks are required to develop and test detectors specific to AI inpainting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gap may appear with other current image-generation models that support precise region editing.
  • Document security practices may need to shift toward embedded provenance data or cryptographic signing rather than visual inspection alone.
  • Low-cost creation of convincing forgeries could expand the scale of targeted fraud in financial and identity documents.
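
As one illustration of the cryptographic-signing direction mentioned above, here is a minimal sketch that tags a document's bytes with an HMAC so that any later pixel edit invalidates the tag. The function names, the key, and the byte strings are hypothetical and not from the paper; a real deployment would more likely use public-key signatures or an established provenance standard.

```python
import hashlib
import hmac


def sign_document(document_bytes: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the raw document bytes."""
    return hmac.new(key, document_bytes, hashlib.sha256).hexdigest()


def verify_document(document_bytes: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time; any byte-level edit makes this fail."""
    expected = sign_document(document_bytes, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical stand-ins for an issuer key and a receipt image's bytes.
issuer_key = b"example-issuer-key"
original = b"RECEIPT total=12.50"
edited = b"RECEIPT total=92.50"

tag = sign_document(original, issuer_key)
print(verify_document(original, issuer_key, tag))  # the untouched file verifies
print(verify_document(edited, issuer_key, tag))    # the edited file does not
```

The check depends only on the bytes and the key, not on whether the edit is visually detectable, which is why it sidesteps the gap the paper measures.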

Load-bearing premise

The traditional-tampering calibration sets fairly measure each detector's capability on the source domain, so the AUC drop under GPT-Image-2 edits reflects the type of edit rather than an artifact of how the calibration sets were built.

What would settle it

An experiment that retrains or fine-tunes one of the forensic detectors on the released GPT-Image-2 forgery set and then measures whether accuracy on held-out GPT-Image-2 edits recovers to the level previously seen on traditional tampering.
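
The comparison this experiment calls for comes down to computing AUC on held-out GPT-Image-2 edits before and after fine-tuning. Below is a minimal sketch, assuming per-image detector scores and binary labels (1 = forged, 0 = authentic) are available as lists. The function name and the toy scores are hypothetical and are not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a forged image outscores an authentic one.

    Ties between a forged and an authentic score count as half a win.
    """
    forged = [s for y, s in zip(labels, scores) if y == 1]
    authentic = [s for y, s in zip(labels, scores) if y == 0]
    if not forged or not authentic:
        raise ValueError("need at least one forged and one authentic example")
    wins = 0.0
    for f in forged:
        for a in authentic:
            if f > a:
                wins += 1.0
            elif f == a:
                wins += 0.5
    return wins / (len(forged) * len(authentic))


# Toy held-out set: made-up scores for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores_before = [0.40, 0.60, 0.50, 0.50, 0.45, 0.55]  # detector as released
scores_after = [0.90, 0.80, 0.70, 0.30, 0.20, 0.60]   # detector after fine-tuning

auc_before = roc_auc(labels, scores_before)
auc_after = roc_auc(labels, scores_after)
print(f"AUC before: {auc_before:.3f}, after: {auc_after:.3f}")
```

The pairwise form is quadratic in the number of images, which is acceptable for a test set of a few thousand pairs; a rank-based implementation would scale better.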

Figures

Figures reproduced from arXiv: 2604.25213 by Ankit Raj, Dennis Tsang Ng, Jiaqi Wu, Kidus Zewde, Simiao Ren, Tommy Duong, Xingyu Shen, Yuchen Zhou.

Figure 1: Can you tell which document is real? Two paired examples from AIForge-Doc v2.
Figure 2: (a) ROC curves on the full v2 test set. All three judges remain close to the chance diagonal: TruFor…
Figure 3: Judge biases on AIForge-Doc v2. (a) Self-judge confusion matrix: model says REAL on…

Original abstract

OpenAI's GPT-Image-2 has effectively erased the visual boundary between authentic and AI-edited document images: a single number on a receipt can be replaced in under a second for a few cents. We release AIForge-Doc v2, a paired dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format, and benchmark four lines of defence: human inspectors (N=120, n=365 pair-votes via the public 2AFC site CanUSpotAI.com), TruFor (generic forensic), DocTamper (qcf-568, document-specific), and the same GPT-Image-2 model as a zero-shot self-judge -- asked, to avoid the trivial "image is mostly real" reading, whether any region was generated or edited by an AI image model. Human 2AFC accuracy is 0.501, indistinguishable from chance: even side-by-side, inspectors cannot tell GPT-Image-2 receipt forgeries from authentic counterparts. The three computational judges sit only modestly above (TruFor 0.599, DocTamper 0.585, self-judge 0.532). The self-judge fails consistently, not by chance: across five prompt strategies and four policies for handling ambiguous responses, AUC never rises above 0.59. To rule out the possibility that the two forensic detectors are broken on our source domain rather than blind to AI inpainting, we calibrate each on a same-domain traditional-tampering set built for its training distribution: TruFor reaches AUC 0.962 on cross-camera splicing of our dataset, DocTamper reaches 0.852 on cross-document OCR-token splicing with two-pass JPEG re-encoding. Both retain near-published performance on traditional tampering; switching to GPT-Image-2 inpainting drops AUC by 0.27-0.36 (0.962->0.599 TruFor; 0.852->0.585 DocTamper), isolating a detection gap specific to GPT-Image-2 inpainting. We release the dataset, pipeline, four-judge protocol, and calibration sets.
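
The AUC drops quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the four AUC values reported there; the dictionary layout and function name are illustrative.

```python
# AUC values as reported in the abstract.
reported = {
    "TruFor": {"traditional": 0.962, "gpt_image_2": 0.599},
    "DocTamper": {"traditional": 0.852, "gpt_image_2": 0.585},
}


def auc_drop(aucs):
    """Difference between AUC on traditional tampering and on GPT-Image-2 edits."""
    return round(aucs["traditional"] - aucs["gpt_image_2"], 3)


drops = {name: auc_drop(aucs) for name, aucs in reported.items()}
for name, drop in drops.items():
    print(f"{name}: AUC drop {drop:.3f}")
```

The two differences, 0.363 for TruFor and 0.267 for DocTamper, match the stated range of 0.27 to 0.36 after rounding to two decimals.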

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces AIForge-Doc v2, a dataset of 3,066 GPT-Image-2 document forgeries with pixel-precise masks in DocTamper-compatible format. It evaluates four judges: human 2AFC (accuracy 0.501 on 365 pair-votes from 120 participants), TruFor (AUC 0.599 on GPT-Image-2 edits vs. 0.962 on same-domain cross-camera splicing), DocTamper (AUC 0.585 vs. 0.852 on cross-document OCR-token splicing), and GPT-Image-2 as a zero-shot self-judge (0.532, with AUC never above 0.59 across five prompt strategies and four policies for ambiguous responses). The central result isolates a detection gap specific to GPT-Image-2 inpainting by showing that both forensic detectors retain near-published performance on traditional-tampering calibration sets.
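
The human 2AFC figure in the summary can be sanity-checked against chance with a confidence interval on the proportion. This is a minimal sketch using a Wilson score interval. The count of 183 correct votes is an assumption inferred by rounding 0.501 × 365 and is not stated in the source, and treating the 365 votes as independent ignores that they come from 120 participants.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n_votes = 365
# Assumption: the correct-vote count is inferred from the reported accuracy.
n_correct = round(0.501 * n_votes)

low, high = wilson_interval(n_correct, n_votes)
print(f"{n_correct}/{n_votes} correct, 95% interval [{low:.3f}, {high:.3f}]")
print("interval contains chance (0.5):", low < 0.5 < high)
```

Under these assumptions the interval straddles 0.5, consistent with the paper's description of human accuracy as indistinguishable from chance.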

Significance. If the results hold, the work demonstrates a practically important gap in document forensic detectors against current AI inpainting, with direct implications for receipt and document verification. Strengths include the release of the paired dataset, calibration sets, and four-judge protocol, plus convergent evidence from human study and multiple self-judge strategies. The calibration experiments (TruFor 0.962, DocTamper 0.852 on traditional edits) provide a clean isolation of the AI-specific failure without circularity.

minor comments (3)
  1. Abstract: the phrase 'four lines of defence' lists humans, TruFor, DocTamper, and self-judge; a brief parenthetical clarifying that the self-judge is an additional zero-shot evaluation would avoid minor miscounting.
  2. Human study section: the 2AFC protocol (N=120, n=365) is well-described, but adding a short note on how pair selection avoided obvious cues (e.g., lighting or resolution) would strengthen reproducibility.
  3. Dataset release: the DocTamper-compatible mask format is a clear asset; including a brief table of document categories and forgery locations in the supplementary material would aid downstream use.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, accurate summary of the central result, and recommendation to accept. The emphasis on the calibration experiments correctly identifies how they isolate the GPT-Image-2-specific detection gap without circularity.

Circularity Check

0 steps flagged

No significant circularity

The paper reports only direct empirical AUC measurements and human 2AFC accuracy on held-out paired data. Calibration on traditional tampering (cross-camera splicing, OCR-token splicing) is performed on separate sets to establish baseline detector performance on the source domain; the subsequent drop under GPT-Image-2 inpainting is a measured difference between two independent test conditions, not a quantity derived from or fitted to the target result. No equations, self-definitions, ansatzes, or load-bearing self-citations appear in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on empirical benchmarks and dataset construction rather than new theoretical derivations or postulated entities.

axioms (1)
  • domain assumption: The released dataset and calibration sets are representative of typical GPT-Image-2 document editing use cases.
    Performance claims assume the 3,066 forgeries and traditional-tampering calibration sets capture the relevant distribution.

pith-pipeline@v0.9.0 · 5739 in / 1277 out tokens · 54025 ms · 2026-05-07T16:48:08.767951+00:00 · methodology
