arxiv: 2606.21138 · v1 · pith:7YPATN3Dnew · submitted 2026-06-19 · 💻 cs.CV

SEED: Simple ViT and Evolving Harness for Explainable Text Forgery Detection

Pith reviewed 2026-06-26 14:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords: text forgery detection, vision transformer, multimodal large language model, image forensics, forgery localization, LoRA adaptation, explainable detection

The pith

A single LoRA-adapted ViT plus an evolving MLLM harness detects, localizes, and explains text forgeries in images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SEED to produce structured forensic reports that include detection, pixel-level localization, and natural-language explanations for AI-edited text in images. It builds the system from three parts: a similarity-guided pipeline that creates varied synthetic training examples, one Vision Transformer adapted via LoRA on DINOv3 to handle both detection and localization with few new parameters, and an evolving harness that feeds the detector outputs into a multimodal large language model and refines the full report through repeated proposer-evaluator steps. A sympathetic reader would care because the approach keeps the vision model simple while adding an iterative language-model stage that turns raw predictions into usable explanations, and the resulting system placed third in the GenText-Forensics Challenge.
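
The summary does not say how similarity guides the augmentation. One plausible reading, assumed here purely for illustration, is that donor text crops are ranked by embedding similarity to the region being edited, so that pasted content resembles its surroundings. The function names, the toy 2-D embeddings, and the top-k rule below are all hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_donors(target_emb, candidates, k=1):
    """Return the names of the k candidates most similar to the target embedding.

    `candidates` maps a (hypothetical) crop name to its embedding vector.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine(target_emb, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings standing in for features of text crops (illustrative only).
toy_target = [1.0, 0.0]
toy_candidates = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7]}
toy_donors = pick_donors(toy_target, toy_candidates, k=2)
```

In a real pipeline the embeddings would come from a learned encoder, and the selected donor would then be blended into the target region to create a synthetic forgery with a known mask.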

Core claim

SEED shows that a similarity-guided augmentation pipeline, a single LoRA-adapted ViT built on DINOv3, and an evolving MLLM harness with a proposer-evaluator loop can jointly perform detection, pixel-level localization, and generation of complete multilingual forensic reports for text-centric forgeries.
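
For readers unfamiliar with LoRA, the adaptation named in the core claim can be sketched as a frozen weight W plus a trainable low-rank product, W' = W + (alpha / r) B A, following the standard LoRA formulation. The matrix sizes, rank, and alpha below are illustrative assumptions, not values reported by the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A, where W stays frozen.

    w: d_out x d_in frozen weight
    a: r x d_in trainable matrix
    b: d_out x r trainable matrix (commonly initialised to zeros)
    """
    r = len(a)                      # rank is the number of rows of A
    delta = matmul(b, a)            # d_out x d_in low-rank update
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in A and B for one adapted matrix."""
    return r * (d_in + d_out)


# Tiny example: with B = 0 the adapted layer equals the frozen one.
frozen = [[1.0, 0.0], [0.0, 1.0]]
a_mat = [[1.0, 2.0]]                # r = 1, d_in = 2
unchanged = lora_weight(frozen, a_mat, [[0.0], [0.0]], alpha=2.0)
adapted = lora_weight(frozen, a_mat, [[1.0], [2.0]], alpha=2.0)
```

With an illustrative 768 x 768 projection and rank 8, `trainable_params` gives 12,288 trainable values against 589,824 in the frozen matrix, which is the sense in which the adaptation adds few new parameters.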

What carries the argument

The evolving harness that iteratively refines MLLM-generated reports by running a proposer-evaluator loop on the detector outputs.
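
A proposer-evaluator loop of this kind can be sketched as a simple keep-the-best search. The summary does not specify whether the loop rewrites the report itself or the prompt and scaffolding that produce it, so the candidate here is just a string; `propose` and `evaluate` are hypothetical callables standing in for the MLLM proposer and the quality scorer, and the toy versions exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def evolve(
    initial: str,
    propose: Callable[[str, History], str],
    evaluate: Callable[[str], float],
    rounds: int = 5,
) -> Tuple[str, float, History]:
    """Repeatedly propose a candidate, score it, and keep it only if it beats the best so far."""
    best, best_score = initial, evaluate(initial)
    history: History = [(best, best_score)]
    for _ in range(rounds):
        candidate = propose(best, history)
        score = evaluate(candidate)
        history.append((candidate, score))
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, history


# Toy stand-ins (assumptions): the "report" is scored by which sections it mentions.
REQUIRED = ("verdict", "region", "explanation")


def toy_evaluate(candidate: str) -> float:
    return sum(key in candidate for key in REQUIRED) / len(REQUIRED)


def toy_propose(best: str, history: History) -> str:
    for key in REQUIRED:
        if key not in best:
            return best + " " + key
    return best


toy_best, toy_score, toy_history = evolve("verdict", toy_propose, toy_evaluate, rounds=3)
```

The design choice worth noting is that the evaluator is the only source of signal: if its score does not track real report quality, the loop will optimize the wrong thing, which is the concern raised under the load-bearing premise.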

If this is right

  • The single ViT can jointly output detection scores and pixel-level masks while keeping most pre-trained weights frozen.
  • Similarity-guided augmentation supplies enough varied synthetic forgeries to train the detector without large external datasets.
  • The modular design separates the vision detection stage from the language explanation stage, allowing each to be updated independently.
  • The full pipeline produces the structured reports required by the GenText-Forensics Challenge.
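
The first bullet can be pictured as two small heads reading the same backbone tokens. The sketch below assumes a class token feeding a linear detection head and patch tokens feeding a linear mask head; the paper may use a different head design, and all names and weights here are hypothetical. The mask comes out at patch resolution, so a real system would still upsample it to pixels.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def joint_outputs(cls_token, patch_tokens, w_det, w_mask, grid):
    """Return (image-level forgery score, patch-level probability mask).

    cls_token:    feature vector summarising the whole image
    patch_tokens: one feature vector per patch, in row-major order
    w_det/w_mask: weights of the two (hypothetical) linear heads
    grid:         (rows, cols) layout of the patches
    """
    rows, cols = grid
    if len(patch_tokens) != rows * cols:
        raise ValueError("number of patch tokens must equal rows * cols")
    score = sigmoid(dot(cls_token, w_det))
    probs = [sigmoid(dot(p, w_mask)) for p in patch_tokens]
    mask = [probs[r * cols:(r + 1) * cols] for r in range(rows)]
    return score, mask


# Toy tokens: the first patch looks strongly forged, the second strongly clean.
toy_score, toy_mask = joint_outputs(
    cls_token=[0.0, 0.0],
    patch_tokens=[[10.0, 0.0], [-10.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    w_det=[1.0, 1.0],
    w_mask=[1.0, 0.0],
    grid=(2, 2),
)
```

Because both heads share one set of backbone features, only the low-rank adapters and the two heads would need training under this assumed layout.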

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular split between a lightweight detector and an iterative language stage could be tested on other image-manipulation tasks that need both localization and explanation.
  • If the proposer-evaluator loop generalizes, it could reduce reliance on hand-crafted prompts when turning model outputs into human-readable forensic text.
  • Because the vision model uses minimal adaptation, the approach might scale to larger pre-trained backbones without proportional increases in training cost.

Load-bearing premise

The proposer-evaluator loop inside the evolving harness reliably improves the quality of the natural-language forensic reports.

What would settle it

An external human evaluation of the generated reports on a new set of forgery images, scored independently of the challenge metric, would show whether the loop adds measurable quality.
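
One way such an evaluation could be scored, offered as an assumption rather than a protocol from the paper, is a paired comparison: raters score the report produced with and without the loop for each image, and an exact two-sided sign test checks whether wins outnumber losses by more than chance. Ratings in any real study would come from annotators; no data are implied here.

```python
from math import comb


def count_outcomes(with_loop, without_loop):
    """Count images where the looped report is rated higher (wins) or lower (losses); ties are dropped."""
    if len(with_loop) != len(without_loop):
        raise ValueError("ratings must be paired per image")
    wins = sum(a > b for a, b in zip(with_loop, without_loop))
    losses = sum(a < b for a, b in zip(with_loop, without_loop))
    return wins, losses


def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value under the null that wins and losses are equally likely."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value would suggest the loop changes rated quality in a consistent direction; it would not by itself say how large the improvement is, so effect sizes should be reported alongside it.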

Figures

Figures reproduced from arXiv: 2606.21138 by Haiwei Wu, Jiantao Zhou, Kahim Wong, Kemou Li, Yiming Chen.

Figure 1. Example with (a) the forged image, (b) its forgery […]
Figure 2. Overview of SEED’s three-stage forgery analysis pipeline. Stage 1 generates diverse synthetic forgeries using contrastive […]
Figure 3. Visual input to the MLLM in the harness stage.
Original abstract

AI-assisted image editing threatens trust in financial, legal, and identity records. The GenText-Forensics Challenge at ACM MM 2026 addresses this by requiring structured forensic reports that integrate detection, pixel-level localization, and natural-language explanation for multilingual, text-centric forged images. We present SEED, a modular system with three components. First, a similarity-guided pipeline augments training with diverse synthetic forgeries. Second, a single ViT, built on DINOv3 with LoRA adaptation, jointly performs detection and pixel-level localization while preserving pre-trained priors with minimal trainable parameters. Third, an evolving harness takes the detector's predictions and generates a complete forensic report via an MLLM, iteratively improved through a proposer-evaluator loop that optimizes report quality. SEED ranked 3rd in the GenText-Forensics Challenge. Code and data are available at https://github.com/KahimWong/GenText-Forensics-3rd-Place.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents SEED, a modular pipeline for explainable text forgery detection consisting of (1) a similarity-guided augmentation pipeline to create diverse synthetic forgeries, (2) a single LoRA-adapted ViT (DINOv3 backbone) that jointly performs binary detection and pixel-level localization while preserving pre-trained priors, and (3) an evolving MLLM harness that iteratively refines forensic reports via a proposer-evaluator loop. The system is reported to have placed 3rd in the GenText-Forensics Challenge at ACM MM 2026; code and data are released.

Significance. If the ranking can be attributed to the described components, the work would be significant for efficient, low-parameter forgery detection and localization in text-centric images together with the generation of structured natural-language explanations. The competition outcome supplies external validation of the end-to-end system, and public code release is a clear strength for reproducibility.

major comments (2)
  1. [Abstract] Abstract and experimental description: the central claim that the full pipeline (including the evolving harness) produced the 3rd-place ranking is asserted without any reported quantitative metrics, ablation tables, or error analysis that would isolate the contribution of each module.
  2. [Evolving harness description] Evolving harness section: the claim that the proposer-evaluator loop improves report quality rests exclusively on the challenge’s internal natural-language explanation scores; no ablation against a non-iterative MLLM baseline or external human forensic evaluation is provided, leaving the explanatory component of the title unsupported.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments highlighting the need for stronger isolation of component contributions. We agree that the current manuscript lacks explicit ablations and will revise to address this where feasible. Responses to each major comment follow.

  1. Referee: [Abstract] Abstract and experimental description: the central claim that the full pipeline (including the evolving harness) produced the 3rd-place ranking is asserted without any reported quantitative metrics, ablation tables, or error analysis that would isolate the contribution of each module.

    Authors: We agree that the manuscript does not report ablations or error analysis that isolate the contribution of the similarity-guided augmentation, the LoRA ViT detector, or the evolving harness to the final 3rd-place ranking. The ranking reflects end-to-end system performance in the challenge. In revision we will add ablation tables using the challenge metrics to quantify the effect of removing or replacing each module.
    revision: yes

  2. Referee: [Evolving harness description] Evolving harness section: the claim that the proposer-evaluator loop improves report quality rests exclusively on the challenge’s internal natural-language explanation scores; no ablation against a non-iterative MLLM baseline or external human forensic evaluation is provided, leaving the explanatory component of the title unsupported.

    Authors: The iterative loop's benefit is supported only by the challenge's internal explanation scores for the submitted system. No non-iterative MLLM ablation or external human evaluation appears in the manuscript. We will add a non-iterative MLLM baseline comparison in the revision. External human forensic evaluation was not performed and cannot be supplied without new annotation effort.
    revision: partial

standing simulated objections not resolved
  • External human forensic evaluation of the generated reports, which was outside the original challenge protocol and would require new resources.

Circularity Check

0 steps flagged

No circularity: external competition ranking and modular system description

The paper presents an engineering system (similarity-guided augmentation, LoRA-adapted ViT detector, evolving MLLM harness) whose primary claim is a 3rd-place ranking in the GenText-Forensics Challenge. This ranking is an external, independent outcome rather than an internally derived prediction. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the text. The harness improvement is asserted via challenge scoring, but this does not reduce to a self-definitional or constructionally forced result. The derivation chain is self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because only the abstract is available, no explicit free parameters, axioms, or invented entities can be extracted; the system relies on standard pre-trained models (DINOv3, MLLM) whose training details are external.

pith-pipeline@v0.9.1-grok · 5708 in / 1123 out tokens · 16946 ms · 2026-06-26T14:35:26.156727+00:00 · methodology

