pith. machine review for the scientific record.

arxiv: 2605.08820 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI · cs.CR

Recognition: no theorem link

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CR
keywords AI-generated image detection · refund fraud · multimodal benchmark · e-commerce fraud · image forgery detection · damage evidence · synthetic images · claim verification

The pith

Current multimodal AI models struggle to detect AI-generated images used as fake damage evidence in refund claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds FraudBench to address AI-generated images used to support false refund claims about damaged products in e-commerce and similar services. It collects real review images, labels them as genuinely damaged or undamaged with MLLM assistance and human annotation, then synthesizes fake-damaged versions with six different generation models. Tests on this data find that multimodal large language models recognize real damage but miss many fakes, often performing worse than a coin flip. Specialized detectors do better but still vary by generator and sometimes mistake real images for fakes. The result shows why generic fake-image tools fall short of verifying specific claims about product condition.

Core claim

FraudBench provides a multimodal benchmark consisting of real user-review evidence from e-commerce, food delivery, and travel services, with genuine damaged and undamaged labels established through MLLM-assisted filtering and human annotation, alongside fake-damaged evidence synthesized using six state-of-the-art image models. Evaluations under this benchmark demonstrate that current multimodal large language models recognize real-damaged evidence but exhibit true positive rates far below the 50% baseline on most fake-damaged subsets generated by different models.
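
To make the headline metric concrete, here is a minimal sketch of the per-generator true-positive-rate computation the claim rests on. The field names (`generator`, `label`, `pred`) are illustrative stand-ins, not the paper's actual schema; the paper's own protocol is described in its Section 3.3.

```python
from collections import defaultdict

def tpr_by_generator(samples):
    """Detection rate (TPR) on fake-damaged evidence, split by generator.

    samples: iterable of dicts with illustrative keys 'generator',
    'label' ('fake' or 'real'), and 'pred' (the detector's verdict).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        if s["label"] != "fake":
            continue  # TPR here counts only fake-damaged items
        totals[s["generator"]] += 1
        hits[s["generator"]] += int(s["pred"] == "fake")
    return {g: hits[g] / totals[g] for g in totals}

# Toy run: flag generator subsets where detection falls below the 50% baseline.
toy = [
    {"generator": "gen_a", "label": "fake", "pred": "fake"},
    {"generator": "gen_a", "label": "fake", "pred": "real"},
    {"generator": "gen_b", "label": "fake", "pred": "real"},
    {"generator": "gen_b", "label": "real", "pred": "real"},  # ignored for TPR
]
below_baseline = {g: r for g, r in tpr_by_generator(toy).items() if r < 0.5}
```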

What carries the argument

The FraudBench construction pipeline, which pairs real evidence images and metadata with both authentic damage labels and synthetic fake-damage variants from multiple generators to enable claim-conditioned detection testing.
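
As a reading aid, a minimal sketch of what one claim-conditioned benchmark record might look like. The field names here are hypothetical (the paper's real record schemas are shown in its Figures 4-8), but they capture the pairing the pipeline produces: each image travels with the claim and metadata a verifier must judge it against.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvidenceRecord:
    """One benchmark item; field names are hypothetical, not the paper's schema."""
    image_path: str
    claim_text: str                   # the refund claim the image supports
    product_metadata: dict            # e.g. category, item description
    label: str                        # "real_damaged", "real_undamaged", "fake_damaged"
    generator: Optional[str] = None   # set only for synthetic fake-damaged items

def detector_input(rec: EvidenceRecord) -> dict:
    # Claim-conditioned detection: the verifier judges the image against the
    # claim and its context, not image authenticity in isolation.
    return {"image": rec.image_path, "claim": rec.claim_text,
            "context": rec.product_metadata}
```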

If this is right

  • Multimodal large language models often fail to detect fake-damaged evidence, with true positive rates below the 50% baseline on most generator subsets.
  • Specialized detectors achieve better performance but remain inconsistent across generators and produce false positives on real-damaged samples.
  • Current methods leave a gap in reliable claim-conditioned verification of refund evidence.
  • Human participants provide a performance reference under identical evaluation settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Platforms handling refunds could integrate claim-specific checks that go beyond general image authenticity tools to reduce fraud losses.
  • Detector development should prioritize robustness testing against diverse image synthesis techniques to close the observed performance gaps.
  • The benchmark could be extended to other claim types like insurance or warranty disputes where visual evidence is submitted.

Load-bearing premise

The MLLM-assisted filtering and human annotation process yields accurate ground-truth labels for genuinely damaged versus undamaged evidence, and the six chosen generators adequately represent the variety of AI-generated images used in actual refund fraud.

What would settle it

Running the evaluated detectors on images produced by an additional generation model not used in the original benchmark, and checking whether detection rates remain low, would test whether the observed failure pattern generalizes beyond the six generators studied.
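
One way to make "remain low" precise on a held-out generator: a percentile-bootstrap confidence interval on the new subset's TPR, checking whether the interval sits entirely below the 50% baseline. This is an editorial sketch, not a procedure from the paper.

```python
import random

def bootstrap_tpr_ci(hits, n_boot=10_000, alpha=0.05, seed=0):
    """hits: list of 0/1 detector outcomes on fake images from a held-out
    generator (1 = correctly flagged). Returns a percentile bootstrap CI."""
    rng = random.Random(seed)
    stats = sorted(
        sum(rng.choices(hits, k=len(hits))) / len(hits) for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# If the upper bound is still below 0.5, the failure pattern has carried over
# to a generator the benchmark never used.
lo, hi = bootstrap_tpr_ci([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
```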

Figures

Figures reproduced from arXiv: 2605.08820 by Boyang Chen, Hong Xi Tae, Jiaming Zhang, Lei Xiao, Longtao Huang, Pengjun Xie, Tiantong Wang, Tiantong Wu, Wei Liu, Wei Yang Bryan Lim, Xinyu Yan, Yachun Mi, Yichen He, Yilei Zhao, Yurong Hao.

Figure 1: Construction pipeline of our proposed FRAUDBENCH. The benchmark is constructed in three stages: (1) Data Collection: We collect real-world refund-related review data from four data sources and consolidate them into 29 categories. (2) Data Preprocessing: We perform multimodal aggregation, two-level cleaning, rule-based filtering, representative sampling, and anonymization, with screening and human verificat… view at source ↗
Figure 2: Representative benchmark examples in FRAUDBENCH. We illustrate examples covering the main evaluation settings: input modality, contextual information, multi-step reasoning, and prompt sensitivity. Text prompts, rationales, and intermediate judgments are abbreviated for visualization. view at source ↗
Figure 3: Top-level directory structure of FRAUDBENCH, illustrated for a single product or service category. The structure is identical across all 29 categories. view at source ↗
Figure 4: Structure of an Amazon user review record. view at source ↗
Figure 5: Structure of an Amazon item metadata record. view at source ↗
Figure 6: Structure of an enriched Amazon review-level metadata record after matching a retained… view at source ↗
Figure 7: Structure of a retained Trip.com user review record. view at source ↗
Figure 8: Structure of a retained GrabFood user review record. view at source ↗
Figure 9: System prompt used for MLLM-based image relevance filtering of Trip.com hotel review… view at source ↗
Figure 10: System prompts used for MLLM-based filtering of GrabFood positive review media… view at source ↗
Figure 11: System prompts used for MLLM-based filtering of GrabFood negative review media… view at source ↗
Figure 12: System prompt used for MLLM-based damage pattern analysis of Amazon e-commerce… view at source ↗
Figure 13: System prompt used for MLLM-based damage pattern analysis of Trip.com hotel review… view at source ↗
Figure 14: System prompt used for MLLM-based damage pattern analysis of GrabFood food delivery… view at source ↗
Figure 15: Shared system prompt used to establish the forensic-analysis role for MLLM-based… view at source ↗
Figure 17: Review-text augmentation inserted between the image-analysis instruction and the JSON… view at source ↗
Figure 18: Multi-image prompt used when all images from the same review are submitted in one user… view at source ↗
Figure 19: Continuation prompt used after each non-final image in the multi-step setting. view at source ↗
Figure 20: Final prompt used to obtain the consolidated verdict in the multi-step setting. view at source ↗
Figure 21: Baseline prompt used in the prompt-sensitivity study. view at source ↗
Figure 22: Merged prompt used in the prompt-sensitivity study. The system and user instructions are… view at source ↗
Figure 23: No Checklist prompt used in the prompt-sensitivity study. view at source ↗
Figure 24: Generic prompt used in the prompt-sensitivity study. view at source ↗
Figure 25: Minimal prompt used in the prompt-sensitivity study. view at source ↗
read the original abstract

Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. It curates real-world user-review images and metadata from e-commerce, food delivery, and travel scenarios. Genuine damaged and undamaged evidence is identified using MLLM-assisted filtering and human annotation, while fake-damaged evidence is synthesized from undamaged references via six state-of-the-art image generation and editing models. Evaluations of MLLMs, specialized detectors, and humans show MLLMs recognize real-damaged evidence well but have low true positive rates (often below 50%) on fake-damaged subsets, with specialized detectors performing better but inconsistently and prone to false positives on real samples.

Significance. If the ground-truth construction is validated, the benchmark is significant for exposing practical gaps in MLLMs on claim-conditioned fraud detection tasks with direct relevance to e-commerce security. The multimodal setup (images plus review metadata), consistent cross-model evaluation protocol, and inclusion of human baselines provide a reusable resource that could guide development of more robust verification systems. The paper's empirical demonstration of TPR disparities across generators is a concrete contribution.

major comments (2)
  1. [Section 3] The central experimental claims—that MLLMs achieve high recognition on real-damaged evidence but TPR far below the 50% baseline on most fake-damaged generator subsets—rest on the correctness of the real-evidence labels. Section 3 describes the MLLM-assisted filtering followed by human annotation but supplies no inter-annotator agreement statistic (e.g., Cohen’s or Fleiss’ kappa), number of annotators, disagreement-resolution protocol, or external validation against domain experts. Without these, label noise cannot be ruled out as a confounder for the reported performance gaps.
  2. [Section 4] No quantitative checks for synthetic-image realism or distribution shift between real and generated damaged evidence are reported. Section 4 (evaluation) and the construction pipeline omit metrics such as perceptual similarity scores, human realism ratings, or statistical tests confirming that the six generators produce evidence representative of real-world refund fraud distributions.
minor comments (3)
  1. The abstract refers to “six state-of-the-art image editing and generation models” without naming them; listing the specific models (with versions) in Section 3.1 would improve reproducibility.
  2. A summary table of dataset statistics (counts of real-damaged, real-undamaged, and synthetic samples per scenario and generator) is missing and would aid readers in assessing scale and balance.
  3. Figure captions could more explicitly indicate which generator produced each synthetic example panel to facilitate direct comparison with the quantitative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review, the positive assessment of FraudBench's significance for exposing gaps in claim-conditioned fraud detection, and the recommendation for major revision. We agree that greater transparency on ground-truth construction and synthetic-image validation will strengthen the manuscript. Below we respond point by point to the major comments and commit to the corresponding revisions.

read point-by-point responses
  1. Referee: [Section 3] The central experimental claims—that MLLMs achieve high recognition on real-damaged evidence but TPR far below the 50% baseline on most fake-damaged generator subsets—rest on the correctness of the real-evidence labels. Section 3 describes the MLLM-assisted filtering followed by human annotation but supplies no inter-annotator agreement statistic (e.g., Cohen’s or Fleiss’ kappa), number of annotators, disagreement-resolution protocol, or external validation against domain experts. Without these, label noise cannot be ruled out as a confounder for the reported performance gaps.

    Authors: We agree that these details are essential for readers to assess potential label noise. The revised manuscript will expand Section 3 to report the number of human annotators, the inter-annotator agreement statistic (Fleiss’ kappa), the disagreement-resolution protocol (majority vote after discussion), and any available steps toward external validation. These additions will directly address the concern that label noise could confound the reported TPR gaps between real- and fake-damaged evidence; a minimal kappa computation is sketched after these responses. revision: yes

  2. Referee: [Section 4] No quantitative checks for synthetic-image realism or distribution shift between real and generated damaged evidence are reported. Section 4 (evaluation) and the construction pipeline omit metrics such as perceptual similarity scores, human realism ratings, or statistical tests confirming that the six generators produce evidence representative of real-world refund fraud distributions.

    Authors: We acknowledge that the current manuscript lacks quantitative realism and distribution-shift checks. In the revised version we will add these analyses to the construction pipeline and Section 4, including perceptual similarity scores (e.g., LPIPS, SSIM) between real and generated damaged images, human realism ratings collected from a controlled study, and statistical tests comparing feature distributions. This will provide evidence that the six generators produce samples representative of real-world refund fraud. revision: yes
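
Response 1 above commits to reporting Fleiss’ kappa. For readers unfamiliar with the statistic, a minimal self-contained computation is sketched below; the annotator counts and labels are invented for illustration and are not numbers from the paper.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of per-item label lists, same number of raters per item."""
    n = len(ratings[0])            # raters per item
    N = len(ratings)               # number of items
    cats = sorted({c for item in ratings for c in item})
    # n_ij: how many raters assigned item i to category j
    table = [[Counter(item)[c] for c in cats] for item in ratings]
    P_i = [(sum(x * x for x in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N                                   # observed agreement
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(len(cats))]
    P_e = sum(p * p for p in p_j)                          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Example: 3 hypothetical annotators labeling 4 evidence images.
print(fleiss_kappa([
    ["damaged", "damaged", "damaged"],
    ["damaged", "damaged", "undamaged"],
    ["undamaged", "undamaged", "undamaged"],
    ["damaged", "undamaged", "undamaged"],
]))  # ~0.33: moderate-to-fair agreement on this toy data
```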

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper constructs FraudBench via real-world data curation, MLLM-assisted filtering plus human annotation for labels, synthesis of fake-damaged images using six external generators, and then evaluates MLLMs, detectors, and humans on the resulting dataset. No equations, parameter fitting, predictions, or derivation chain exist that could reduce to inputs by construction. No self-citations are load-bearing for any central claim, and the work contains no ansatz, uniqueness theorems, or renaming of known results. The central claims rest on external model performance measurements against the curated benchmark, making the study self-contained as standard empirical research without circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark's validity depends on two domain assumptions about data quality; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption MLLM-assisted filtering combined with human annotation yields accurate ground-truth labels separating genuine damaged from undamaged evidence
    Invoked in the curation step that defines the real-damaged and real-undamaged subsets.
  • domain assumption Images produced by the six state-of-the-art editing and generation models are representative of realistic AI-generated fraudulent refund evidence
    Basis for creating the fake-damaged subsets used in all detection experiments.

pith-pipeline@v0.9.0 · 5630 in / 1438 out tokens · 58665 ms · 2026-05-12T02:05:27.589303+00:00 · methodology

