SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3
The pith
Existing AI image detectors fail dramatically on scientific figures from modern generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that modern multimodal generators can now produce near-publishable scientific figures, but detection methods developed for open-domain images do not transfer to this setting. The SciFigDetect benchmark, built with an agent pipeline for data creation and review-driven refinement, covers multiple figure types and generation sources. Testing under zero-shot, cross-generator, and degraded conditions reveals dramatic failures, strong overfitting to individual generators, and fragility to typical corruptions, exposing a gap between current capabilities and the distribution of synthetic scholarly visuals.
What carries the argument
The agent-based data pipeline that retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidates, and filters them via a review-driven refinement loop to create aligned real-synthetic pairs.
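The review-driven refinement step can be pictured with a minimal sketch. This is not the authors' implementation: the `Review` record, the `synthesize` and `review` callables, the acceptance threshold, and the round limit are all hypothetical placeholders, and the earlier stages (retrieval, multimodal understanding, prompt building) are assumed to have already produced the initial prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    score: float    # hypothetical quality score in [0, 1]
    feedback: str   # reviewer notes used to revise the prompt


def refine(
    prompt: str,
    synthesize: Callable[[str], str],
    review: Callable[[str], Review],
    threshold: float = 0.8,  # assumed acceptance threshold
    max_rounds: int = 3,     # assumed retry budget
) -> Optional[str]:
    """Return the first candidate the reviewer accepts, or None if none passes."""
    for _ in range(max_rounds):
        candidate = synthesize(prompt)
        verdict = review(candidate)
        if verdict.score >= threshold:
            return candidate
        # Fold the reviewer feedback into the next prompt and try again.
        prompt = f"{prompt}\nRevise: {verdict.feedback}"
    return None


def stub_synthesize(prompt: str) -> str:
    # Stand-in generator: encodes how many revision notes the prompt carries.
    return f"candidate-v{prompt.count('Revise:')}"


def stub_review(candidate: str) -> Review:
    # Stand-in reviewer: accepts only a candidate produced after one revision.
    ok = candidate.endswith("v1")
    return Review(score=1.0 if ok else 0.2, feedback="fix axis labels")


accepted = refine("bar chart of accuracy per model", stub_synthesize, stub_review)
rejected = refine("bar chart of accuracy per model", stub_synthesize, stub_review, max_rounds=1)
```

With these stubs the first candidate is rejected and the second is accepted. With a budget of one round, nothing passes and the sample would be filtered out.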
If this is right
- Current detection methods will not reliably identify AI-generated scientific figures in zero-shot use across different tools.
- Generator-specific overfitting means practical detectors must handle multiple sources to avoid sharp performance drops.
- Fragility under common post-processing corruptions implies detectors may miss altered figures in published work.
- A dedicated benchmark for scientific figures is required to develop more robust forensic tools for scholarly content.
Where Pith is reading between the lines
- Detection models may benefit from incorporating scientific semantics and text-figure alignment as explicit features.
- Academic publishing platforms could adopt similar benchmarks for automated integrity checks on submitted visuals.
- The pipeline approach could extend to other structured academic elements such as tables or chemical diagrams.
- As generators advance, repeated benchmarking will be needed to track the evolving gap in detection performance.
Load-bearing premise
The synthetic figures produced by the agent pipeline accurately represent the distribution and post-processing conditions of real AI-generated scientific figures from diverse sources.
What would settle it
Evaluate the benchmarked detectors on a fresh set of AI-generated scientific figures that current generators such as GPT-4V or Claude produced directly for actual research papers, and measure whether performance matches the reported zero-shot and degradation results.
Original abstract
Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SciFigDetect, the first benchmark dataset and evaluation for detecting AI-generated scientific figures. It describes an agent-based pipeline that retrieves papers, performs multimodal analysis, generates structured prompts, synthesizes figures via multiple generators, and applies review-driven refinement to produce aligned real-synthetic pairs across figure categories. Experiments benchmark existing detectors in zero-shot transfer, cross-generator, and post-processing corruption settings, reporting dramatic failures, generator-specific overfitting, and fragility to common degradations.
Significance. If the synthetic figures are representative of real AI-generated scientific figures, the benchmark would usefully expose a gap in current AIGI detectors for structured, text-dense scholarly imagery and provide a public dataset to drive future work on scientific-figure forensics. The work is empirical benchmark construction with no machine-checked proofs or parameter-free derivations, but the release of the dataset and the reported performance gaps constitute a concrete starting point for the community.
major comments (3)
- [Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.
- [Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.
- [Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.
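A check of the kind the first major comment asks for could be sketched as a two-sample Kolmogorov-Smirnov statistic on one scalar figure statistic, such as text density. The sketch below is pure Python and only illustrates the idea. The feature choice and the sample values are invented for illustration and do not come from the paper.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # occurs at one of the observed values.
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Invented values standing in for a per-figure statistic.
pipeline_figures = [1.0, 2.0, 3.0, 4.0]
held_out_figures = [3.0, 4.0, 5.0, 6.0]
d_shifted = ks_statistic(pipeline_figures, held_out_figures)
d_same = ks_statistic(pipeline_figures, pipeline_figures)
```

A statistic near 0 means the two samples are distributed similarly on that feature, and a value near 1 means they barely overlap. A significance level would additionally need a p-value, which `scipy.stats.ks_2samp` provides.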
minor comments (2)
- [Abstract / Introduction] The abstract and introduction would benefit from a clearer statement of the exact number of figures, generators, and categories in the released benchmark.
- [Figures and tables] Figure captions and table headers should explicitly indicate whether reported metrics are averaged over all generators or broken down per generator.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We have carefully considered each major comment and provide our responses below. We commit to revising the manuscript accordingly where appropriate.
- Referee: [Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.
  Authors: We agree with the referee that direct quantitative validation against real AI-generated figures would strengthen the benchmark's validity. However, as this is the first such benchmark, a comprehensive held-out set of verified real AI-generated scientific figures from sources like arXiv was not available for comparison. Our pipeline is designed to produce realistic pairs by starting from real papers and using agent-based multimodal analysis to generate aligned synthetics. We include extensive qualitative examples and category coverage in the manuscript. In the revised version, we will add basic quantitative statistics (e.g., text density, layout metrics) for both real and synthetic figures from the source papers and discuss the limitations of the construction process. We will also tone down the claim of exact distributional match to 'closely approximates real-world scientific figure generation scenarios.'
  Revision: partial
- Referee: [Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.
  Authors: We apologize for the lack of clarity in the manuscript. The zero-shot evaluation uses detectors pre-trained on existing AIGI benchmarks (such as those for natural images) and tests them directly on our SciFigDetect dataset without any fine-tuning or exposure to our data, ensuring no overlap with training sets of those detectors. For cross-generator, we use leave-one-generator-out protocols. We will revise the Experiments section to explicitly describe the split construction, confirm no paper or generator overlap in the relevant settings, and include detailed per-generator performance tables to support the overfitting observations.
  Revision: yes
- Referee: [Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.
  Authors: The referee correctly notes that the corruptions were chosen based on standard practices in the AIGI detection literature (e.g., JPEG compression levels, resizing factors commonly used in robustness tests) rather than calibrated specifically to observed distributions in scientific figures. This is a limitation. In the revision, we will add a discussion justifying the choice of parameters with references to prior work and, if feasible, include an analysis of typical post-processing in arXiv figures. We will also clarify that the results demonstrate fragility to these common degradations, which are relevant even if not perfectly calibrated.
  Revision: partial
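The leave-one-generator-out protocol mentioned in the second response can be illustrated with a short sketch. It assumes each synthetic figure is tagged with the generator that produced it. The `(figure_id, generator_name)` layout, the identifiers, and the generator names are hypothetical, and the sketch says nothing about how the authors actually divide real figures or source papers.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, str]  # (figure_id, generator_name); hypothetical layout


def leave_one_generator_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_generator, train_ids, test_ids), one fold per generator."""
    by_gen: Dict[str, List[str]] = defaultdict(list)
    for fig_id, gen in samples:
        by_gen[gen].append(fig_id)
    for held_out in sorted(by_gen):
        # Train on every generator except the held-out one.
        train = [f for g in sorted(by_gen) if g != held_out for f in by_gen[g]]
        yield held_out, train, list(by_gen[held_out])


# Invented identifiers for illustration.
samples = [("f1", "gen_a"), ("f2", "gen_b"), ("f3", "gen_a"), ("f4", "gen_c")]
folds = list(leave_one_generator_out(samples))
```

Each fold tests on figures from a generator the detector never saw in training, so a sharp drop relative to in-distribution accuracy is what generator-specific overfitting would look like in a per-generator table.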
Circularity Check
No circularity: purely empirical benchmark construction and external evaluation
The paper describes an agent-based pipeline to generate a new benchmark dataset of real-synthetic scientific figure pairs and then reports empirical performance of existing external detectors under zero-shot, cross-generator, and corruption settings. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text. The central claims (detector failure modes) are direct measurements on held-out data rather than quantities derived from the pipeline outputs by construction. The skeptic concern about distribution match is a validity question, not a circularity reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multimodal understanding of paper text and figures can produce structured prompts that yield realistic scientific figures when fed to generators.