SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

Changfa Mo; Chenzhuo Zhao; Haotian Liu; Xiaobai Li; You Hu

arxiv: 2604.08211 · v1 · submitted 2026-04-09 · 💻 cs.CV

Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords: AI-generated images · scientific figure detection · detection benchmark · zero-shot transfer · image forensics · multimodal generators · research integrity · post-processing robustness

The pith

Existing AI image detectors fail dramatically on scientific figures from modern generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the first benchmark for detecting AI-generated scientific figures, which are structured and text-dense unlike natural images. It builds the dataset via an agent-based pipeline that analyzes real papers, creates structured prompts, synthesizes figures, and refines them through review to produce matched real-synthetic pairs across categories and sources. Benchmark evaluations demonstrate that current detectors perform poorly in zero-shot transfer, overfit to specific generators, and degrade under common post-processing like compression or cropping. A reader would care because these high-quality fakes threaten the integrity of scholarly publishing, and the work shows existing tools are not ready for this emerging threat.

Core claim

The paper establishes that modern multimodal generators can now produce near-publishable scientific figures, but detection methods developed for open-domain images do not transfer to this setting. The SciFigDetect benchmark, built with an agent pipeline for data creation and review-driven refinement, covers multiple figure types and generation sources. Testing under zero-shot, cross-generator, and degraded conditions reveals dramatic failures, strong overfitting to individual generators, and fragility to typical corruptions, exposing a gap between current capabilities and the distribution of synthetic scholarly visuals.

What carries the argument

The agent-based data pipeline that retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidates, and filters them via a review-driven refinement loop to create aligned real-synthetic pairs.
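
The review-driven refinement loop at the end of this pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `refine_figure`, `generate`, `review`, the `Review` fields, and the `max_rounds` cap are hypothetical names introduced here, and the source does not say how many rounds are run or how reviewer feedback is folded back into the prompt.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    accepted: bool  # did the reviewer judge the candidate acceptable?
    feedback: str   # free-text critique used to revise the prompt


def refine_figure(
    prompt: str,
    generate: Callable[[str], str],
    review: Callable[[str], Review],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate a candidate, review it, and retry with the feedback appended.

    Returns the first accepted candidate, or None if every round is rejected
    (the sample is then filtered out).
    """
    current_prompt = prompt
    for _ in range(max_rounds):
        candidate = generate(current_prompt)
        verdict = review(candidate)
        if verdict.accepted:
            return candidate
        # Hypothetical revision rule: append the critique to the prompt.
        current_prompt = f"{current_prompt}\nReviewer feedback: {verdict.feedback}"
    return None


# Toy stand-ins so the sketch runs without any model access.
def toy_generate(p: str) -> str:
    return f"figure<{p.count('Reviewer feedback')} revisions>"


def toy_review(candidate: str) -> Review:
    return Review(accepted="2 revisions" in candidate,
                  feedback="axis labels unreadable")


result = refine_figure("bar chart of accuracy per category",
                       toy_generate, toy_review)
print(result)
```

In a real pipeline the two callables would wrap an image generator and a multimodal reviewer; keeping them as injected functions makes the filtering logic testable on its own.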

If this is right

Current detection methods will not reliably identify AI-generated scientific figures in zero-shot use across different tools.
Generator-specific overfitting means practical detectors must handle multiple sources to avoid sharp performance drops.
Fragility under common post-processing corruptions implies detectors may miss altered figures in published work.
A dedicated benchmark for scientific figures is required to develop more robust forensic tools for scholarly content.
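
One way to make "generator-specific overfitting" concrete is to compare accuracy on the training generator with accuracy on the others. The sketch below assumes a train-by-test accuracy table; the generator names (`gen_a`, `gen_b`, `gen_c`) and every number in it are placeholders invented for illustration, not results from the paper, and the gap definition is one plausible choice rather than the metric the authors report.

```python
from statistics import mean

# acc[train_generator][test_generator]
# Placeholder values for illustration only; NOT results from the paper.
acc = {
    "gen_a": {"gen_a": 0.98, "gen_b": 0.61, "gen_c": 0.55},
    "gen_b": {"gen_a": 0.58, "gen_b": 0.97, "gen_c": 0.64},
    "gen_c": {"gen_a": 0.52, "gen_b": 0.60, "gen_c": 0.99},
}


def generalization_gap(table):
    """Mean same-generator accuracy minus mean cross-generator accuracy."""
    in_domain = [table[g][g] for g in table]
    cross = [table[tr][te] for tr in table for te in table[tr] if te != tr]
    return mean(in_domain) - mean(cross)


gap = generalization_gap(acc)
print(f"generalization gap: {gap:.3f}")
```

A gap near zero would indicate that a detector transfers across generators; a large positive gap is the overfitting pattern described above.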

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Detection models may benefit from incorporating scientific semantics and text-figure alignment as explicit features.
Academic publishing platforms could adopt similar benchmarks for automated integrity checks on submitted visuals.
The pipeline approach could extend to other structured academic elements such as tables or chemical diagrams.
As generators advance, repeated benchmarking will be needed to track the evolving gap in detection performance.

Load-bearing premise

The synthetic figures produced by the agent pipeline accurately represent the distribution and post-processing conditions of real AI-generated scientific figures from diverse sources.

What would settle it

Test the benchmarked detectors on a fresh set of AI-generated scientific figures that were produced directly by current generators like GPT-4V or Claude for actual research papers, and measure whether performance matches the reported zero-shot and degradation results.

Figures

Figures reproduced from arXiv: 2604.08211 by Changfa Mo, Chenzhuo Zhao, Haotian Liu, Xiaobai Li, You Hu.

**Figure 2.** Overview of the data construction pipeline. From licensed source papers and figure-related context, our framework …

**Figure 3.** Dataset statistics. Left: per-category sample counts …

**Figure 4.** Cross-generator generalization gap. Averaged across …

Original abstract

Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper builds the first benchmark for AI-generated scientific figures and shows detectors fail on them, but the synthetic data's closeness to real cases is unverified.

The main thing here is that they created SciFigDetect, the first benchmark aimed at AI-generated scientific figures instead of generic images. Their agent pipeline pulls papers, does multimodal analysis, makes structured prompts, generates figures, and refines them through reviews. This gives aligned real-synthetic pairs and lets them test detectors in zero-shot, cross-generator, and corrupted settings, where performance drops sharply and overfitting shows up clearly.

Referee Report

3 major / 2 minor

Summary. The paper introduces SciFigDetect, the first benchmark dataset and evaluation for detecting AI-generated scientific figures. It describes an agent-based pipeline that retrieves papers, performs multimodal analysis, generates structured prompts, synthesizes figures via multiple generators, and applies review-driven refinement to produce aligned real-synthetic pairs across figure categories. Experiments benchmark existing detectors in zero-shot transfer, cross-generator, and post-processing corruption settings, reporting dramatic failures, generator-specific overfitting, and fragility to common degradations.

Significance. If the synthetic figures are representative of real AI-generated scientific figures, the benchmark would usefully expose a gap in current AIGI detectors for structured, text-dense scholarly imagery and provide a public dataset to drive future work on scientific-figure forensics. The work is empirical benchmark construction with no machine-checked proofs or parameter-free derivations, but the release of the dataset and the reported performance gaps constitute a concrete starting point for the community.

major comments (3)

[Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.
[Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.
[Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.
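
The kind of quantitative validation the first comment asks for could start with a two-sample comparison of a simple figure statistic. The sketch below computes a two-sample Kolmogorov-Smirnov statistic on text density. It is only an illustration of the referee's suggestion: the `ks_statistic` helper, the choice of text density as the feature, and the sample values are assumptions made here, and the values are placeholders rather than measurements from the benchmark.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The maximum gap is always attained at one of the observed values.
    for p in sorted(set(xs) | set(ys)):
        cdf_x = bisect_right(xs, p) / len(xs)
        cdf_y = bisect_right(ys, p) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Placeholder text-density values (fraction of figure area covered by text);
# invented for illustration, NOT measured from the benchmark.
pipeline_synthetic = [0.10, 0.12, 0.15, 0.20, 0.22]
in_the_wild_ai = [0.18, 0.21, 0.25, 0.27, 0.30]

d = ks_statistic(pipeline_synthetic, in_the_wild_ai)
print(f"KS statistic: {d:.2f}")
```

A statistic close to 0 means the two samples look alike on this feature; a value close to 1 means they barely overlap. For a p-value, `scipy.stats.ks_2samp` provides the same statistic together with a significance test.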

minor comments (2)

[Abstract / Introduction] The abstract and introduction would benefit from a clearer statement of the exact number of figures, generators, and categories in the released benchmark.
[Figures and tables] Figure captions and table headers should explicitly indicate whether reported metrics are averaged over all generators or broken down per generator.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We have carefully considered each major comment and provide our responses below. We commit to revising the manuscript accordingly where appropriate.

Referee: [Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.

Authors: We agree with the referee that direct quantitative validation against real AI-generated figures would strengthen the benchmark's validity. However, as this is the first such benchmark, a comprehensive held-out set of verified real AI-generated scientific figures from sources like arXiv was not available for comparison. Our pipeline is designed to produce realistic pairs by starting from real papers and using agent-based multimodal analysis to generate aligned synthetics. We include extensive qualitative examples and category coverage in the manuscript. In the revised version, we will add basic quantitative statistics (e.g., text density, layout metrics) for both real and synthetic figures from the source papers and discuss the limitations of the construction process. We will also tone down the claim of exact distributional match to "closely approximates real-world scientific figure generation scenarios."
revision: partial

Referee: [Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.

Authors: We apologize for the lack of clarity in the manuscript. The zero-shot evaluation uses detectors pre-trained on existing AIGI benchmarks (such as those for natural images) and tests them directly on our SciFigDetect dataset without any fine-tuning or exposure to our data, ensuring no overlap with training sets of those detectors. For cross-generator, we use leave-one-generator-out protocols. We will revise the Experiments section to explicitly describe the split construction, confirm no paper or generator overlap in the relevant settings, and include detailed per-generator performance tables to support the overfitting observations.
revision: yes

Referee: [Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.

Authors: The referee correctly notes that the corruptions were chosen based on standard practices in the AIGI detection literature (e.g., JPEG compression levels, resizing factors commonly used in robustness tests) rather than calibrated specifically to observed distributions in scientific figures. This is a limitation. In the revision, we will add a discussion justifying the choice of parameters with references to prior work and, if feasible, include an analysis of typical post-processing in arXiv figures. We will also clarify that the results demonstrate fragility to these common degradations, which are relevant even if not perfectly calibrated. revision: partial
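
The leave-one-generator-out protocol mentioned in the second response can be sketched as below. This is a minimal illustration under assumptions: the `leave_one_generator_out` helper, the `(figure_id, generator_name)` sample layout, and the generator names are hypothetical, and the handling of real figures (which would also need to be split, for example by source paper) is omitted.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, str]  # (figure_id, generator_name); hypothetical layout


def leave_one_generator_out(
    synthetic: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_generator, train_samples, test_samples) per fold."""
    generators = sorted({gen for _, gen in synthetic})
    for held_out in generators:
        train = [s for s in synthetic if s[1] != held_out]
        test = [s for s in synthetic if s[1] == held_out]
        yield held_out, train, test


samples = [
    ("fig001", "gen_a"), ("fig002", "gen_a"),
    ("fig003", "gen_b"),
    ("fig004", "gen_c"), ("fig005", "gen_c"),
]

folds = list(leave_one_generator_out(samples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Each fold trains on every generator except one and tests only on the held-out generator, so no generator appears on both sides of a split.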

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and external evaluation

The paper describes an agent-based pipeline that generates a new benchmark dataset of real-synthetic scientific figure pairs, then reports the empirical performance of existing external detectors under zero-shot, cross-generator, and corruption settings. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text. The central claims (detector failure modes) are direct measurements on held-out data rather than quantities derived from the pipeline outputs by construction. The skeptic's concern about distribution match is a question of validity, not of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on domain assumptions about pipeline fidelity rather than new free parameters or invented entities.

axioms (1)

Domain assumption: Multimodal understanding of paper text and figures can produce structured prompts that yield realistic scientific figures when fed to generators.
This underpins the entire data construction pipeline described in the abstract.


[1] [1]

American Association for the Advancement of Science. [n. d.]. Science Jour- nals: Editorial Policies. https://www.science.org/content/page/science-journals- editorial-policies

work page

[2] [2]

Jordan J Bird and Ahmad Lotfi. 2024. Cifake: Image classification and explainable identification of ai-generated synthetic images.IEEE Access12 (2024), 15642– 15650

work page 2024

[3] [3]

Cell Press. [n. d.]. Figure guidelines. https://www.cell.com/information-for- authors/figure-guidelines

work page

[4] [4]

Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. 2020. What makes fake images detectable? understanding properties that generalize. InEuropean Conference on Computer Vision. Springer, 103–120

work page 2020

[5] [5]

Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K Jain. 2020. On the detection of digital face manipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5781–5790

work page 2020

[6] [6]

Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis.Advances in Neural Information Processing Systems34 (2021), 8780–8794

work page 2021

[7] [7]

Ricard Durall, Margret Keuper, and Janis Keuper. 2020. Watch your up- convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7890–7899

work page 2020

[8] [8]

Euro-Par 2026. [n. d.]. Euro-Par 2026: 32nd International European Conference on Parallel and Distributed Computing. https://easychair.org/cfp/Euro-Par2026

work page 2026

[9] [9]

EuroGNC Conference. [n. d.]. EuroGNC AI Policy. https://eurognc.ceas.org/ai- policy/

work page

[10] [10]

Apurva Gandhi and Shomik Jain. 2020. Adversarial perturbations fool deepfake detectors. InInternational Joint Conference on Neural Networks. IEEE, 1–8

work page 2020

[11] [11]

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde- Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets.Advances in Neural Information Processing Systems27 (2014)

work page 2014

[12] [12]

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. 2022. Vector quantized diffusion model for text-to-image synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10696–10706

work page 2022

[13] [13]

Yinan He, Bei Gan, Siyu Chen, Yichun Zhou, Guojun Yin, Luchuan Song, Lu Sheng, Jing Shao, and Ziwei Liu. 2021. Forgerynet: A versatile benchmark for comprehensive forgery analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4360–4369

work page 2021

[14] [14]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems33 (2020), 6840–6851

work page 2020

[15] [15]

Ting-Yao Hsu, C Lee Giles, and Ting-Hao Huang. 2021. SciCap: Generating captions for scientific figures. InFindings of the Association for Computational Linguistics: EMNLP 2021. 3258–3264

work page 2021

[16] [16]

Siyuan Huang, Yutong Gao, Juyang Bai, Yifan Zhou, Zi Yin, Xinxin Liu, Rama Chellappa, Chun Pong Lau, Sayan Nag, Cheng Peng, and Shraman Pramanick. 2026. SciFig: Towards Automating Scientific Figure Generation. arXiv:2601.04390 [cs.AI] https://arxiv.org/abs/2601.04390

work page arXiv 2026

[17] [17]

Yonghyun Jeong, Doyeon Kim, Seungjai Min, Seongho Joe, Youngjune Gwon, and Jongwon Choi. 2022. Bihpf: Bilateral high-pass filters for robust deepfake detection. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 48–57

work page 2022

[18] [18]

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of gans for improved quality, stability, and variation. InInternational Conference on Learning Representations

work page 2018

[19] [19]

Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator ar- chitecture for generative adversarial networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4401–4410

work page 2019

[20] [20]

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of stylegan. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8110–8119

[21]

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. 2024. Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models. arXiv:2403.00231 [cs.CV] https://arxiv.org/abs/2403.00231

[22]

Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, and William Yang Wang. 2025. MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding. arXiv:2407.04903

[23]

Zhen Lin, Qiujie Xie, Minjun Zhu, Shichen Li, Qiyao Sun, Enhao Gu, Yiran Ding, Ke Sun, Fang Guo, Panzhong Lu, Zhiyuan Ning, Yixuan Weng, and Yue Zhang. 2026. AutoFigure-Edit: Generating Editable Scientific Illustration. arXiv:2603.06674 [cs.CV] https://arxiv.org/abs/2603.06674

[24]

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. 2024. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10770–10780

[25]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36 (2023), 46534–46594

[27]

Ishani Mondal, Zongxia Li, Yufang Hou, Anandhavelu Natarajan, Aparna Garimella, and Jordan Lee Boyd-Graber. 2024. SciDoc2Diagrammer-MAF: Towards generation of scientific diagrams from documents guided by multi-aspect feedback refinement. In Findings of the Association for Computational Linguistics: EMNLP 2024. 13342–13375

[28]

Nature Portfolio. [n. d.]. Artificial Intelligence (AI). https://www.nature.com/nature-portfolio/editorial-policies/ai

[29]

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning. PMLR, 16784–16804

[30]

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24480–24489

[31]

OpenAI. [n. d.]. GPT Image 1. https://developers.openai.com/api/docs/models/gpt-image-1

[32]

OpenAI. 2022. Introducing ChatGPT. https://openai.com/index/chatgpt/

[33]

OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/

[34]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763

[36]

Naina Raisinghani. 2025. Introducing Nano Banana Pro. https://blog.google/innovation-and-ai/products/nano-banana-pro/

[37]

Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie. 2024. SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation. arXiv:2405.08807 [cs.CV] https://arxiv.org/abs/2405.08807

[38]

Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. 2023. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 3418–3432

[39]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5052–5060

[40]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 28130–28139

[41]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12105–12114

[42]

Run Wang, Felix Juefei-Xu, Lei Ma, Xiaofei Xie, Yihao Huang, Jian Wang, and Yang Liu. 2021. FakeSpotter: a simple yet robust baseline for spotting AI-synthesized fake faces. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 3444–3451

[43]

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[44]

Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. 2023. Benchmarking Deepart Detection. arXiv:2302.14475 [cs.CV] https://arxiv.org/abs/2302.14475

[45]

Jingxuan Wei, Cheng Tan, Qi Chen, Gaowei Wu, Siyuan Li, Zhangyang Gao, Linzhuang Sun, Bihui Yu, and Ruifeng Guo. 2025. From words to structured visuals: A benchmark and framework for text-to-diagram generation and editing. In Proceedings of the Computer Vision and Pattern Recognition Conference. 13315–13325

[46]

Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. 2025. A sanity check for ai-generated image detection. In International Conference on Learning Representations

[47]

Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. 2025. Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection. In International Conference on Machine Learning

[48]

Xin Yang, Yuezun Li, and Siwei Lyu. 2019. Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8261–8265

[49]

Yilun Zhao, Chengye Wang, Chuhan Li, and Arman Cohan. 2025. Can multimodal foundation models understand schematic diagrams? an empirical study on information-seeking qa over scientific papers. In Findings of the Association for Computational Linguistics: ACL 2025. 18598–18631

[50]

Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon. 2026. PaperBanana: Automating Academic Illustration for AI Scientists. arXiv:2601.23265 [cs.CL] https://arxiv.org/abs/2601.23265

[51]

Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. 2023. GenImage: A Million-Scale Benchmark for Detecting AI-Generated Image. Advances in Neural Information Processing Systems 36 (2023), 77771–77782

[52]

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, and Yue Zhang. 2026. AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations. arXiv:2602.03828 [cs.AI] https://arxiv.org/abs/2602.03828
