arxiv: 2604.08211 · v1 · submitted 2026-04-09 · 💻 cs.CV

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords: AI-generated images · scientific figure detection · detection benchmark · zero-shot transfer · image forensics · multimodal generators · research integrity · post-processing robustness

The pith

Existing AI image detectors fail dramatically on scientific figures from modern generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the first benchmark for detecting AI-generated scientific figures, which, unlike natural images, are structured and text-dense. It builds the dataset with an agent-based pipeline that analyzes real papers, creates structured prompts, synthesizes figures, and refines them through review, producing matched real-synthetic pairs across figure categories and generation sources. Benchmark evaluations show that current detectors perform poorly in zero-shot transfer, overfit to specific generators, and degrade under common post-processing such as compression or resizing. This matters because high-quality fakes threaten the integrity of scholarly publishing, and the results indicate that existing tools are not ready for this emerging threat.

Core claim

The paper establishes that modern multimodal generators can now produce near-publishable scientific figures, but detection methods developed for open-domain images do not transfer to this setting. The SciFigDetect benchmark, built with an agent pipeline for data creation and review-driven refinement, covers multiple figure types and generation sources. Testing under zero-shot, cross-generator, and degraded conditions reveals dramatic failures, strong overfitting to individual generators, and fragility to typical corruptions, exposing a gap between current capabilities and the distribution of synthetic scholarly visuals.

What carries the argument

The agent-based data pipeline that retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidates, and filters them via a review-driven refinement loop to create aligned real-synthetic pairs.
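
The review-driven refinement step can be pictured as a generate, review, refine loop. The sketch below is illustrative only and is not the authors' implementation: `synthesize`, `review`, the `Review` fields, the acceptance threshold, and the round limit are all hypothetical placeholders, and the stub functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    score: float    # hypothetical quality score in [0, 1]
    feedback: str   # hypothetical reviewer notes fed into the next prompt


def refine_figure(
    prompt: str,
    synthesize: Callable[[str], str],
    review: Callable[[str], Review],
    threshold: float = 0.8,  # assumed acceptance threshold
    max_rounds: int = 3,     # assumed cap on refinement rounds
) -> Optional[str]:
    """Return the first candidate the reviewer accepts, or None if none passes."""
    for _ in range(max_rounds):
        candidate = synthesize(prompt)
        verdict = review(candidate)
        if verdict.score >= threshold:
            return candidate
        # Fold the reviewer's feedback into the prompt for the next attempt.
        prompt = f"{prompt}\nReviewer feedback: {verdict.feedback}"
    return None  # filtered out: no candidate reached the threshold


# Stub generator and reviewer so the sketch runs without any model access.
def stub_synthesize(prompt: str) -> str:
    return f"figure<{prompt.count('Reviewer feedback')} revisions>"


def stub_review(candidate: str) -> Review:
    # Pretend quality improves with each revision.
    revisions = int(candidate.split("<")[1].split(" ")[0])
    return Review(score=0.5 + 0.2 * revisions, feedback="tighten axis labels")


accepted = refine_figure("bar chart of accuracy per model", stub_synthesize, stub_review)
```

Returning `None` for candidates that never pass models the filtering role the pipeline description assigns to the review loop; how the real pipeline scores or rejects candidates is not stated in the text above.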

If this is right

  • Current detection methods will not reliably identify AI-generated scientific figures in zero-shot use across different tools.
  • Generator-specific overfitting means practical detectors must handle multiple sources to avoid sharp performance drops.
  • Fragility under common post-processing corruptions implies detectors may miss altered figures in published work.
  • A dedicated benchmark for scientific figures is required to develop more robust forensic tools for scholarly content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection models may benefit from incorporating scientific semantics and text-figure alignment as explicit features.
  • Academic publishing platforms could adopt similar benchmarks for automated integrity checks on submitted visuals.
  • The pipeline approach could extend to other structured academic elements such as tables or chemical diagrams.
  • As generators advance, repeated benchmarking will be needed to track the evolving gap in detection performance.

Load-bearing premise

The synthetic figures produced by the agent pipeline accurately represent the distribution and post-processing conditions of real AI-generated scientific figures from diverse sources.

What would settle it

Test the benchmarked detectors on a fresh set of AI-generated scientific figures produced directly by current image generators, such as GPT Image 1 or Nano Banana Pro, for actual research papers, and measure whether performance matches the reported zero-shot and degradation results.
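
One way to run such a check is to score a fresh labelled set with a detector and compare a threshold-free metric against the reported numbers. The sketch below computes AUROC in plain Python from pairwise comparisons. The scores are made up for illustration, and the choice of AUROC is an assumption, since the text above does not say which metrics the paper reports.

```python
from typing import Sequence


def auroc(scores_fake: Sequence[float], scores_real: Sequence[float]) -> float:
    """Probability that a random fake outscores a random real (ties count half)."""
    if not scores_fake or not scores_real:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for f in scores_fake:
        for r in scores_real:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(scores_fake) * len(scores_real))


# Made-up detector scores (higher means "more likely AI-generated").
fresh_fake = [0.9, 0.4, 0.35, 0.8]
fresh_real = [0.1, 0.4, 0.3, 0.6]
fresh_auroc = auroc(fresh_fake, fresh_real)
```

The pairwise loop is quadratic in the number of samples, which is fine for a spot check; for larger sets, a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.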

Figures

Figures reproduced from arXiv: 2604.08211 by Changfa Mo, Chenzhuo Zhao, Haotian Liu, Xiaobai Li, You Hu.

Figure 1: Overview of our benchmark. Representative real–synthetic examples from three figure categories: …
Figure 2: Overview of the data construction pipeline. From licensed source papers and figure-related context, our framework …
Figure 3: Dataset statistics. Left: per-category sample counts …
Figure 4: Cross-generator generalization gap. Averaged across …

Original abstract

Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources and aligned real–synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SciFigDetect, the first benchmark dataset and evaluation for detecting AI-generated scientific figures. It describes an agent-based pipeline that retrieves papers, performs multimodal analysis, generates structured prompts, synthesizes figures via multiple generators, and applies review-driven refinement to produce aligned real-synthetic pairs across figure categories. Experiments benchmark existing detectors in zero-shot transfer, cross-generator, and post-processing corruption settings, reporting dramatic failures, generator-specific overfitting, and fragility to common degradations.

Significance. If the synthetic figures are representative of real AI-generated scientific figures, the benchmark would usefully expose a gap in current AIGI detectors for structured, text-dense scholarly imagery and provide a public dataset to drive future work on scientific-figure forensics. The work is empirical benchmark construction with no machine-checked proofs or parameter-free derivations, but the release of the dataset and the reported performance gaps constitute a concrete starting point for the community.

major comments (3)
  1. [Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.
  2. [Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.
  3. [Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a clearer statement of the exact number of figures, generators, and categories in the released benchmark.
  2. [Figures and tables] Figure captions and table headers should explicitly indicate whether reported metrics are averaged over all generators or broken down per generator.
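
The quantitative validation requested in major comment 1 could start from a two-sample comparison of a simple per-figure statistic. The sketch below computes the two-sample Kolmogorov-Smirnov statistic in plain Python. The text-density values are invented for illustration, and how text density would actually be measured (for example, from OCR boxes) is an assumption the report does not specify.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:  # the largest gap is attained at an observed value
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up text-density values (fraction of pixels covered by text).
real_density = [0.10, 0.12, 0.15, 0.20]
synthetic_density = [0.18, 0.22, 0.25, 0.30]
d = ks_statistic(real_density, synthetic_density)
```

A statistic near 0 suggests similar distributions and a value near 1 suggests clearly separated ones. A full test would also need a p-value, which a routine such as SciPy's `scipy.stats.ks_2samp` provides.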

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We have carefully considered each major comment and provide our responses below. We commit to revising the manuscript accordingly where appropriate.

  1. Referee: [Benchmark construction / data pipeline] Benchmark construction section (pipeline description): the central claim that the agent pipeline produces figures whose distribution matches real AI-generated scientific figures is not supported by any quantitative validation (e.g., statistical tests on layout statistics, text density, semantic alignment, or detection-score distributions against a held-out set of real AI figures from arXiv or similar sources). Without such evidence, the reported zero-shot failures and overfitting results may not transfer.

    Authors: We agree with the referee that direct quantitative validation against real AI-generated figures would strengthen the benchmark's validity. However, as this is the first such benchmark, a comprehensive held-out set of verified real AI-generated scientific figures from sources like arXiv was not available for comparison. Our pipeline is designed to produce realistic pairs by starting from real papers and using agent-based multimodal analysis to generate aligned synthetics. We include extensive qualitative examples and category coverage in the manuscript. In the revised version, we will add basic quantitative statistics (e.g., text density, layout metrics) for both real and synthetic figures from the source papers and discuss the limitations of the construction process. We will also tone down the claim of exact distributional match to 'closely approximates real-world scientific figure generation scenarios.' revision: partial

  2. Referee: [Experiments] Experiments section (zero-shot and cross-generator protocols): the evaluation does not specify how the zero-shot split is constructed (e.g., whether any generator or paper source overlap exists between training and test) or provide per-generator breakdown tables that would allow readers to assess the strength of the overfitting claim.

    Authors: We apologize for the lack of clarity in the manuscript. The zero-shot evaluation uses detectors pre-trained on existing AIGI benchmarks (such as those for natural images) and tests them directly on our SciFigDetect dataset without any fine-tuning or exposure to our data, ensuring no overlap with training sets of those detectors. For cross-generator, we use leave-one-generator-out protocols. We will revise the Experiments section to explicitly describe the split construction, confirm no paper or generator overlap in the relevant settings, and include detailed per-generator performance tables to support the overfitting observations. revision: yes

  3. Referee: [Experiments / degraded-image settings] Degradation experiments: the post-processing corruptions (compression, resizing, etc.) are applied, but the paper does not report whether these operations are calibrated to the distribution of real post-processed scientific figures or merely chosen ad hoc, which affects the fragility conclusion.

    Authors: The referee correctly notes that the corruptions were chosen based on standard practices in the AIGI detection literature (e.g., JPEG compression levels, resizing factors commonly used in robustness tests) rather than calibrated specifically to observed distributions in scientific figures. This is a limitation. In the revision, we will add a discussion justifying the choice of parameters with references to prior work and, if feasible, include an analysis of typical post-processing in arXiv figures. We will also clarify that the results demonstrate fragility to these common degradations, which are relevant even if not perfectly calibrated. revision: partial
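
The leave-one-generator-out protocol mentioned in the second response can be sketched as follows. This is a minimal illustration, not the authors' code: the `(figure_id, generator_name)` sample layout and the generator names `gen_a`, `gen_b`, `gen_c` are hypothetical placeholders, since the benchmark's generators are not listed in the text above.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, str]  # (figure_id, generator_name); hypothetical layout


def leave_one_generator_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_generator, train, test) with no generator overlap."""
    generators = sorted({gen for _, gen in samples})
    for held_out in generators:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test


# Placeholder synthetic samples for illustration.
synthetic = [
    ("fig01", "gen_a"), ("fig02", "gen_a"),
    ("fig03", "gen_b"),
    ("fig04", "gen_c"), ("fig05", "gen_c"),
]
folds: Dict[str, Tuple[int, int]] = {
    held_out: (len(train), len(test))
    for held_out, train, test in leave_one_generator_out(synthetic)
}
```

Each fold trains on all generators but one and tests on the held-out generator, so a sharp drop on the test side is what the overfitting claim predicts. Since the authors also promise to confirm that there is no paper overlap, a fuller version would split the real figures by source paper as well, which this sketch leaves out.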

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and external evaluation

The paper describes an agent-based pipeline to generate a new benchmark dataset of real-synthetic scientific figure pairs and then reports empirical performance of existing external detectors under zero-shot, cross-generator, and corruption settings. No equations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the provided text. The central claims (detector failure modes) are direct measurements on held-out data rather than quantities derived from the pipeline outputs by construction. The skeptic's concern about distribution match is a question of validity, not of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on domain assumptions about pipeline fidelity rather than new free parameters or invented entities.

axioms (1)
  • Domain assumption: Multimodal understanding of paper text and figures can produce structured prompts that yield realistic scientific figures when fed to generators.
    This underpins the entire data construction pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5546 in / 1211 out tokens · 73459 ms · 2026-05-10T17:32:30.014105+00:00 · methodology
