pith. machine review for the scientific record.

arxiv: 2604.20857 · v1 · submitted 2026-02-28 · 💻 cs.IR · cs.AI


DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation


Pith reviewed 2026-05-15 18:47 UTC · model grok-4.3

keywords: diagram dataset · retrieval-augmented generation · scientific figures · schematic diagrams · AI scientist systems · multimodal retrieval · figure generation · dataset curation

The pith

DiagramBank curates 89,422 schematic diagrams from scientific publications to enable retrieval-augmented generation of high-quality figures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiagramBank, a dataset of 89,422 schematic diagrams extracted from top-tier scientific publications. It addresses the bottleneck in AI scientist systems where generating publication-grade diagrams, such as teaser figures, remains challenging despite advances in writing manuscripts and code. The dataset is built via an automated pipeline that pulls figures along with their in-text references and uses CLIP to filter for schematics over plots or natural images. Each entry includes context from abstracts, captions, and figure references, supporting retrieval at varying levels of detail. The authors also release a codebase showing how to use the dataset for exemplar-conditioned figure synthesis.
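The "in-text references" step can be pictured with a minimal sketch that scans body text for mentions such as "Figure 3" or "Fig. 3" and groups the mentioning sentences by figure number. The regular expression, the naive sentence splitter, and the name `collect_figure_references` are illustrative assumptions, not the authors' implementation.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches "Figure 3", "Figures 3", "Fig. 3", "Fig 3" (case-insensitive).
FIG_REF = re.compile(r"\bfig(?:ure)?s?\.?\s*(\d+)", re.IGNORECASE)


def collect_figure_references(body_text: str) -> dict[int, list[str]]:
    """Group the sentences that mention a figure by that figure's number."""
    # Naive split: sentence-ending punctuation, whitespace, then a capital letter.
    # "Fig. 2" is not split because a digit, not a capital, follows the period.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", body_text.strip())
    refs = defaultdict(list)
    for sentence in sentences:
        # A set avoids adding the same sentence twice for one figure.
        for number in sorted({int(m.group(1)) for m in FIG_REF.finditer(sentence)}):
            refs[number].append(sentence)
    return dict(refs)


# Made-up body text, used only to exercise the function.
SAMPLE_TEXT = (
    "Our pipeline is shown in Figure 1. "
    "As Fig. 2 illustrates, retrieval is hierarchical. "
    "Results appear in Table 1. "
    "We revisit Figure 1 in Section 5."
)
figure_refs = collect_figure_references(SAMPLE_TEXT)
```

A production pipeline would also need to handle ranges ("Figures 3 and 4"), sub-panels ("Fig. 2a"), and appendix numbering, none of which this sketch attempts.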

Core claim

We present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities.

What carries the argument

The argument rests on three pieces: an automated curation pipeline that extracts figures and their in-text references, a CLIP-based filter that separates schematic diagrams from plots and natural images, and the pairing of each diagram with context from abstracts, captions, and figure-reference pairs.
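The abstract does not say how the CLIP-based filter reaches its decision. A common zero-shot pattern is to compare an image embedding against text-prompt embeddings by cosine similarity, apply a softmax, and keep the image when the "schematic" probability clears a threshold. The sketch below shows only that decision rule; the toy 3-dimensional vectors stand in for real CLIP embeddings, and the prompt labels, `logit_scale`, and `threshold` are assumptions rather than values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def is_schematic(image_emb, prompt_embs, threshold=0.5, logit_scale=100.0):
    """Return (keep, probabilities) for one image embedding.

    threshold and logit_scale are assumed values, not taken from the paper.
    """
    labels = list(prompt_embs)
    logits = [logit_scale * cosine(image_emb, prompt_embs[label]) for label in labels]
    probs = dict(zip(labels, softmax(logits)))
    return probs["schematic diagram"] >= threshold, probs


# Toy vectors standing in for CLIP text embeddings of three hypothetical prompts.
PROMPTS = {
    "schematic diagram": [1.0, 0.1, 0.0],
    "data plot": [0.1, 1.0, 0.0],
    "natural image": [0.0, 0.1, 1.0],
}

keep_diagram, diagram_probs = is_schematic([0.9, 0.2, 0.1], PROMPTS)
keep_plot, plot_probs = is_schematic([0.1, 0.9, 0.0], PROMPTS)
```

In a real pipeline the vectors would come from a CLIP image and text encoder, and the threshold is exactly the free parameter whose effect the referee asks the authors to ablate.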

If this is right

  • AI scientist systems gain access to exemplars that support conceptual synthesis in teaser figure generation.
  • Multimodal retrieval becomes possible across query types from abstract summaries to specific figure references.
  • Exemplar-conditioned synthesis of publication-grade diagrams becomes a practical component in end-to-end paper generation.
  • The ready-to-index release format allows direct integration into existing retrieval-augmented pipelines.
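Retrieval "under different query granularities" can be pictured as indexing each text field separately and choosing which field to search at query time. The following is a minimal sketch that uses bag-of-words cosine similarity in place of learned embeddings; the field names `abstract`, `caption`, and `figure_reference` and the two records are hypothetical, since this page does not give the released schema.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(records, query, field, k=1):
    """Rank records by similarity between the query and one chosen text field."""
    q = bow(query)
    ranked = sorted(records, key=lambda r: cosine(q, bow(r[field])), reverse=True)
    return ranked[:k]


# Made-up records; real entries would also carry the diagram image.
RECORDS = [
    {
        "id": "a",
        "abstract": "graph neural networks for molecule property prediction",
        "caption": "overview of the message passing architecture",
        "figure_reference": "figure 1 shows the encoder and readout stages",
    },
    {
        "id": "b",
        "abstract": "retrieval augmented generation for question answering",
        "caption": "pipeline with retriever and generator modules",
        "figure_reference": "as shown in figure 2 the retriever feeds the generator",
    },
]

coarse_hit = retrieve(RECORDS, "retrieval augmented generation", "abstract")[0]
fine_hit = retrieve(RECORDS, "message passing architecture", "caption")[0]
```

Searching `abstract` corresponds to a coarse, topic-level query, while `caption` or `figure_reference` corresponds to a query about one specific figure.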

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curation approach could be adapted to collect other visual types such as data plots or experimental images for broader coverage.
  • Pairing the dataset with text-generation models might enable fully automated manuscript production that includes figures.
  • The collection of diagrams with metadata could support studies of effective visual communication patterns across scientific fields.

Load-bearing premise

The CLIP-based filter and automated extraction pipeline can reliably select high-quality schematic diagrams that are representative and useful without introducing substantial noise or selection bias.

What would settle it

A manual review of a random sample from the dataset would settle it. If a large fraction of the sample turned out to be standard plots, natural images, or low-quality figures rather than true schematic diagrams, the claim of a high-quality, usable resource would be undermined.
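This check can be made quantitative. If reviewers confirm `successes` true schematics among `n` randomly sampled entries, a Wilson score interval bounds the dataset-wide fraction of schematics. The sketch below assumes simple random sampling and a 95% level (`z = 1.96`); the counts in the usage line are hypothetical, not results from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical review outcome: 450 of 500 sampled figures judged to be schematics.
low, high = wilson_interval(450, 500)
```

The lower bound is the number to watch: it is the most pessimistic purity estimate still consistent with the sample.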

Figures

Figures reproduced from arXiv: 2604.20857 by Ling Yue, Shaowu Pan, Tingwen Zhang, Zhen Xu.

Figure 1: DiagramBank-RAG framework (self-generated example). The above workflow diagram is automatically generated by …
Figure 2: Statistical Overview of the Dataset. (a) The average caption length (number of words) has steadily decreased from …
Figure 3: Hierarchical Three-Stage Retrieval Pipeline. The system uses a coarse-to-fine filtering strategy to ensure domain …
Figure 4: Impact of Retrieval. Comparing the baseline generation (a) against our RAG-augmented approach (b). The RAG model …
Figure 5: Retrieved References. The top three retrieved ex…
Original abstract

Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DiagramBank, a dataset of 89,422 schematic diagrams extracted from top-tier scientific publications via an automated pipeline that performs figure extraction, pairs them with metadata (abstracts, captions, figure references), and applies a CLIP-based filter to separate schematics from plots or natural images. The authors release the dataset in a ready-to-index format and supply a retrieval-augmented generation codebase to demonstrate exemplar-driven teaser-figure synthesis.

Significance. If the curation pipeline reliably yields low-noise, representative schematics, DiagramBank would fill a practical gap in AI-scientist pipelines by supplying high-quality visual exemplars for multimodal retrieval and conditioned generation. The combination of scale, rich metadata, and accompanying RAG code makes the resource immediately usable for downstream research on scientific figure synthesis.

major comments (1)
  1. [Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.
minor comments (1)
  1. [Abstract] The abstract and release statement mention the Hugging Face and GitHub URLs but do not specify the exact JSON schema or indexing format of the released files; adding a short table or example record would improve reproducibility.
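The validation requested in the major comment could take the following shape: given filter scores and human labels on a hold-out set, compute precision and recall at several candidate thresholds. This is a sketch under stated assumptions; the scores, labels, and thresholds are made up for illustration, and returning 1.0 for an empty denominator is a convention chosen here.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule `score >= threshold`.

    labels[i] is True when a human marked figure i as a schematic.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    # Convention: an empty denominator yields 1.0 rather than an error.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def sweep(scores, labels, thresholds):
    """Threshold ablation: map each threshold to its (precision, recall)."""
    return {t: precision_recall(scores, labels, t) for t in thresholds}


# Made-up hold-out set: six filter scores with human schematic labels.
SCORES = [0.95, 0.80, 0.65, 0.55, 0.40, 0.20]
LABELS = [True, True, False, True, False, False]

table = sweep(SCORES, LABELS, [0.3, 0.5, 0.7])
for threshold, (precision, recall) in table.items():
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision, which is why a single reported dataset size says little without the threshold and its measured error rates.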

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful for the referee's review, which highlights an important aspect of our work. Below we respond to the major comment and commit to revisions that will enhance the manuscript.

Point-by-point responses
  1. Referee: [Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.

    Authors: We agree with the referee that quantitative validation is necessary to confirm the quality of the curated dataset. In the revised manuscript, we will add a dedicated section on the curation pipeline validation. Specifically, we will provide precision and recall metrics for the CLIP-based schematic filter based on a human-annotated test set of 500 figures. We will also report the inter-annotator agreement score and include an ablation analysis on the effect of different similarity thresholds on the number and quality of retained schematics. These additions will directly address the concern and strengthen the evidence for the dataset's utility in retrieval-augmented generation tasks. revision: yes
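The rebuttal does not name the agreement statistic it will report. One common choice for two annotators is Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below assumes two annotators and categorical labels; the label names and annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


# Made-up annotations for six figures.
ANNOTATOR_1 = ["schematic", "schematic", "plot", "plot", "schematic", "photo"]
ANNOTATOR_2 = ["schematic", "plot", "plot", "plot", "schematic", "photo"]

kappa = cohens_kappa(ANNOTATOR_1, ANNOTATOR_2)
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.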

Circularity Check

0 steps flagged

No circularity: dataset curation paper with no derivations or fitted predictions

full rationale

The paper is a data release describing the construction of DiagramBank via figure extraction and a CLIP-based filter. No equations, predictions, first-principles results, or model fittings are present that could reduce to the inputs by construction. The 89,422 count and downstream utility claims rest on the pipeline description rather than any self-referential logic or self-citation chain. This is a standard non-circular contribution of curated data and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that schematic diagrams can be automatically distinguished from other figure types and that the extracted examples are suitable exemplars for generation tasks.

axioms (1)
  • Domain assumption: CLIP embeddings can reliably separate schematic diagrams from plots and natural images in scientific publications.
    The curation pipeline invokes this distinction to filter the collected figures.


