arxiv: 2604.20857 · v1 · submitted 2026-02-28 · 💻 cs.IR · cs.AI

DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

Tingwen Zhang, Ling Yue, Zhen Xu, Shaowu Pan

Pith reviewed 2026-05-15 18:47 UTC · model grok-4.3

keywords: diagram dataset · retrieval-augmented generation · scientific figures · schematic diagrams · AI scientist systems · multimodal retrieval · figure generation · dataset curation

The pith

DiagramBank curates 89,422 schematic diagrams from scientific publications to enable retrieval-augmented generation of high-quality figures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DiagramBank, a dataset of 89,422 schematic diagrams extracted from top-tier scientific publications. It addresses the bottleneck in AI scientist systems where generating publication-grade diagrams, such as teaser figures, remains challenging despite advances in writing manuscripts and code. The dataset is built via an automated pipeline that pulls figures along with their in-text references and uses CLIP to filter for schematics over plots or natural images. Each entry includes context from abstracts, captions, and figure references, supporting retrieval at varying levels of detail. The authors also release a codebase showing how to use the dataset for exemplar-conditioned figure synthesis.

Core claim

We present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities.

What carries the argument

The automated curation pipeline that extracts figures and in-text references, combined with a CLIP-based filter to identify schematic diagrams, and the pairing with rich context from abstracts, captions, and figure-reference pairs.
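
The filtering step can be pictured as zero-shot classification over embeddings. The sketch below is an illustration only: the paper summary does not state the prompts, the threshold, or the scoring rule, so the label names, `threshold`, and `temperature` are assumptions, and the three-dimensional vectors stand in for real CLIP embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def is_schematic(image_emb, label_embs, threshold=0.5, temperature=0.01):
    """Return (keep, probs): keep is True when 'schematic' reaches the threshold.

    label_embs maps a label name to a text embedding. The label set,
    threshold, and temperature are hypothetical, not the authors' settings.
    """
    labels = list(label_embs)
    sims = [cosine(image_emb, label_embs[name]) for name in labels]
    probs = dict(zip(labels, softmax(sims, temperature)))
    return probs["schematic"] >= threshold, probs

# Toy stand-ins for text embeddings of three candidate labels.
LABEL_EMBS = {
    "schematic": [1.0, 0.0, 0.0],
    "plot": [0.0, 1.0, 0.0],
    "photo": [0.0, 0.0, 1.0],
}

keep, probs = is_schematic([0.9, 0.1, 0.0], LABEL_EMBS)
print(keep, {name: round(p, 3) for name, p in probs.items()})
```

Raising `threshold` trades recall for precision, which is why the choice of cut-off matters for the composition of the final collection.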

If this is right

AI scientist systems gain access to exemplars that support conceptual synthesis in teaser figure generation.
Multimodal retrieval becomes possible across query types from abstract summaries to specific figure references.
Exemplar-conditioned synthesis of publication-grade diagrams becomes a practical component in end-to-end paper generation.
The ready-to-index release format allows direct integration into existing retrieval-augmented pipelines.
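
Retrieval at different query granularities can be sketched as ranking the same records against different context fields. The example below is a toy under stated assumptions: the field names `abstract`, `caption`, and `reference`, the two invented records, and the bag-of-words similarity are all hypothetical and are not the released schema or the authors' retriever.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lower-cased bag-of-words counts for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counters; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, records, field, k=1):
    """Rank records by similarity between the query and one context field."""
    q = bow(query)
    scored = [(cosine(q, bow(r[field])), r["id"]) for r in records]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [record_id for _, record_id in scored[:k]]

# Invented toy records; field names are hypothetical.
RECORDS = [
    {
        "id": "fig-a",
        "abstract": "We study retrieval augmented generation for question answering.",
        "caption": "Overview of the retriever and generator modules.",
        "reference": "Figure 1 shows how retrieved passages are fed to the generator.",
    },
    {
        "id": "fig-b",
        "abstract": "We propose a graph neural network for molecule property prediction.",
        "caption": "Message passing architecture with three layers.",
        "reference": "As shown in Figure 2, node features are aggregated from neighbors.",
    },
]

print(retrieve("retrieval augmented generation", RECORDS, "abstract"))
print(retrieve("message passing layers", RECORDS, "caption"))
```

A real pipeline would presumably swap the bag-of-words vectors for learned text or image embeddings; the point here is only that the `field` argument selects the granularity.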

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curation approach could be adapted to collect other visual types such as data plots or experimental images for broader coverage.
Pairing the dataset with text-generation models might enable fully automated manuscript production that includes figures.
The collection of diagrams with metadata could support studies of effective visual communication patterns across scientific fields.

Load-bearing premise

The CLIP-based filter and automated extraction pipeline can reliably select high-quality schematic diagrams that are representative and useful without introducing substantial noise or selection bias.

What would settle it

A manual review of a random sample from the dataset revealing that a large fraction consists of standard plots, natural images, or low-quality figures rather than true schematic diagrams would undermine the claim of a high-quality, usable resource.
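
One way to make such an audit quantitative is to put a confidence interval on the observed fraction of non-schematics. The sketch below uses the Wilson score interval for a binomial proportion; the sample size and error count in the usage line are hypothetical and do not come from the paper.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    errors: number of audited figures judged not to be schematics.
    n: size of the random audit sample.
    z: normal quantile (1.96 corresponds to roughly 95% confidence).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical audit: 25 of 500 sampled figures are not schematics.
low, high = wilson_interval(25, 500)
print(f"noise rate between {low:.3f} and {high:.3f}")
```

If the upper bound of the interval is still small, the "large fraction" failure scenario is unlikely; if the lower bound is already large, the quality claim is in trouble.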

Figures

Figures reproduced from arXiv: 2604.20857 by Ling Yue, Shaowu Pan, Tingwen Zhang, Zhen Xu.

**Figure 1.** DiagramBank-RAG framework (self-generated example). The above workflow diagram is automatically generated by ... [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

**Figure 2.** Statistical Overview of the Dataset. (a) The average caption length (number of words) has steadily decreased from ... [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Hierarchical Three-Stage Retrieval Pipeline. The system uses a coarse-to-fine filtering strategy to ensure domain ... [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Impact of Retrieval. Comparing the baseline generation (a) against our RAG-augmented approach (b). The RAG model ... [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

**Figure 5.** Retrieved References. The top three retrieved ex... [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

Original abstract

Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DiagramBank supplies a sizable set of schematic diagrams with paper metadata for retrieval-based figure generation, but the CLIP filter that defines its quality has no reported accuracy numbers.

The main point is that this paper releases DiagramBank, a collection of 89,422 schematic diagrams pulled from top-tier scientific papers, each tied to abstracts, captions, and in-text references. The goal is to give retrieval systems concrete exemplars for generating conceptual figures such as teaser images, which current AI scientist pipelines still handle poorly compared to data plots. The authors also ship a ready-to-index version on Hugging Face plus sample retrieval-augmented generation code, which makes the resource immediately testable. That combination of scale, schematic focus, and paired metadata is the concrete addition. It directly targets a workflow gap rather than re-deriving existing figure extraction methods.

The curation pipeline description is clear enough on paper: extract figures, apply CLIP to keep schematics over plots or photos, and retain the surrounding text. For anyone building multimodal retrieval or automated scientific writing tools, this is usable material. The citation choices are standard and do not overclaim prior coverage.

The soft spot is the missing validation of the filter itself. The abstract and description state that CLIP separates schematics from other figure types, yet no precision, recall, threshold ablations, or human agreement scores appear. Without those, the fraction of genuinely high-quality diagrams versus misclassified items stays unknown, and that directly limits how much weight the downstream use case can carry. The rest of the paper stays proportionate and does not overstate what the release proves.

This is for readers working on retrieval-augmented generation, scientific document AI, or multimodal datasets who need real schematic examples with context. It is not a methods breakthrough but a practical data artifact. I would bring it to a reading group as a "maybe", to discuss the curation choices and how the data performs in actual retrieval setups. I would not cite the paper itself in the next year unless I start using the dataset.

It deserves peer review because the resource addresses a documented bottleneck and the release lowers the barrier for others to test it; reviewers can press on the filter validation without the work being dismissed outright.

Referee Report

1 major / 1 minor

Summary. The paper introduces DiagramBank, a dataset of 89,422 schematic diagrams extracted from top-tier scientific publications via an automated pipeline that performs figure extraction, pairs them with metadata (abstracts, captions, figure references), and applies a CLIP-based filter to separate schematics from plots or natural images. The authors release the dataset in a ready-to-index format and supply a retrieval-augmented generation codebase to demonstrate exemplar-driven teaser-figure synthesis.

Significance. If the curation pipeline reliably yields low-noise, representative schematics, DiagramBank would fill a practical gap in AI-scientist pipelines by supplying high-quality visual exemplars for multimodal retrieval and conditioned generation. The combination of scale, rich metadata, and accompanying RAG code makes the resource immediately usable for downstream research on scientific figure synthesis.

major comments (1)

[Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.
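
The validation requested here could take the following shape: given filter scores and human labels on a hold-out set, report precision, recall, and the number of retained figures at several thresholds. This is a minimal sketch; the scores, labels, and thresholds are invented for illustration and say nothing about the actual filter.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule 'score >= threshold'.

    labels are booleans (True = human says schematic). By convention,
    an undefined ratio (zero denominator) is reported as 1.0.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def threshold_ablation(scores, labels, thresholds):
    """One row per threshold: how many figures are kept and how clean they are."""
    rows = []
    for t in thresholds:
        precision, recall = precision_recall(scores, labels, t)
        kept = sum(1 for s in scores if s >= t)
        rows.append({"threshold": t, "kept": kept,
                     "precision": precision, "recall": recall})
    return rows

# Invented hold-out set: filter scores and human judgments.
SCORES = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
LABELS = [True, True, False, True, False, False]

for row in threshold_ablation(SCORES, LABELS, [0.2, 0.5, 0.7]):
    print(row)
```

A table like this makes the trade-off explicit: a stricter threshold keeps fewer figures with higher precision, while a looser one retains more schematics at the cost of more plots and photos slipping through.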

minor comments (1)

[Abstract] The abstract and release statement mention the Hugging Face and GitHub URLs but do not specify the exact JSON schema or indexing format of the released files; adding a short table or example record would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful for the referee's review, which highlights an important aspect of our work. Below we respond to the major comment and commit to revisions that will enhance the manuscript.

Referee: [Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.

Authors: We agree with the referee that quantitative validation is necessary to confirm the quality of the curated dataset. In the revised manuscript, we will add a dedicated section on the curation pipeline validation. Specifically, we will provide precision and recall metrics for the CLIP-based schematic filter based on a human-annotated test set of 500 figures. We will also report the inter-annotator agreement score and include an ablation analysis on the effect of different similarity thresholds on the number and quality of retained schematics. These additions will directly address the concern and strengthen the evidence for the dataset's utility in retrieval-augmented generation tasks.

revision: yes
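
The rebuttal does not say which agreement statistic would be reported. Cohen's kappa is one common choice for two annotators, and the sketch below computes it from scratch; the two label lists are invented and the choice of kappa is an assumption, not the authors' stated plan.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented labels: "s" = schematic, "p" = plot or photo.
ANNOTATOR_1 = ["s", "s", "s", "s", "s", "p", "p", "p", "p", "p"]
ANNOTATOR_2 = ["s", "s", "s", "s", "p", "p", "p", "p", "p", "s"]

print(round(cohens_kappa(ANNOTATOR_1, ANNOTATOR_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one class (for example, schematics) dominates the annotated sample.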

Circularity Check

0 steps flagged

No circularity: dataset curation paper with no derivations or fitted predictions

The paper is a data release describing the construction of DiagramBank via figure extraction and a CLIP-based filter. No equations, predictions, first-principles results, or model fittings are present that could reduce to the inputs by construction. The 89,422 count and downstream utility claims rest on the pipeline description rather than any self-referential logic or self-citation chain. This is a standard non-circular contribution of curated data and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that schematic diagrams can be automatically distinguished from other figure types and that the extracted examples are suitable exemplars for generation tasks.

axioms (1)

Domain assumption: CLIP embeddings can reliably separate schematic diagrams from plots and natural images in scientific publications.
The curation pipeline invokes this distinction to filter the collected figures.
