DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation
Pith reviewed 2026-05-15 18:47 UTC · model grok-4.3
The pith
DiagramBank curates 89,422 schematic diagrams from scientific publications to enable retrieval-augmented generation of high-quality figures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities.
What carries the argument
The automated curation pipeline that extracts figures and in-text references, combined with a CLIP-based filter to identify schematic diagrams, and the pairing with rich context from abstracts, captions, and figure-reference pairs.
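As a rough illustration of how a CLIP-style zero-shot filter can separate schematics from other figure types, the sketch below scores an image embedding against text-label embeddings and keeps the image if the "schematic diagram" probability clears a threshold. Everything here is an assumption for illustration: the label prompts, the threshold, the temperature, and the toy 3-dimensional vectors are hypothetical, and the embeddings are taken as precomputed rather than produced by a real CLIP model. The paper's actual prompts and threshold are not stated in this summary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def is_schematic(image_emb, label_embs, threshold=0.5, temperature=100.0):
    """Zero-shot decision: keep the image if P('schematic diagram') >= threshold.

    `label_embs` maps a label prompt to its (precomputed) text embedding.
    `threshold` and `temperature` are hypothetical values, not the paper's.
    """
    labels = list(label_embs)
    logits = [temperature * cosine(image_emb, label_embs[name]) for name in labels]
    probs = dict(zip(labels, softmax(logits)))
    return probs["schematic diagram"] >= threshold, probs


# Toy embeddings standing in for real CLIP text embeddings (assumption).
label_embs = {
    "schematic diagram": [1.0, 0.0, 0.0],
    "data plot": [0.0, 1.0, 0.0],
    "natural image": [0.0, 0.0, 1.0],
}

keep, probs = is_schematic([0.9, 0.1, 0.0], label_embs)
print(keep, probs)
```

In a real pipeline the toy vectors would be replaced by embeddings from a CLIP image and text encoder; the decision rule itself stays the same.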
If this is right
- AI scientist systems gain access to exemplars that support conceptual synthesis in teaser figure generation.
- Multimodal retrieval becomes possible across query types from abstract summaries to specific figure references.
- Exemplar-conditioned synthesis of publication-grade diagrams becomes a practical component in end-to-end paper generation.
- The ready-to-index release format allows direct integration into existing retrieval-augmented pipelines.
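To make "retrieval across query granularities" concrete, here is a minimal sketch that ranks records against a query using one chosen text field at a time. The field names (`abstract`, `caption`, `reference_text`), the record contents, and the bag-of-words embedding are all hypothetical stand-ins; the released schema and the embedding model used by the authors may differ.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a learned text encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, records, field, k=1):
    """Return the top-k records ranked by similarity on a single text field."""
    q = embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, embed(r[field])), reverse=True)
    return ranked[:k]


# Hypothetical records; field names are assumptions, not the released schema.
records = [
    {
        "id": "fig-a",
        "abstract": "We study graph neural networks for molecule property prediction.",
        "caption": "Overview of the message passing architecture.",
        "reference_text": "Figure 1 shows the encoder and the readout layer.",
    },
    {
        "id": "fig-b",
        "abstract": "We propose a retrieval pipeline for question answering.",
        "caption": "Pipeline of the retriever and the generator.",
        "reference_text": "As shown in Figure 2, retrieved passages are fed to the generator.",
    },
]

# Coarse query matched against abstracts, fine query matched against captions.
by_abstract = retrieve("retrieval pipeline for question answering", records, "abstract")
by_caption = retrieve("message passing architecture", records, "caption")
print(by_abstract[0]["id"], by_caption[0]["id"])
```

Keeping one index per field, rather than concatenating all text, lets a caller choose the granularity that matches the query at hand.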
Where Pith is reading between the lines
- The curation approach could be adapted to collect other visual types such as data plots or experimental images for broader coverage.
- Pairing the dataset with text-generation models might enable fully automated manuscript production that includes figures.
- The collection of diagrams with metadata could support studies of effective visual communication patterns across scientific fields.
Load-bearing premise
The CLIP-based filter and automated extraction pipeline can reliably select high-quality schematic diagrams that are representative and useful without introducing substantial noise or selection bias.
What would settle it
A manual review of a random sample from the dataset revealing that a large fraction consists of standard plots, natural images, or low-quality figures rather than true schematic diagrams would undermine the claim of a high-quality, usable resource.
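One way such a manual review could be set up is sketched below: draw a reproducible random sample of record indices, have reviewers count the non-schematic items, and report a confidence interval on the noise rate. The dataset size comes from the paper; the sample size of 500, the seed, and the count of 25 flagged items are purely illustrative assumptions, not reported results.

```python
import math
import random

DATASET_SIZE = 89_422  # number of diagrams reported in the paper


def draw_audit_sample(n_items, sample_size, seed=0):
    """Reproducibly sample distinct record indices for manual review."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), sample_size))


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


indices = draw_audit_sample(DATASET_SIZE, sample_size=500)

# Hypothetical outcome: reviewers flag 25 of the 500 sampled items as noise.
low, high = wilson_interval(failures=25, n=500)
print(f"estimated noise rate between {low:.3f} and {high:.3f}")
```

The interval, not the point estimate alone, is what indicates whether the observed noise rate is compatible with a "high-quality" claim.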
Original abstract
Recent advances in autonomous "AI scientist" systems have demonstrated the ability to automatically write scientific manuscripts and codes with execution. However, producing a publication-grade scientific diagram (e.g., teaser figure) is still a major bottleneck in the "end-to-end" paper generation process. For example, a teaser figure acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate complex logic workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline that extracts figures and corresponding in-text references, and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context from abstract, caption, to figure-reference pairs, enabling information retrieval under different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiagramBank, a dataset of 89,422 schematic diagrams extracted from top-tier scientific publications via an automated pipeline that performs figure extraction, pairs them with metadata (abstracts, captions, figure references), and applies a CLIP-based filter to separate schematics from plots or natural images. The authors release the dataset in a ready-to-index format and supply a retrieval-augmented generation codebase to demonstrate exemplar-driven teaser-figure synthesis.
Significance. If the curation pipeline reliably yields low-noise, representative schematics, DiagramBank would fill a practical gap in AI-scientist pipelines by supplying high-quality visual exemplars for multimodal retrieval and conditioned generation. The combination of scale, rich metadata, and accompanying RAG code makes the resource immediately usable for downstream research on scientific figure synthesis.
major comments (1)
- [Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.
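The validation the referee asks for can be sketched in a few lines: given filter scores and human labels on a hold-out set, compute precision and recall at several thresholds. The scores, labels, and thresholds below are invented toy values used only to show the calculation; they are not measurements from DiagramBank.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'score >= threshold' against boolean labels.

    By convention here, precision (or recall) is 1.0 when its denominator
    is zero, i.e. when nothing is predicted (or nothing is) positive.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy hold-out set: filter scores and human "is schematic" labels (assumed).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

# A small threshold ablation over hypothetical cut-offs.
for threshold in (0.25, 0.50, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is why a threshold ablation is needed alongside any single reported operating point.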
minor comments (1)
- [Abstract] The abstract and release statement mention the Hugging Face and GitHub URLs but do not specify the exact JSON schema or indexing format of the released files; adding a short table or example record would improve reproducibility.
Simulated Author's Rebuttal
We are grateful for the referee's review, which highlights an important aspect of our work. Below we respond to the major comment and commit to revisions that will enhance the manuscript.
Referee: [Curation Pipeline] The description of the automated curation pipeline (including the CLIP-based schematic-vs-plot filter) reports no quantitative validation: no precision/recall figures, no human-agreement scores on a labeled hold-out set, and no ablation on the similarity threshold. Because the central claim is that the released collection contains 89,422 high-quality schematics suitable for retrieval-augmented generation, the absence of these metrics leaves the utility of the dataset unverified.
Authors: We agree with the referee that quantitative validation is necessary to confirm the quality of the curated dataset. In the revised manuscript, we will add a dedicated section on the curation pipeline validation. Specifically, we will provide precision and recall metrics for the CLIP-based schematic filter based on a human-annotated test set of 500 figures. We will also report the inter-annotator agreement score and include an ablation analysis on the effect of different similarity thresholds on the number and quality of retained schematics. These additions will directly address the concern and strengthen the evidence for the dataset's utility in retrieval-augmented generation tasks.
Revision: yes
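The rebuttal does not say which inter-annotator agreement statistic will be reported. Cohen's kappa is one common choice for two annotators, and the sketch below shows how it is computed; the annotator labels are invented for illustration and the choice of kappa is an assumption, not the authors' stated method.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on six figures.
annotator_1 = ["schematic", "schematic", "plot", "plot", "schematic", "natural"]
annotator_2 = ["schematic", "plot", "plot", "plot", "schematic", "natural"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Unlike raw percent agreement, kappa discounts the agreement expected by chance, which matters when one class (for example, plots) dominates the sample.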
Circularity Check
No circularity: dataset curation paper with no derivations or fitted predictions
The paper is a data release describing the construction of DiagramBank via figure extraction and a CLIP-based filter. No equations, predictions, first-principles results, or model fittings are present that could reduce to the inputs by construction. The 89,422 count and downstream utility claims rest on the pipeline description rather than any self-referential logic or self-citation chain. This is a standard non-circular contribution of curated data and code.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: CLIP embeddings can reliably separate schematic diagrams from plots and natural images in scientific publications.
A Dataset Card / Datasheet for DiagramBank
A.1 Dataset summary
Name: DiagramBank
Modalities: image (diagram) + text (caption, i...