
arxiv: 2604.17206 · v1 · submitted 2026-04-19 · 💻 cs.CV

SciDraw-6K: A Multilingual Scientific Illustration Dataset Generated by Google Gemini

Pith reviewed 2026-05-10 07:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords scientific illustration dataset, multilingual text-to-image, AI-generated diagrams, scientific visualization, dataset release, prompt engineering, diffusion model fine-tuning, schematic figures

The pith

SciDraw-6K provides 6,291 AI-generated scientific illustrations paired with prompts in eleven languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a new dataset called SciDraw-6K consisting of thousands of scientific illustrations created by image generation models. These illustrations are paired with text prompts translated into eleven languages to cover a range of scientific fields including physics, chemistry, and biomedicine. Unlike general image datasets, this one focuses on schematic diagrams, mechanism figures, and conceptual graphics that scientists use. The authors detail how they built it and release it publicly along with a website that uses it for generating scientific drawings. This matters because it gives researchers a targeted resource to improve how AI systems handle the specific demands of scientific visualization.

Core claim

SciDraw-6K is a curated dataset of 6,291 scientific illustrations synthesized by image-generation models, with each image paired with prompts in eleven languages: English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian. The images cover eight broad categories (biomedical, chemistry, materials, electronics, environment, AI systems, physics, and other) and are produced mainly by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. The dataset is purpose-built for scientific illustration, including schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters, and is released to support multilingual text-to-image research.

What carries the argument

The SciDraw-6K dataset of synthesized scientific illustrations with multilingual prompt pairings, built through a dedicated generation and curation pipeline for schematic and conceptual graphics.

Load-bearing premise

The generated illustrations accurately and representatively capture the intended scientific concepts without significant factual distortions.

What would settle it

Expert scientists reviewing a sample of the images and finding frequent inaccuracies in depicted mechanisms, structures, or concepts would indicate the dataset may not be suitable as training data.
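One way to make this test concrete is to estimate the error rate from an expert-reviewed sample and attach a confidence interval to it. The sketch below is illustrative only: the sample size and error count are hypothetical, and the paper reports no such review. It uses the Wilson score interval at roughly 95% confidence (z = 1.96).

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion errors / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must lie between 0 and n")
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical review outcome: experts flag 30 of 200 sampled images as inaccurate.
low, high = wilson_interval(errors=30, n=200)
print(f"observed error rate {30 / 200:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

What counts as "frequent" inaccuracy is a judgment the paper does not make; the interval only shows how much a sample of a given size can say about the full set of 6,291 images.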

Figures

Figures reproduced from arXiv: 2604.17206 by Davie Chen.

Figure 1: Number of images per category. The biomedical category dominates, with a long thin tail
Figure 2: Per-language non-null rate of prompt fields. All eleven languages are populated for 100%
Figure 3: English prompt length distribution (characters).
Figure 4: Number of images generated per month.
Potential harms. Synthetic scientific imagery can in principle be misused to fabricate plausible-looking but incorrect figures. We discourage use of SciDraw-6K imagery as ground-truth scientific evidence; the dataset is intended for visualization, education, and ML research purposes.
7 Conclusion. We have introduced SciDraw-6K, a small but high-density dataset of 6,291 …
Figure 5: Gemini source-model distribution across approved images.
Original abstract

We present SciDraw-6K, a curated dataset of 6,291 scientific illustrations synthesized by Google Gemini image-generation models, each paired with prompts in eleven languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, and Russian). Images span eight broad scientific categories -- biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long "other" tail -- and are produced primarily by the gemini-2.5-flash-image and gemini-3-pro-image-preview model families. In contrast to general-purpose text-to-image corpora that dominate the literature, SciDraw-6K is purpose-built for the scientific illustration genre: schematic diagrams, mechanism figures, table-of-contents graphics, and conceptual posters. We describe the construction pipeline, report dataset statistics, and document its use as the substrate of sci-draw.com, a public scientific drawing service. The dataset is released to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization. Dataset: https://huggingface.co/datasets/SciDrawAI/SciDraw-6K Code: https://github.com/SciDrawAI/scidraw-6k
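The per-category statistics the abstract mentions (and Figure 1 plots) amount to a simple tally over the records. The sketch below assumes each record carries a `category` field and an English prompt field named `prompt_en`; both names and the sample rows are hypothetical, and the actual schema of the released dataset may differ.

```python
from collections import Counter

# Hypothetical records; real column names in the release may differ.
records = [
    {"category": "biomedical", "prompt_en": "Schematic of a cell membrane"},
    {"category": "biomedical", "prompt_en": "Mechanism of CRISPR gene editing"},
    {"category": "chemistry", "prompt_en": "Reaction pathway diagram"},
    {"category": "physics", "prompt_en": "Double-slit experiment schematic"},
]


def category_counts(rows):
    """Count images per category, most frequent first."""
    return Counter(row["category"] for row in rows).most_common()


for name, count in category_counts(records):
    share = count / len(records)
    print(f"{name:<12} {count:>4} {share:6.1%}")
```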

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents SciDraw-6K, a dataset of 6,291 scientific illustrations generated primarily by gemini-2.5-flash-image and gemini-3-pro-image-preview models from Google Gemini. Each image is paired with prompts in eleven languages (English, Simplified/Traditional Chinese, Japanese, Korean, German, French, Spanish, Brazilian Portuguese, Italian, Russian) and spans eight categories (biomedical, chemistry, materials, electronics, environment, AI systems, physics, and a long 'other' tail). The work describes the construction pipeline, reports dataset statistics, documents its use as the substrate for the sci-draw.com service, and releases the data on Hugging Face with code on GitHub to support multilingual text-to-image research, domain-adapted diffusion fine-tuning, and prompt-engineering studies for scientific visualization.

Significance. If the generated images prove scientifically accurate, the dataset would fill a useful niche by providing a large, purpose-built, multilingual collection of schematic diagrams, mechanism figures, and conceptual posters that general-purpose text-to-image corpora do not emphasize. The public release with code and the explicit positioning for fine-tuning and prompt studies are strengths that could accelerate domain-specific work in computer vision.

major comments (2)
  1. [Abstract / construction pipeline] Abstract and construction pipeline description: the central claim that SciDraw-6K supplies a 'curated' resource 'suitable for ... scientific visualization research' is unsupported because the manuscript reports no validation of scientific accuracy—no expert review, no error-rate statistics, no comparison against ground-truth diagrams, and no explicit filtering criteria beyond broad category labels. This is load-bearing: without evidence that the images are faithful to the scientific concepts in the prompts (e.g., correct bond angles, circuit topologies, or process mechanisms), the dataset's utility for the stated downstream uses cannot be assessed.
  2. [Dataset statistics / release] Dataset statistics and release sections: the paper provides counts and category breakdowns but supplies no quantitative or qualitative evidence of curation for correctness, such as inter-annotator agreement on factual validity or rejection rates for implausible outputs. This omission directly affects the claim that the resource is ready for training or benchmarking.
minor comments (1)
  1. [Abstract] The abstract lists eleven languages but does not break down image counts or prompt quality per language; adding this table would improve transparency without altering the core contribution.
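The per-language table requested in the minor comment could be computed along the following lines. This is a sketch under assumptions: the language codes, the nested `prompts` field, and the sample rows are all hypothetical and are not taken from the released dataset.

```python
from statistics import mean

# Hypothetical codes for the eleven languages listed in the abstract.
LANGS = ["en", "zh-CN", "zh-TW", "ja", "ko", "de", "fr", "es", "pt-BR", "it", "ru"]


def language_table(rows, langs=LANGS):
    """Return {lang: (non_null_rate, mean_prompt_chars)} over non-empty prompts."""
    table = {}
    for lang in langs:
        prompts = [(row.get("prompts") or {}).get(lang) for row in rows]
        filled = [p for p in prompts if p and p.strip()]
        rate = len(filled) / len(rows) if rows else 0.0
        avg = mean(len(p) for p in filled) if filled else 0.0
        table[lang] = (rate, avg)
    return table


rows = [
    {"prompts": {"en": "Schematic of a lithium-ion cell",
                 "de": "Schema einer Lithium-Ionen-Zelle"}},
    {"prompts": {"en": "Photosynthesis mechanism figure", "de": ""}},
]
for lang, (rate, avg) in language_table(rows, langs=["en", "de", "ja"]).items():
    print(f"{lang:<6} non-null {rate:6.1%}  mean length {avg:5.1f} chars")
```

Character counts are only a rough proxy for prompt quality, and they are not comparable across scripts, so a table like this would support transparency rather than any quality claim.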

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify that the manuscript does not report expert validation or quantitative accuracy metrics for the generated illustrations. We address each point below and will revise the manuscript to clarify the scope of our claims and add an explicit limitations discussion.

Point-by-point responses
  1. Referee: [Abstract / construction pipeline] Abstract and construction pipeline description: the central claim that SciDraw-6K supplies a 'curated' resource 'suitable for ... scientific visualization research' is unsupported because the manuscript reports no validation of scientific accuracy—no expert review, no error-rate statistics, no comparison against ground-truth diagrams, and no explicit filtering criteria beyond broad category labels. This is load-bearing: without evidence that the images are faithful to the scientific concepts in the prompts (e.g., correct bond angles, circuit topologies, or process mechanisms), the dataset's utility for the stated downstream uses cannot be assessed.

    Authors: We agree that no expert review, error-rate statistics, or ground-truth comparisons are reported. The word 'curated' in the abstract and pipeline description refers only to the systematic choice of eight scientific categories, prompt templates, and eleven-language translations; it does not imply post-generation verification of scientific fidelity. Because the images are synthesized by Gemini models, we did not perform such validation. We will revise the abstract to replace 'curated' with 'constructed' and insert a dedicated Limitations section that states the absence of accuracy validation, notes potential inaccuracies (e.g., incorrect diagrams), and clarifies that the dataset is released to enable community studies of AI-generated scientific visuals and domain-specific fine-tuning rather than as a ready-to-use benchmark of verified content. revision: yes

  2. Referee: [Dataset statistics / release] Dataset statistics and release sections: the paper provides counts and category breakdowns but supplies no quantitative or qualitative evidence of curation for correctness, such as inter-annotator agreement on factual validity or rejection rates for implausible outputs. This omission directly affects the claim that the resource is ready for training or benchmarking.

    Authors: We acknowledge that no inter-annotator agreement, rejection rates, or correctness statistics are provided. The released dataset contains all generated images without filtering for factual accuracy, as the goal is to supply a large, unfiltered multilingual corpus of Gemini outputs for research on prompt engineering and fine-tuning. We will update the dataset statistics and release sections to explicitly describe the lack of post-generation filtering and will add a short paragraph on how users may apply their own validation. The accompanying GitHub repository will be extended with example scripts for basic quality checks. revision: yes
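As an illustration of what "basic quality checks" might cover, the sketch below flags unknown category labels, missing prompts, and byte-identical duplicate images. It is not the authors' script: the field names (`category`, `prompts`, `image_bytes`) and the exact category strings are assumptions, and checks of this kind catch structural problems only, not scientific inaccuracies.

```python
import hashlib

# Category labels as named in the abstract; the exact strings are an assumption.
CATEGORIES = {"biomedical", "chemistry", "materials", "electronics",
              "environment", "AI systems", "physics", "other"}


def basic_checks(rows, langs):
    """Return (row_index, issue) pairs for structural problems in the records."""
    issues = []
    seen = {}  # SHA-256 digest -> index of the first row with those image bytes
    for i, row in enumerate(rows):
        if row.get("category") not in CATEGORIES:
            issues.append((i, "unknown category"))
        prompts = row.get("prompts") or {}
        for lang in langs:
            if not (prompts.get(lang) or "").strip():
                issues.append((i, f"missing prompt: {lang}"))
        digest = hashlib.sha256(row.get("image_bytes", b"")).hexdigest()
        if digest in seen:
            issues.append((i, f"duplicate of row {seen[digest]}"))
        else:
            seen[digest] = i
    return issues


# Hypothetical rows with placeholder bytes standing in for image files.
rows = [
    {"category": "physics", "prompts": {"en": "A", "ja": "B"}, "image_bytes": b"img-1"},
    {"category": "physics", "prompts": {"en": "A", "ja": ""}, "image_bytes": b"img-1"},
    {"category": "astronomy", "prompts": {"en": "C", "ja": "D"}, "image_bytes": b"img-2"},
]
issues = basic_checks(rows, langs=["en", "ja"])
for index, issue in issues:
    print(f"row {index}: {issue}")
```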

Circularity Check

0 steps flagged

No circularity: purely descriptive dataset release

Full rationale

The paper is a standard dataset release describing the generation of 6,291 images via Gemini models, multilingual prompt pairing, category statistics, and public hosting. It contains no derivations, equations, predictions, fitted parameters, uniqueness theorems, or self-citations that bear load on any claim. All content is observational and external (model outputs + release links), with no reduction of any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset construction and release paper with no mathematical derivations, fitted parameters, or postulated entities; it relies on standard practices of using commercial image generators and basic curation.

pith-pipeline@v0.9.0 · 5517 in / 1082 out tokens · 44929 ms · 2026-05-10T07:24:35.590788+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. Yuntao Bai et al. Synthetic vision datasets from frontier generative models, 2024. Survey reference.

  2. Davie Chen. SciDraw-6K: A multilingual scientific illustration dataset generated by Google Gemini. Zenodo, 2026. DOI: 10.5281/zenodo.19642870

  3. Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. ICLR, 2023.

  4. Google DeepMind. Gemini: A family of highly capable multimodal models. Technical report, Google, 2024.

  5. Ting-Yao Hsu, C. Lee Giles, and Ting-Hao K. Huang. SciCap: Generating captions for scientific figures. In Findings of EMNLP, 2021.

  6. Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. ICLR Workshop, 2018.

  7. Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. JourneyDB: A benchmark for generative image understanding. In NeurIPS, 2023.

  8. Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and Christoph M. Friedrich. Radiology objects in context (ROCO): A multimodal image-dataset. MICCAI Workshop, 2018.

  9. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  10. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS Datasets and Benchmarks, 2022.

  11. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. ACL, 2023.

  12. Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. ACL, 2023.

  13. Fulong Ye, Guang Liu, Xinya Wu, and Lei Wu. AltDiffusion: A multilingual text-to-image diffusion model. arXiv preprint arXiv:2308.09991, 2023.