pith. machine review for the scientific record.

arxiv: 2601.00264 · v2 · submitted 2026-01-01 · 💻 cs.CV

Recognition: no theorem link

S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords scientific dataset · multimodal learning · image-text alignment · scientific figures · AI for science · multimodal large language models · caption enhancement

The pith

S1-MMAlign supplies 15.5 million enhanced image-text pairs from scientific papers to close the semantic gap for multimodal AI in science.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces S1-MMAlign, a dataset of over 15.5 million high-quality image-text pairs extracted from 2.5 million open-access papers across disciplines like physics and biology. It addresses the weak alignment in raw scientific captions by using a pipeline that employs multimodal large language models to generate better captions from abstracts and citation contexts. This matters because it supplies a foundational resource for training AI models that understand complex scientific figures and text together. Without such aligned data, progress in AI for scientific discovery remains limited by the semantic gap between images and descriptions. The authors validate that the enhancements improve data quality and model performance on tasks like captioning and reasoning.

Core claim

S1-MMAlign is a large-scale multi-disciplinary dataset of 15.5 million image-text pairs derived from 2.5 million papers, created through an AI semantic enhancement pipeline that recaptions figures using context from abstracts and citations to achieve better alignment for multimodal learning in science.

What carries the argument

The AI-ready semantic enhancement pipeline that leverages multimodal large language models to synthesize comprehensive image captions from paper abstracts and figure citation contexts.
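As a concrete sketch, the context-synthesis step might look like the following. This is not the authors' code: the record fields, prompt wording, and the three-context cap are all illustrative assumptions, not the S1-MMAlign schema.

```python
from dataclasses import dataclass

@dataclass
class FigureRecord:
    # Field names are illustrative, not the actual S1-MMAlign schema.
    raw_caption: str
    abstract: str
    citation_contexts: list  # body-text sentences that cite this figure

def build_recaption_prompt(fig: FigureRecord, max_contexts: int = 3) -> str:
    """Assemble the textual context handed to the recaptioning MLLM
    alongside the figure image. The paper says context is synthesized
    from the abstract and figure citation contexts; the exact prompt
    wording here is a guess."""
    bullets = "\n".join(f"- {c}" for c in fig.citation_contexts[:max_contexts])
    return (
        "You are captioning a scientific figure.\n\n"
        f"Paper abstract:\n{fig.abstract}\n\n"
        f"Sentences citing this figure:\n{bullets}\n\n"
        f"Original caption: {fig.raw_caption}\n\n"
        "Write a comprehensive, self-contained caption for this figure."
    )
```

The image itself would be passed to the MLLM separately; only the text-assembly half is sketched here.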

If this is right

  • Multimodal large language models show improved performance in zero-shot scientific captioning after training on the enhanced data.
  • Models trained on S1-MMAlign perform better in multi-domain scientific reasoning tasks.
  • The dataset supports visual instruction tuning for scientific applications.
  • Scientific foundation models can be developed using this aligned image-text resource across physics, biology, and engineering.
  • Downstream scientific intelligence applications gain from the higher quality data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such datasets could accelerate AI integration in fields with visual data like materials science or astronomy if similar pipelines are applied.
  • Recaptioning might introduce biases if the MLLMs favor certain interpretations, affecting model fairness in science.
  • The public availability enables broad community use for custom scientific AI tools.

Load-bearing premise

Multimodal large language models can reliably synthesize accurate and unbiased scientific context from abstracts and citations without introducing hallucinations.

What would settle it

A controlled test in which human experts compare the enhanced captions against the originals and the actual figure content, measuring error and hallucination rates in the recaptions, would determine whether the quality improvement holds.
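A minimal sketch of how such a spot-check could be sampled, assuming each pair carries a discipline label; the per-discipline sample size and field layout are arbitrary placeholders, not figures from the paper.

```python
import random

def audit_sample(pairs, per_discipline=200, seed=0):
    """Draw a fixed-size, seeded random sample of pair IDs per discipline
    for expert review. `pairs` is an iterable of (discipline, pair_id)
    tuples; sizes are illustrative."""
    rng = random.Random(seed)
    by_disc = {}
    for disc, pair_id in pairs:
        by_disc.setdefault(disc, []).append(pair_id)
    return {
        disc: rng.sample(ids, min(per_discipline, len(ids)))
        for disc, ids in by_disc.items()
    }
```

Stratifying by discipline matters here because error modes likely differ between, say, microscopy images and physics heatmaps.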

Figures

Figures reproduced from arXiv: 2601.00264 by He Wang, Jie Jiang, Jing Liu, Longteng Guo, Pengkang Huo, Xuanxu Lin, Yichen Yuan.

Figure 1
Figure 1: Subject Distribution of S1-MMAlign. Physics (33%) and Computer Science (25%) constitute the dominant subsets, followed by Astronomy (13%), Biology (10%), and Mathematics (9%). The 'Others' category (10%) encompasses diverse fields such as Engineering and Earth Science. view at source ↗
Figure 2
Figure 2: Character Length Distribution Analysis. Comparative statistics reveal a significant shift in information density between the raw and enhanced corpora. Raw captions (orange) exhibit high volatility with a character count of 267±261 (mean ± std), reflecting the pervasive semantic sparsity and inconsistency inherent in original scientific metadata. In contrast, the semantically enhanced descriptions (blue) … view at source ↗
Figure 3
Figure 3: Overview of the S1-MMAlign Data Construction Pipeline. The workflow consists of four distinct phases: (1) Data Ingestion from diverse sources including arXiv (LaTeX) and web crawls (PDF); (2) Preprocessing Pipeline, featuring archive integrity checks, Regex-based parsing, EPS-to-PNG conversion, and strict quality filtering; (3) Core AI Processing, employing the Qwen-VL architecture on an H100 GPU cluster … view at source ↗
Figure 4
Figure 4: File Organization of S1-MMAlign. The repository structure adapts to data volume: (A) Yearly Archives for massive sources like arXiv (e.g., images_2007.tar.gz); (B) Multi-Part Archives for large sources like bioRxiv (e.g., images.tar.gz.partaa); and (C) Single Archives for smaller datasets. All subsets include a jsonl directory for metadata. view at source ↗
Figure 5
Figure 5: Empirical CDF of Text Quality (Pseudo-PPL). The plot illustrates the Cumulative Distribution Function of log10(pseudo-PPL) scores derived from SciBERT [18]. The blue curve (Enhanced Captions) demonstrates a pronounced leftward shift compared to the original captions, confirming a significant reduction in perplexity and superior alignment with scientific linguistic norms. view at source ↗
Figure 6
Figure 6: Distribution of CLIP Image-Text Consistency Scores. The histogram compares the alignment scores of original (orange) versus enhanced (blue) captions. The pronounced rightward shift and narrower spread of the blue distribution indicate that the enhancement strategy yields a dataset with higher semantic fidelity and greater consistency, providing a stronger supervision signal for multimodal representation learning. view at source ↗
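The two validation metrics behind Figures 5 and 6 reduce to simple arithmetic once the model outputs exist. The sketch below assumes per-token masked-LM log-probabilities (for SciBERT-style pseudo-perplexity) and image/text embeddings (for the CLIP score) have already been produced by the respective models; it shows only the final computation, not the paper's evaluation code.

```python
import math

def pseudo_ppl(token_logprobs):
    """Pseudo-perplexity: exp of the negative mean per-token log-probability,
    where each token is masked in turn and scored by a masked LM
    (SciBERT-style). Lower means more fluent under the LM."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def clip_consistency(img_emb, txt_emb):
    """Cosine similarity between a CLIP image embedding and text embedding,
    the image-text consistency score whose distributions Figure 6 compares."""
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norm_i = math.sqrt(sum(a * a for a in img_emb))
    norm_t = math.sqrt(sum(b * b for b in txt_emb))
    return dot / (norm_i * norm_t)
```

Note that both are proxy metrics: a caption can score well on fluency and embedding similarity while still misstating what the figure shows, which is exactly the gap the referee report flags.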
read the original abstract

Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that leverages advanced multimodal large language models to recaption images, by synthesizing comprehensive context from paper abstracts and the citation contexts of corresponding figures. Technical validation confirms that our enhancement pipeline markedly improves data quality via reduced SciBERT pseudo-perplexity and enhanced CLIP image-text alignment, while also significantly boosting multimodal large language models performance in zero-shot scientific captioning, multi-domain scientific reasoning, and visual instruction tuning. S1-MMAlign provides a pivotal foundational resource for cross-modal scientific understanding in the AI for Science era, supporting the development of scientific foundation models and a wide range of downstream scientific intelligence applications. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents S1-MMAlign, a large-scale dataset of over 15.5 million image-text pairs extracted from 2.5 million open-access scientific papers spanning physics, biology, engineering and other fields. It introduces an AI enhancement pipeline that uses multimodal LLMs to recaption figures by synthesizing context from paper abstracts and citation contexts, and reports that this pipeline reduces SciBERT pseudo-perplexity, improves CLIP alignment scores, and yields performance gains for MLLMs on zero-shot scientific captioning, multi-domain reasoning, and visual instruction tuning. The dataset is released publicly.

Significance. A high-quality, multi-disciplinary scientific figure-text dataset at this scale would be a valuable resource for training and evaluating scientific foundation models and multimodal systems. The public release and the reported downstream gains on captioning and reasoning tasks are positive contributions if the recaption quality can be substantiated beyond proxy metrics.

major comments (2)
  1. [Abstract / Technical Validation] Abstract and technical validation section: the central claim that the enhancement pipeline produces higher-quality pairs rests on proxy metrics (reduced SciBERT pseudo-perplexity and improved CLIP alignment) plus downstream task gains, yet no direct human expert review, factual-accuracy audit, or comparison of recaptions against the actual figure content is described. This leaves open the possibility that distributional improvements occur even when unsupported details or omissions are introduced.
  2. [Experimental Setup] Experimental setup description: details on data selection criteria, filtering steps, potential biases from the MLLM recaptioning process, and statistical significance testing of the reported performance improvements are absent, which undermines verifiability of the claimed gains.
minor comments (1)
  1. [Abstract] The abstract would benefit from explicit numerical values for the reported perplexity reduction, CLIP score improvement, and task-specific accuracy gains rather than qualitative statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, clarifying our approach and indicating revisions where appropriate to improve verifiability and transparency.

read point-by-point responses
  1. Referee: [Abstract / Technical Validation] Abstract and technical validation section: the central claim that the enhancement pipeline produces higher-quality pairs rests on proxy metrics (reduced SciBERT pseudo-perplexity and improved CLIP alignment) plus downstream task gains, yet no direct human expert review, factual-accuracy audit, or comparison of recaptions against the actual figure content is described. This leaves open the possibility that distributional improvements occur even when unsupported details or omissions are introduced.

    Authors: We acknowledge that the current manuscript relies exclusively on proxy metrics (SciBERT pseudo-perplexity and CLIP alignment) and downstream task improvements rather than direct human expert review or factual audits of recaptions against figure content. At the scale of 15.5 million pairs, exhaustive manual review is logistically prohibitive, and our validation follows common practice in large-scale dataset papers. We agree this leaves room for potential hallucinations or omissions and will add an explicit limitations paragraph discussing this in the revised manuscript, along with any feasible small-scale human spot-checks we can perform. revision: partial

  2. Referee: [Experimental Setup] Experimental setup description: details on data selection criteria, filtering steps, potential biases from the MLLM recaptioning process, and statistical significance testing of the reported performance improvements are absent, which undermines verifiability of the claimed gains.

    Authors: We agree that the original manuscript omitted key details on data selection, filtering, MLLM-induced biases, and statistical testing for conciseness. In the revision we will expand the Experimental Setup section to specify: paper and figure selection criteria from the open-access corpus, all filtering thresholds applied, a dedicated discussion of potential biases (including hallucination risks from the recaptioning MLLM), and statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the reported performance gains on captioning and reasoning tasks. revision: yes
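One of the proposed checks, a percentile bootstrap confidence interval on per-example score differences, can be sketched as follows; the resample count and interval level are conventional defaults, not values from the paper.

```python
import random

def bootstrap_ci(deltas, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-example score difference
    (enhanced minus raw). If the interval excludes 0, the reported gain
    is unlikely to be sampling noise."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A paired test is the right framing here because each example is scored under both caption variants, so per-example differences cancel example-level difficulty.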

Circularity Check

0 steps flagged

No circularity: empirical dataset curation with external validation metrics

full rationale

The paper describes construction of a large image-text dataset via an AI recaptioning pipeline and validates quality through proxy metrics (SciBERT pseudo-perplexity, CLIP alignment) plus downstream MLLM task gains. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The central pipeline output is evaluated against independent distributional and task-based benchmarks rather than reducing to its own inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the effectiveness of the MLLM-based recaptioning process, which assumes faithful synthesis of scientific meaning from limited context.

axioms (1)
  • domain assumption Multimodal large language models can accurately and reliably recaption scientific figures by synthesizing context from abstracts and citation sentences
    This assumption underpins the entire enhancement pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5554 in / 1267 out tokens · 34123 ms · 2026-05-16T18:13:51.842644+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

  1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
  2. Taylor, R. et al. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
  3. Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755 (Springer, 2014).
  4. Schuhmann, C. et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
  5. Hsu, T.-Y., Yang, C. et al. SciCap: Generating captions for scientific figures. arXiv preprint arXiv:2110.11624 (2021).
  6. Lin, Z., Yin, Y., Liu, L. & Wang, D. SciSciNet: A large-scale open data lake for the science of science research. Scientific Data 10, 315 (2023).
  7. Methani, N., Ganguly, P., Khapra, M. M. & Kumar, P. PlotQA: Reasoning over scientific plots. arXiv preprint arXiv:1909.00997 (2020).
  8. Sever, R. et al. bioRxiv: the preprint server for biology. bioRxiv (2019).
  9. Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. In NeurIPS (2023).
  10. Bai, J. et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2023).
  11. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763 (PMLR, 2021).
  12. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  13. OpenDataLab. MinerU: An open-source intelligent data extraction tool (2024).
  14. Team, Q. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025).
  15. Tschannen, M. et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025).
  16. Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention (2023).
  17. Low, Y. et al. Git is for data. In Proceedings of the 13th Annual Conference on Innovative Data Systems Research (CIDR) (Amsterdam, The Netherlands, 2023). Published under CC BY 4.0 license.
  18. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
  19. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding (2019).
  20. Li, J., Li, D., Xiong, C. & Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 12888–12900 (PMLR, 2022).
  21. Tarsi, T. et al. SciOL and MuLMS-Img: Introducing a large-scale multimodal scientific dataset and models for image-text tasks in the scientific domain. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4548–4559 (2024).