S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
Pith reviewed 2026-05-16 18:13 UTC · model grok-4.3
The pith
S1-MMAlign supplies 15.5 million enhanced image-text pairs from scientific papers to close the semantic gap for multimodal AI in science.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S1-MMAlign is a large-scale multi-disciplinary dataset of 15.5 million image-text pairs derived from 2.5 million papers, created through an AI semantic enhancement pipeline that recaptions figures using context from abstracts and citations to achieve better alignment for multimodal learning in science.
What carries the argument
The AI-ready semantic enhancement pipeline that leverages multimodal large language models to synthesize comprehensive image captions from paper abstracts and figure citation contexts.
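The review text does not give the exact prompt format, so the following is only a minimal sketch of what one recaptioning step in such a pipeline could look like; `FigureRecord`, `build_recaption_prompt`, and the `call_mllm` callback are hypothetical names, and the real pipeline may assemble context differently.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FigureRecord:
    image_path: str                # path to the extracted figure image
    raw_caption: str               # original (often sparse) caption from the paper
    abstract: str                  # abstract of the source paper
    citation_contexts: list[str]   # body sentences that cite this figure

def build_recaption_prompt(rec: FigureRecord) -> str:
    """Assemble the textual context an MLLM sees alongside the figure image."""
    contexts = "\n".join(f"- {c}" for c in rec.citation_contexts) or "- (none found)"
    return (
        "You are describing a scientific figure.\n\n"
        f"Paper abstract:\n{rec.abstract}\n\n"
        f"Original caption:\n{rec.raw_caption}\n\n"
        f"Sentences citing this figure:\n{contexts}\n\n"
        "Write a self-contained caption that describes only what is visible "
        "in the figure, grounded in the context above."
    )

def recaption(rec: FigureRecord, call_mllm: Callable[[str, str], str]) -> str:
    """call_mllm(image_path, prompt) is a stand-in for any multimodal LLM client."""
    return call_mllm(rec.image_path, build_recaption_prompt(rec))
```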
If this is right
- Multimodal large language models show improved performance in zero-shot scientific captioning after training on the enhanced data.
- Models trained on S1-MMAlign perform better in multi-domain scientific reasoning tasks.
- The dataset supports visual instruction tuning for scientific applications.
- Scientific foundation models can be developed using this aligned image-text resource across physics, biology, and engineering.
- Downstream scientific intelligence applications gain from the higher quality data.
Where Pith is reading between the lines
- Such datasets could accelerate AI integration in fields with visual data like materials science or astronomy if similar pipelines are applied.
- Recaptioning might introduce biases if the MLLMs favor certain interpretations, affecting model fairness in science.
- The public availability enables broad community use for custom scientific AI tools.
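On the last point, if the Hugging Face release follows the usual `datasets` layout, a first look could be as simple as the sketch below; the split name and the printed field names are assumptions to be checked against the dataset card.

```python
from itertools import islice
from datasets import load_dataset

# Streaming avoids downloading all ~15.5M pairs up front. The "train" split
# name is an assumption -- consult the dataset card for the actual schema.
ds = load_dataset("ScienceOne-AI/S1-MMAlign", split="train", streaming=True)

for example in islice(ds, 3):
    print(sorted(example.keys()))   # inspect the real field names before training
```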
Load-bearing premise
Multimodal large language models can reliably synthesize accurate and unbiased scientific context from abstracts and citations without introducing hallucinations.
What would settle it
A controlled test in which human experts compare the enhanced captions against the originals and the actual figure content, measuring error and hallucination rates in the recaptions, would determine whether the claimed quality improvement holds.
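A minimal sketch of the bookkeeping such an audit would need, assuming experts label a uniform random sample of recaptions as faithful or hallucinated; the sample size and the 23-of-500 error count are purely illustrative, not results from the paper.

```python
import math
import random

def wilson_ci(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial error rate."""
    if n == 0:
        return (0.0, 1.0)
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def draw_audit_sample(pair_ids: list[str], k: int = 500, seed: int = 0) -> list[str]:
    """Uniform random sample of recaptioned pairs for expert review."""
    rng = random.Random(seed)
    return rng.sample(pair_ids, k)

# Illustrative only: experts flag 23 of 500 sampled recaptions as containing
# unsupported (hallucinated) details.
low, high = wilson_ci(errors=23, n=500)
print(f"hallucination rate ~ {23/500:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```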
Original abstract
Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that leverages advanced multimodal large language models to recaption images, by synthesizing comprehensive context from paper abstracts and the citation contexts of corresponding figures. Technical validation confirms that our enhancement pipeline markedly improves data quality via reduced SciBERT pseudo-perplexity and enhanced CLIP image-text alignment, while also significantly boosting multimodal large language models performance in zero-shot scientific captioning, multi-domain scientific reasoning, and visual instruction tuning. S1-MMAlign provides a pivotal foundational resource for cross-modal scientific understanding in the AI for Science era, supporting the development of scientific foundation models and a wide range of downstream scientific intelligence applications. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.
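For orientation, the two proxy metrics the abstract relies on are standard and reproducible. The sketch below shows one common way to compute SciBERT pseudo-perplexity (mask each token in turn and score the original) and a CLIP image-text alignment score; the checkpoints `allenai/scibert_scivocab_uncased` and `openai/clip-vit-base-patch32` are assumptions, since the paper's exact evaluation setup is not specified here.

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForMaskedLM, CLIPModel, CLIPProcessor

# --- SciBERT pseudo-perplexity: mask each token in turn and score the original ---
tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
mlm = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased").eval()

@torch.no_grad()
def pseudo_perplexity(caption: str) -> float:
    ids = tok(caption, return_tensors="pt", truncation=True)["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[i]])
    return torch.exp(torch.stack(nlls).mean()).item()

# --- CLIP image-text alignment: cosine similarity of the two embeddings ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_alignment(image_path: str, caption: str) -> float:
    inputs = proc(text=[caption], images=Image.open(image_path),
                  return_tensors="pt", padding=True, truncation=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```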
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents S1-MMAlign, a large-scale dataset of over 15.5 million image-text pairs extracted from 2.5 million open-access scientific papers spanning physics, biology, engineering and other fields. It introduces an AI enhancement pipeline that uses multimodal LLMs to recaption figures by synthesizing context from paper abstracts and citation contexts, and reports that this pipeline reduces SciBERT pseudo-perplexity, improves CLIP alignment scores, and yields performance gains for MLLMs on zero-shot scientific captioning, multi-domain reasoning, and visual instruction tuning. The dataset is released publicly.
Significance. A high-quality, multi-disciplinary scientific figure-text dataset at this scale would be a valuable resource for training and evaluating scientific foundation models and multimodal systems. The public release and the reported downstream gains on captioning and reasoning tasks are positive contributions if the recaption quality can be substantiated beyond proxy metrics.
major comments (2)
- [Abstract / Technical Validation] The central claim that the enhancement pipeline produces higher-quality pairs rests on proxy metrics (reduced SciBERT pseudo-perplexity and improved CLIP alignment) plus downstream task gains; no direct human expert review, factual-accuracy audit, or comparison of recaptions against the actual figure content is described. This leaves open the possibility that distributional improvements occur even when unsupported details or omissions are introduced.
- [Experimental Setup] Details on data selection criteria, filtering steps, potential biases from the MLLM recaptioning process, and statistical significance testing of the reported performance improvements are absent, which undermines verifiability of the claimed gains.
minor comments (1)
- [Abstract] The abstract would benefit from explicit numerical values for the reported perplexity reduction, CLIP score improvement, and task-specific accuracy gains rather than qualitative statements.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, clarifying our approach and indicating revisions where appropriate to improve verifiability and transparency.
Point-by-point responses
-
Referee: [Abstract / Technical Validation] The central claim that the enhancement pipeline produces higher-quality pairs rests on proxy metrics (reduced SciBERT pseudo-perplexity and improved CLIP alignment) plus downstream task gains; no direct human expert review, factual-accuracy audit, or comparison of recaptions against the actual figure content is described. This leaves open the possibility that distributional improvements occur even when unsupported details or omissions are introduced.
Authors: We acknowledge that the current manuscript relies exclusively on proxy metrics (SciBERT pseudo-perplexity and CLIP alignment) and downstream task improvements rather than direct human expert review or factual audits of recaptions against figure content. At the scale of 15.5 million pairs, exhaustive manual review is logistically prohibitive, and our validation follows common practice in large-scale dataset papers. We agree this leaves room for potential hallucinations or omissions and will add an explicit limitations paragraph discussing this in the revised manuscript, along with any feasible small-scale human spot-checks we can perform. revision: partial
-
Referee: [Experimental Setup] Details on data selection criteria, filtering steps, potential biases from the MLLM recaptioning process, and statistical significance testing of the reported performance improvements are absent, which undermines verifiability of the claimed gains.
Authors: We agree that the original manuscript omitted key details on data selection, filtering, MLLM-induced biases, and statistical testing for conciseness. In the revision we will expand the Experimental Setup section to specify: paper and figure selection criteria from the open-access corpus, all filtering thresholds applied, a dedicated discussion of potential biases (including hallucination risks from the recaptioning MLLM), and statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) for the reported performance gains on captioning and reasoning tasks. revision: yes
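A minimal sketch of the bootstrap test mentioned in the response above, assuming per-example scores are available for captions with and without enhancement; the synthetic numbers are illustrative only.

```python
import numpy as np

def paired_bootstrap_ci(scores_enhanced: np.ndarray,
                        scores_baseline: np.ndarray,
                        n_boot: int = 10_000,
                        alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float, float]:
    """Percentile bootstrap CI for the mean per-example gain (enhanced - baseline)."""
    assert scores_enhanced.shape == scores_baseline.shape
    rng = np.random.default_rng(seed)
    diffs = scores_enhanced - scores_baseline
    n = len(diffs)
    boot_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

# Illustrative synthetic data: per-example alignment scores before/after enhancement.
rng = np.random.default_rng(1)
baseline = rng.normal(0.28, 0.05, size=1000)
enhanced = baseline + rng.normal(0.03, 0.04, size=1000)
gain, lo, hi = paired_bootstrap_ci(enhanced, baseline)
print(f"mean gain {gain:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```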
Circularity Check
No circularity: empirical dataset curation with external validation metrics
full rationale
The paper describes construction of a large image-text dataset via an AI recaptioning pipeline and validates quality through proxy metrics (SciBERT pseudo-perplexity, CLIP alignment) plus downstream MLLM task gains. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The central pipeline output is evaluated against independent distributional and task-based benchmarks rather than reducing to its own inputs by construction, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multimodal large language models can accurately and reliably recaption scientific figures by synthesizing context from abstracts and citation sentences.
Reference graph
Works this paper leans on
- [1] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- [2] Taylor, R. et al. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
- [3] Lin, T.-Y. et al. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 740–755 (Springer, 2014).
- [4] Schuhmann, C. et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
- [5]
- [6]
- [7]
- [8] Sever, R. et al. bioRxiv: the preprint server for biology. bioRxiv (2019).
- [9]
- [10] Bai, J. et al. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2023).
- [11] Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763 (PMLR, 2021).
- [12] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
- [13] OpenDataLab. MinerU: An open-source intelligent data extraction tool (2024).
- [14] Qwen Team. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025).
- [15] Tschannen, M. et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025).
- [16] Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention (2023).
- [17] Low, Y. et al. Git is for data. In Proceedings of the 13th Annual Conference on Innovative Data Systems Research (CIDR) (Amsterdam, The Netherlands, 2023). Published under CC BY 4.0 license.
- [18] Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
- [19] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding (2019).
- [20]
- [21] Tarsi, T. et al. SciOL and MuLMS-Img: Introducing a large-scale multimodal scientific dataset and models for image-text tasks in the scientific domain. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 4548–4559 (2024).