pith. machine review for the scientific record.

arxiv: 2604.27604 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.CE

Recognition: unknown

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:49 UTC · model grok-4.3

classification 💻 cs.CV · cs.CE
keywords SPUR benchmark · scientific experimental images · multimodal large language models · AI for science · image perception and reasoning · cross-panel understanding · multimodal chain-of-thought

The pith

Current multimodal AI models fall significantly short of expert-level performance when interpreting scientific experimental images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPUR, a benchmark of 4,264 question-answer pairs drawn from 1,084 expert-curated scientific images. It tests three capabilities: fine-grained visual perception across panel types, understanding of relations among an average of 14.3 panels per image, and qualitative and quantitative reasoning that mirrors how experts draw conclusions from experimental evidence. Evaluation of 20 multimodal large language models and four chain-of-thought variants shows consistent underperformance relative to expert standards. This gap limits how much AI can currently contribute to analyzing visual data in laboratory research. The work therefore identifies a concrete obstacle to applying AI systems to scientific discovery.

Core claim

SPUR evaluates multimodal large language models on scientific experimental images through panel-level fine-grained perception on six panel types across numerical, morphological, and localization dimensions; cross-panel relation understanding on complex multi-panel figures; and expert-level reasoning across five experimental paradigms. Testing reveals that 20 MLLMs and four MCoT methods fall significantly short of the requirements for expert-level scientific image interpretation.
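
For orientation, the seven core tasks named in the figure captions reproduced further down group under the three stages roughly as follows; a minimal Python sketch, with the nesting inferred from the abstract and captions rather than taken from the paper's released data, so exact names may differ:

    # Sketch of SPUR's task taxonomy as inferred from the abstract and figure
    # captions; the grouping and task names are reconstructed, not copied from
    # the paper's internal naming or released code.
    SPUR_TAXONOMY = {
        "Panel-Level Fine-Grained Perception": [
            "Numerical Perception (NP)",
            "Morphological Perception (MP)",
            "Information Localization (IL)",
        ],
        "Cross-Panel Relation Understanding": [
            "Trend Analysis (TA)",
            "Heterogeneous Integration (HI)",
        ],
        "Expert-Level Reasoning": [
            "Qualitative Reasoning",
            "Quantitative Reasoning",
        ],
    }
    assert sum(len(tasks) for tasks in SPUR_TAXONOMY.values()) == 7  # seven core tasks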

What carries the argument

The SPUR benchmark, which supplies expert-curated multi-panel scientific images and QA pairs to measure perception, cross-panel understanding, and reasoning in multimodal models.

If this is right

  • AI systems must advance substantially in visual perception and multi-panel reasoning before they can support scientific image analysis at expert levels.
  • Multimodal chain-of-thought techniques need targeted improvements to handle the cross-panel relations and inference steps typical of experimental data.
  • The AI-for-Science field encounters a clear bottleneck in processing visual experimental evidence.
  • Model development should prioritize the specific weaknesses exposed in numerical and morphological perception as well as complex relational reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • SPUR can serve as an ongoing yardstick to track whether new models close the performance gap over time.
  • The benchmark's emphasis on real experimental images suggests that training data for scientific AI should incorporate more authentic multi-panel figures.
  • Performance gains on SPUR may translate to improved assistance in other visual-heavy scientific domains such as materials characterization or biological microscopy.

Load-bearing premise

The expert-curated images, their panel classifications, and the generated QA pairs accurately and without bias represent the full range of expert-level perception, understanding, and reasoning required for scientific experimental images.

What would settle it

Demonstrating that leading MLLMs reach expert-comparable accuracy on the expert-level reasoning portion of SPUR would show the claimed shortfall has been closed.

Figures

Figures reproduced from arXiv: 2604.27604 by Haihong E, Haiyang Sun, Haocheng Gao, Haolin Tian, Jiacheng Liu, Jintong Chen, Junpeng Ding, Mengyuan Ji, Peizhi Zhao, Pengqi Sun, Rongjin Li, Ruomeng Jiang, Siying Lin, Yang Liu, Yang Xu, Yichen Liu, Yuanze Li, Zhongjun Yang, Zichen Tang, Zijie Xi.

Figure 1: Qualitative Reasoning (Qual.) on a complex …
Figure 2: Representative QA pairs across SPUR’s three cognitive stages.
Figure 3: Overview of SPUR. (Left) The hierarchical task taxonomy comprising seven core tasks. (Right) The three-stage QA pair curation pipeline.
Figure 4: Distribution of (a) Panel Categories and (b) Disciplinary Categories in SPUR.
Figure 5: Fine-grained results based on (a) Staining …
Figure 6: Error analysis of a Quantitative Reasoning …
Figure 7: User interface of the academic paper selection platform utilized during the paper-level curation step.
Figure 8: Visualization of panel category detections in SPUR.
Figure 9: Distribution of (a) Staining Image Categories and (b) Experimental Paradigms in SPUR.
Figure 10: Example of a Numerical Perception (NP) task in the Panel-Level Fine-Grained Perception stage.
Figure 11: Example of a Morphological Perception (MP) task in the Panel-Level Fine-Grained Perception stage.
Figure 12: Example of an Information Localization (IL) task in the Panel-Level Fine-Grained Perception stage.
Figure 13: Example of a Trend Analysis (TA) task in the Cross-Panel Relation Understanding stage.
Figure 14: Example of a Heterogeneous Integration (HI) task in the Cross-Panel Relation Understanding stage.
Figure 15: Example of a Qualitative Reasoning task in the Expert-Level Reasoning stage.
Figure 16: Example of a Quantitative Reasoning task in the Expert-Level Reasoning stage.
original abstract

We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.
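
The abstract's evaluation protocol is standard multiple-choice accuracy measurement over the benchmark items. A minimal sketch of how such scoring typically works, assuming a hypothetical JSON schema and model-call stub that are not taken from the paper or its evaluation harness:

    # Hypothetical sketch of multiple-choice benchmark scoring. Field names
    # ("stage", "image", "question", "options", "answer") and query_model()
    # are illustrative assumptions, not SPUR's actual schema or harness.
    import json
    from collections import defaultdict

    def query_model(image_path: str, question: str, options: dict) -> str:
        # Placeholder for an MLLM call; should return an option letter, e.g. "A".
        raise NotImplementedError

    def evaluate(benchmark_path: str) -> dict:
        correct, total = defaultdict(int), defaultdict(int)
        with open(benchmark_path) as f:
            items = json.load(f)
        for item in items:
            pred = query_model(item["image"], item["question"], item["options"])
            stage = item["stage"]  # e.g. "perception", "understanding", "reasoning"
            total[stage] += 1
            correct[stage] += int(pred == item["answer"])
        # Per-stage accuracy, the quantity compared against expert-level requirements.
        return {stage: correct[stage] / total[stage] for stage in total}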

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the SPUR benchmark for evaluating multimodal large language models (MLLMs) on scientific experimental images. It consists of 4,264 QA pairs from 1,084 expert-curated images, focusing on panel-level fine-grained perception (numerical, morphological, localization), cross-panel relation understanding in multi-panel images (average 14.3 panels), and expert-level reasoning across five experimental paradigms. The evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods concludes that current models significantly underperform expert-level requirements, pointing to a bottleneck in AI for Science.

Significance. Should the benchmark prove robust, this work is significant as it provides a large-scale, specialized testbed for MLLM capabilities in a critical domain for scientific discovery. By targeting specific aspects like fine-grained perception and complex reasoning over experimental data, it can help identify and address gaps in AI systems intended for scientific applications. The broad model evaluation offers a useful snapshot of the state of the art.

major comments (3)
  1. [Abstract and Benchmark Construction] The central claim that models fall significantly short of expert-level requirements depends on the 4,264 QA pairs being accurate, unbiased ground truth. The abstract describes the pairs as 'expert-curated' and 'generated' but supplies no information on question validation, inter-annotator agreement, image selection criteria, or controls for curator bias. This is load-bearing because all reported performance gaps and the AI4S bottleneck conclusion inherit any flaws in QA construction.
  2. [Evaluation and Results] No human expert performance baseline or independent difficulty rating on the final QA pairs is reported. Without this, it is impossible to quantify how far the 20 MLLMs and four MCoT methods actually fall short of 'expert-level' or to rule out that low scores reflect benchmark artifacts rather than general limitations.
  3. [Cross-Panel Relation Understanding] The cross-panel understanding component relies on images with an average of 14.3 panels and five reasoning paradigms, yet the manuscript provides no ablation or analysis showing how performance varies with panel count or specific relation types. This weakens the ability to localize the precise bottlenecks claimed.
minor comments (2)
  1. [Experiments] Clarify the exact metrics (accuracy, exact match, etc.) used for each sub-task and how 'expert-level' is operationalized in the results tables.
  2. [Introduction] Add a short related-work subsection contrasting SPUR with existing scientific VQA or chart-understanding benchmarks to better highlight its unique contributions.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below, providing clarifications and indicating the specific revisions incorporated into the updated version to improve transparency and analytical depth.

point-by-point responses
  1. Referee: [Abstract and Benchmark Construction] The central claim that models fall significantly short of expert-level requirements depends on the 4,264 QA pairs being accurate, unbiased ground truth. The abstract describes the pairs as 'expert-curated' and 'generated' but supplies no information on question validation, inter-annotator agreement, image selection criteria, or controls for curator bias. This is load-bearing because all reported performance gaps and the AI4S bottleneck conclusion inherit any flaws in QA construction.

    Authors: We agree that explicit documentation of the QA construction pipeline is necessary to support the benchmark's validity and our conclusions. While the original manuscript summarized the expert curation process in Section 3, we acknowledge that it did not provide sufficient detail on validation, agreement metrics, selection criteria, or bias controls. In the revised manuscript, we have substantially expanded Section 3 (now including subsections 3.1–3.4) to describe: the image selection criteria (domain coverage, complexity thresholds, and exclusion rules), the multi-expert question generation and validation workflow (initial drafting followed by independent review rounds), inter-annotator agreement results, and bias mitigation steps (including blinded reviews and curator diversity). These additions directly address the load-bearing nature of the ground truth and allow readers to assess its robustness. revision: yes

  2. Referee: [Evaluation and Results] No human expert performance baseline or independent difficulty rating on the final QA pairs is reported. Without this, it is impossible to quantify how far the 20 MLLMs and four MCoT methods actually fall short of 'expert-level' or to rule out that low scores reflect benchmark artifacts rather than general limitations.

    Authors: The referee correctly identifies the value of a human baseline for calibrating the reported performance gaps. The original submission focused on model evaluations without including human results, which limits direct comparison to expert-level performance. In the revised manuscript, we have added a new subsection (Section 4.4) reporting human expert performance on a stratified subset of 300 QA pairs (100 per task type), collected from five domain experts. We also include independent difficulty ratings assigned by the experts to all QA pairs. These additions allow quantification of the gap to expert performance and help rule out benchmark artifacts as the primary cause of low model scores. revision: yes

  3. Referee: [Cross-Panel Relation Understanding] The cross-panel understanding component relies on images with an average of 14.3 panels and five reasoning paradigms, yet the manuscript provides no ablation or analysis showing how performance varies with panel count or specific relation types. This weakens the ability to localize the precise bottlenecks claimed.

    Authors: We concur that additional analysis would strengthen the localization of bottlenecks in cross-panel understanding. The original manuscript reported aggregate results for this component but did not include breakdowns by panel count or relation type. In the revised manuscript, we have added an ablation study in Section 5.2, including performance stratification by panel-number bins and by relation category (e.g., sequential, comparative, causal). The new results and accompanying figure demonstrate how accuracy varies with these factors, providing clearer evidence for the claimed bottlenecks in handling complex multi-panel scientific images. revision: yes
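
A minimal sketch of the panel-count stratification the simulated rebuttal describes, assuming per-item results in a pandas DataFrame; the column names and bin edges are illustrative assumptions, not the paper's reported analysis:

    # Hypothetical ablation sketch: accuracy binned by panel count.
    import pandas as pd

    def accuracy_by_panel_count(results: pd.DataFrame) -> pd.Series:
        # results holds one row per QA item, with the number of panels in the
        # source image ("n_panels") and a 0/1 correctness flag ("correct").
        bins = [0, 5, 10, 15, 20, float("inf")]
        labels = ["1-5", "6-10", "11-15", "16-20", ">20"]
        binned = pd.cut(results["n_panels"], bins=bins, labels=labels)
        return results.groupby(binned, observed=True)["correct"].mean()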

Circularity Check

0 steps flagged

No circularity: benchmark construction and model evaluations are independent empirical measurements

full rationale

The paper introduces a new dataset of expert-curated images and generated QA pairs, then reports empirical performance of external MLLMs on those fixed test items. No equations, fitted parameters, or self-citations are used to derive the central claim; the reported shortfall of models relative to expert-level requirements is a direct comparison against the newly constructed ground truth rather than a quantity defined in terms of the paper's own inputs or prior self-referential results. The construction steps (image curation, QA generation, panel classification) are presented as external human processes whose validity is assumed rather than mathematically derived from the evaluation outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that models fall short of expert level rests on the validity of the curated QA pairs as faithful proxies for expert perception and reasoning; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Expert-curated images and QA pairs accurately capture the requirements for expert-level scientific image interpretation across the defined panel types and paradigms.
    Invoked to support the claim that model performance indicates a bottleneck in AI4S.

pith-pipeline@v0.9.0 · 5563 in / 1395 out tokens · 70873 ms · 2026-05-07T06:49:23.383646+00:00 · methodology

