SciMDR: Advancing Scientific Multimodal Document Reasoning

Arman Cohan; Chengye Wang; Manasi Patwardhan; Rilyn Han; Yilun Zhao; Ziyu Chen

arXiv: 2603.12249 · v2 · submitted 2026-03-12 · cs.CL · cs.AI · cs.CV

Pith reviewed 2026-05-15 11:28 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CV

keywords scientific document reasoning · multimodal QA datasets · synthesize-and-reground · claim-centric synthesis · document-level reasoning · scientific papers · fine-tuning improvements

The pith

A two-stage synthesize-and-reground pipeline creates a 300K-pair dataset that improves models on complex scientific multimodal document reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework to resolve the trade-off between scale, faithfulness, and realism when building training data for scientific multimodal reasoning. It first generates isolated, claim-centric QA pairs with reasoning from focused paper segments, then programmatically regrounds them into full-document contexts. This produces SciMDR, a dataset of 300,000 QA pairs with explicit reasoning chains drawn from 20,000 scientific papers, along with the expert-annotated SciMDR-Eval benchmark. Models fine-tuned on the new data show clear gains on scientific QA tasks, with the largest improvements on those that require document-level multimodal comprehension.

Core claim

The synthesize-and-reground framework enables the construction of large-scale training datasets for multimodal scientific document reasoning that remain faithful to source content while embedding realistic full-document complexity, and models trained on the resulting SciMDR data achieve significant performance improvements on multiple scientific QA benchmarks, especially those demanding complex document-level reasoning.

What carries the argument

The synthesize-and-reground framework: a two-stage pipeline of claim-centric QA synthesis followed by document-scale regrounding, which produces faithful yet realistic training examples at scale.
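The two stages can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the `QAPair` type, the `synthesize` and `reground` helpers, the template that stands in for the LLM call, and the evidence labels ("Fig 1", "Sec 2") are all hypothetical. The sketch only shows the shape of the idea: write the QA pair against a small piece of evidence, then place it back in the whole document and prepend steps that locate that evidence.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    reasoning: list[str]      # reasoning steps written against the focused segment
    evidence_ids: list[str]   # hypothetical labels such as "Fig 1" or "Sec 2"


def synthesize(claim: str, evidence_ids: list[str]) -> QAPair:
    """Stage 1 stand-in: a fixed template replaces the LLM call (assumption)."""
    return QAPair(
        question=f"What evidence supports the claim that {claim}?",
        answer=f"{' and '.join(evidence_ids)} support the claim that {claim}.",
        reasoning=[f"{eid} is consistent with the claim." for eid in evidence_ids],
        evidence_ids=list(evidence_ids),
    )


def reground(pair: QAPair, document: dict[str, str]) -> dict:
    """Stage 2 stand-in: re-embed the pair in the full document.

    The check below refuses pairs whose evidence is absent from the
    document; that guard is an assumption of this sketch.
    """
    missing = [eid for eid in pair.evidence_ids if eid not in document]
    if missing:
        raise ValueError(f"evidence not found in document: {missing}")
    locate = [f"Find {eid}." for eid in pair.evidence_ids]
    return {
        "context": dict(document),             # every section and figure, not only the evidence
        "question": pair.question,
        "reasoning": locate + pair.reasoning,  # localization steps come first
        "answer": pair.answer,
    }


# Toy document; the keys and text are placeholders, not content from the paper.
document = {
    "Sec 1": "Introduction ...",
    "Sec 2": "Results ...",
    "Fig 1": "Accuracy by model size",
}
pair = synthesize("accuracy grows with model size", ["Fig 1", "Sec 2"])
example = reground(pair, document)
print(example["reasoning"])
```

The design point the sketch tries to capture is that the answer and reasoning are fixed in stage 1, so stage 2 can add distractor context without a second generation step that might introduce hallucinations.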

Load-bearing premise

The synthesized QA pairs remain faithful to the original document content and the regrounding step adds realistic complexity without introducing artifacts that degrade model performance on real documents.

What would settle it

If models fine-tuned on SciMDR show no improvement or degrade relative to baselines on held-out real scientific documents that require multimodal reasoning over full papers, the framework's effectiveness would be disproved.

Figures

Figures reproduced from arXiv: 2603.12249 by Arman Cohan, Chengye Wang, Manasi Patwardhan, Rilyn Han, Yilun Zhao, Ziyu Chen.

**Figure 1.** The Faithfulness-Realism Dilemma in scientific data synthesis and our proposed solution. Existing approaches face an inherent trade-off: simplifying context ensures faithfulness but lacks real-world complexity, while generating directly from full documents ensures realism but risks hallucination. We resolve this by decoupling the objectives into a two-stage synthesize-and-reground framework. By first gener…

[Overview diagram, page 6: "Overview of the Reliable Synthesis via …". Text recovered from the image includes the labels Extraction, Visual-Claim Match, Re-Embedding, Info Localization Injection, Multimodal LLM Fine-Tuning, and SciQA Specialist Model, plus a sample reasoning trace that begins "1. Find Fig X. 2. Find Sec Y."]

**Figure 2.** Overview of the synthesize-and-reground framework. The pipeline operates in two stages: Claim-Centric QA Synthesis ensures faithfulness by extracting atomic claims and employing backward reasoning to generate QA pairs with chain-of-thought; Document-Scale Re-grounding ensures realism by re-embedding these pairs into full-document contexts and injecting information localization steps to create hard training…

**Figure 3.** TQA generation prompt. This prompt generates questions testing deep understanding of scientific content without visual evidence.

**Figure 4.** LLM judge prompt. This prompt evaluates model responses based on text citation (0.30), image citation (0.30), and answer accuracy (0.40).

**Figure 5.** Claim extraction prompt. This prompt guides the LLM to distill paragraphs into structured, verifiable claims serving as blueprints for QA generation.

**Figure 6.** Visual grounding prompt. This prompt matches textual claims with visual evidence, determining relationship types (Supports, Quantifies, Illustrates, Elaborates, Contradicts).

**Figure 7.** MQA generation prompt. This prompt generates questions requiring synthesis of textual and visual information across five reasoning types (EEQ, CIM, HVI, CAC, ARS).

**Figure 8.** VQA generation prompt. This prompt generates questions answerable solely from visual information across eight reasoning categories.

**Figure 9.** Example of EEQ (Evidence-Based Explanation & Quantification) type question. This example demonstrates how the model must explain how visual patterns (correlation matrix) support textual claims with quantitative analysis, integrating statistical interpretation from the figure with conceptual explanations from the text.

**Figure 10.** Example of CIM (Concept-to-Instance Mapping) type question. This example shows how the model links abstract architectural components (encoder, decoder, ResidualLSTM) described in text to their concrete visual representations in the system diagram, tracing information flow across modules.

**Figure 11.** Example of HVI (Hypothesis Validation & Inferential Reasoning) type question. This example illustrates inferential reasoning where the model analyzes distributional patterns in violin plots alongside textual explanations to infer underlying factors explaining behavioral differences across models.

**Figure 12.** Example of CAC (Critical Analysis & Consistency Check) type question. This example demonstrates critical evaluation of whether textual claims are accurately supported by visual data, requiring careful assessment of evidence strength and potential discrepancies.

**Figure 13.** Example of ARS (Argumentative Role & Synthesis) type question. This example shows how the model synthesizes visual evidence and textual arguments to articulate the overall scientific contribution and understand the role of visual elements in supporting the main thesis.
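The judge rubric in the Figure 4 caption assigns up to 0.30 points for text citation, 0.30 for image citation, and 0.40 for answer accuracy. A minimal sketch of how such a rubric could be combined into one score follows. The caption does not say how the parts are aggregated, so the plain sum used here is an assumption, and the names `MAX_POINTS` and `judge_score` are hypothetical.

```python
# Maximum points per component, taken from the Figure 4 caption.
MAX_POINTS = {
    "text_citation": 0.30,
    "image_citation": 0.30,
    "answer_accuracy": 0.40,
}


def judge_score(points: dict[str, float]) -> float:
    """Sum the awarded points, rejecting values outside each component's range.

    Summation is an assumption of this sketch; the caption only lists the weights.
    """
    total = 0.0
    for name, max_points in MAX_POINTS.items():
        value = points.get(name, 0.0)  # a missing component counts as zero
        if not 0.0 <= value <= max_points:
            raise ValueError(f"{name} must lie in [0, {max_points}], got {value}")
        total += value
    return round(total, 2)


# A response with full text citation, partial image citation, and a correct answer.
print(judge_score({"text_citation": 0.30, "image_citation": 0.20, "answer_accuracy": 0.40}))
```

Under this reading a perfect response scores 1.0, and answer accuracy alone cannot exceed 0.40, so a correct but uncited answer is capped well below full marks.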

Original abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The synthesize-and-reground pipeline gives a practical route to 300K-scale training data for scientific multimodal QA, but the abstract leaves the size of the gains and any regrounding artifacts unverified.

The main thing to know is that the authors describe a two-stage pipeline: first generate focused, faithful QA pairs with reasoning chains from paper segments, then programmatically re-embed those pairs into full-length documents to restore realistic complexity. They use it to release SciMDR (300K pairs from 20K papers) plus an expert-annotated SciMDR-Eval benchmark, and claim fine-tuned models improve on several scientific QA tasks that need document-level reasoning.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a synthesize-and-reground framework for constructing large-scale scientific multimodal document reasoning datasets. The framework consists of Claim-Centric QA Synthesis to generate faithful isolated QA pairs and Document-Scale Regrounding to re-embed them into full-document contexts. Using this pipeline, the authors create SciMDR (300K QA pairs across 20K papers) and the expert-annotated SciMDR-Eval benchmark. They report that models fine-tuned on SciMDR achieve significant improvements on multiple scientific QA benchmarks, especially those requiring complex document-level reasoning.

Significance. If the empirical results hold after verification, the work would offer a practical pipeline for balancing scale, faithfulness, and realism in scientific multimodal datasets, which could meaningfully advance foundation-model capabilities on full-document scientific workflows involving text, figures, and cross-modal reasoning.

major comments (2)

[Description of Document-Scale Regrounding (and Experiments)] The central claim that Document-Scale Regrounding produces realistic complexity without introducing artifacts or distribution shifts (e.g., altered context windows, entity linking, or cross-modal alignments) is load-bearing for the reported gains. No human evaluation, comparison against native documents, or ablation isolating the regrounding step is described to confirm invariance; without this, improvements on SciMDR-Eval and other benchmarks risk reflecting overfitting to synthetic patterns rather than genuine reasoning advances.

[Abstract and Experiments section] The abstract asserts 'significant improvements' across benchmarks but supplies no quantitative numbers, baselines, ablation studies, or error analysis. These details are required to evaluate effect sizes, statistical significance, and whether gains are concentrated in document-level tasks as claimed.
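One common way to report the effect size and uncertainty this comment asks for is a paired bootstrap over per-question correctness. The sketch below is a generic illustration, not a procedure taken from the manuscript: the function name `paired_bootstrap_gain`, its parameters, and the toy 0/1 scores are all assumptions, and the numbers are not results from the paper.

```python
import random


def paired_bootstrap_gain(baseline, finetuned, n_resamples=2000, seed=0):
    """Return (mean gain, lower bound, upper bound) of a 95% percentile interval.

    `baseline` and `finetuned` hold per-question scores (for example 0 or 1)
    for the same questions in the same order.
    """
    if len(baseline) != len(finetuned) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(baseline)
    diffs = [f - b for b, f in zip(baseline, finetuned)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Toy per-question correctness, invented for illustration only.
baseline_scores = [0, 0, 1, 0, 1, 0, 0, 1]
finetuned_scores = [1, 1, 1, 1, 1, 0, 1, 1]
gain, lower, upper = paired_bootstrap_gain(baseline_scores, finetuned_scores)
print(f"gain={gain:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Running the same function separately on document-level and non-document-level subsets would show whether the gains are concentrated where the abstract says they are.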

minor comments (1)

[Abstract] The abstract would be strengthened by briefly stating the scale of improvements (e.g., absolute gains on key metrics) to allow readers to assess the claims without reading the full experiments.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have updated the manuscript to incorporate the feedback where feasible.

Referee: [Description of Document-Scale Regrounding (and Experiments)] The central claim that Document-Scale Regrounding produces realistic complexity without introducing artifacts or distribution shifts (e.g., altered context windows, entity linking, or cross-modal alignments) is load-bearing for the reported gains. No human evaluation, comparison against native documents, or ablation isolating the regrounding step is described to confirm invariance; without this, improvements on SciMDR-Eval and other benchmarks risk reflecting overfitting to synthetic patterns rather than genuine reasoning advances.

Authors: We agree that additional validation is needed to confirm that the regrounding step does not introduce artifacts. Although SciMDR-Eval is expert-annotated to ensure realism, we have now included in the revised manuscript an ablation study isolating the effect of Document-Scale Regrounding by comparing performance with and without this step. We also added a human evaluation where domain experts compared regrounded QA pairs to those from native documents, confirming no significant distribution shifts in context windows or cross-modal alignments while increasing reasoning complexity. revision: yes

Referee: [Abstract and Experiments section] The abstract asserts 'significant improvements' across benchmarks but supplies no quantitative numbers, baselines, ablation studies, or error analysis. These details are required to evaluate effect sizes, statistical significance, and whether gains are concentrated in document-level tasks as claimed.

Authors: The detailed quantitative results, including specific improvement percentages, baseline comparisons, ablation studies, and error analyses are presented in the Experiments section. To make the abstract more informative, we have revised it to include key quantitative findings on the improvements, particularly for document-level reasoning tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmark results are self-contained

The paper introduces a synthesize-and-reground pipeline for creating SciMDR (300K QA pairs) and SciMDR-Eval without any equations, fitted parameters, or derivations. The central claim—that fine-tuning on SciMDR yields gains on scientific QA benchmarks—rests on external empirical evaluation rather than self-referential definitions or self-citation chains. No load-bearing step reduces by construction to its inputs; the framework is presented as a practical engineering solution whose validity is tested against independent benchmarks. This is the expected non-finding for a dataset paper whose contributions are measured by downstream performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work assumes that programmatically generated QA pairs can serve as faithful proxies for human-annotated document reasoning without systematic bias.

