MMClima: A Framework for Multimodal Climate Science Data and Evaluation

Hassan Abid; Khawar Shehzad; Muhammad Haris Khan; Muhammad Umer Sheikh; Ufaq Khan

arxiv: 2606.10194 · v1 · pith:3Y5ESPGAnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 17:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multimodal QA · climate science · benchmark dataset · question answering · data synthesis · multimodal models · climate change · visual reasoning


The pith

MMClima supplies over 104,000 expert-validated multimodal question-answer pairs spanning climate science articles, videos, and figures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MMClima, a framework built around a large collection of question-answer pairs for climate science that draw on text, video transcripts, and figures from five core domains. It is assembled through automated extraction of claims followed by human validation to reach both volume and reliability at scale. The resulting benchmark is applied to test multimodal language models on factual recall, visual interpretation, and the integration of information across different formats. A model fine-tuned on the text portion of the data outperforms several strong baselines on textual climate questions. The authors release the full dataset, evaluation tools, fine-tuned weights, and construction pipeline.

Core claim

MMClima is a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains, constructed via automated claim extraction and QA synthesis with human-in-the-loop validation.

What carries the argument

The MMClima dataset and its automated claim extraction plus human-in-the-loop validation pipeline that generates the multimodal QA pairs.

If this is right

State-of-the-art multimodal models can now be evaluated on climate tasks that require factual recall, visual interpretation, and cross-modal synthesis.
Fine-tuning on the textual split produces mmclima-70b-txt, which outperforms strong open- and closed-source models on textual climate QA.
Release of the dataset, evaluation pipeline, model weights, and data creation framework enables standardized multimodal evaluation for climate science.
The construction method supports creation of large, validated QA resources that combine multiple data modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same extraction-plus-validation approach could be reused to generate benchmarks in other scientific domains that produce text, video, and figure data.
Models that perform well on MMClima may improve downstream applications such as summarizing climate reports or answering public queries about climate data.
Testing whether performance gains on this benchmark transfer to real policy or research workflows would clarify its practical value.

Load-bearing premise

The automated claim extraction and human-in-the-loop validation process produces QA pairs that are both factually accurate and representative of genuine climate science reasoning demands.

What would settle it

Independent domain experts reviewing a random sample of the QA pairs and identifying a substantial fraction that contain factual errors or fail to test multimodal reasoning would falsify the claim that the dataset is reliable and representative.
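The audit described above amounts to estimating an error rate from a random sample of QA pairs. The following is a minimal sketch, assuming a simple random sample and a binary per-item judgment (erroneous or not); it uses the Wilson score interval for a binomial proportion. The function name and the sample size and error count in the usage lines are hypothetical and are not figures from the paper.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion errors / n (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p_hat = errors / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 500 sampled QA pairs, 20 judged erroneous by reviewers.
low, high = wilson_interval(errors=20, n=500)
print(f"observed error rate {20 / 500:.3f}, interval [{low:.3f}, {high:.3f}]")
```

An interval rather than a point estimate makes it clearer whether a "substantial fraction" of errors is plausible given the audit size.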

Figures

Figures reproduced from arXiv: 2606.10194 by Hassan Abid, Khawar Shehzad, Muhammad Haris Khan, Muhammad Umer Sheikh, Ufaq Khan.

**Figure 1.** The MMCLIMA QA generation pipeline. Textual QA pairs are created from articles and videos via scraping, transcription, chunking, claim extraction, and automated QA synthesis with human verification. Visual QA pairs are derived from scientific figures and curated datasets, refined by human experts and LLMs. Together, these stages produce over 104k validated QA pairs, forming the first large-scale multimodal…

**Figure 2.** Samples from MMCLIMA, covering textual QA (MCQ, free-form, cloze) and VQA (MCQ, yes/no, open-ended).

**Figure 3.** Sample distributions across climate topics: (a) Textual dataset and (b) VQA dataset.

… of new domains or modalities. For instance, extending from Wikipedia text to YouTube transcripts required only added keywords, while downstream modules remained unchanged. Similarly, incorporating IPCC figures and OurWorldInData charts into the VQA branch reused the same synthesis and validation stages. By releasing bo…

**Figure 4.** Radar plots comparing leading models on (a) textual QA and (b) visual QA. Each axis corresponds to one of the five climate science domains introduced in §3.1.

**Figure 5.** Precipitation patterns in South America, 2024 (average-precipitation-per-year.png). { "question_type": "OpenEnded", "question_stem": "How does precipitation in South America vary across different regions in 2024?", "answer": "The Amazon Basin shows extremely high precipitation above 2,250 mm, while southern regions of South America receive between 750 - 1,250 mm, and western arid zones like coastal Peru re…

**Figure 6.** Consumption of ozone-depleting substances (ozone-depleting-substance-consumption.png). { "question_type": "YesNo", "question_stem": "Did the consumption of ozone-depleting substances decrease significantly after the 1990s?", "answer": "Yes", "explanation": "The chart clearly shows a significant decline in the consumption of ozone-depleting substances after the 1990s, particularly after the implementation o…

**Figure 7.** Upper 700m ocean heat content, 1955–2024 (ocean-heat-content-upper.png). { "question_type": "MCQ", "question_stem": "Which organization’s data shows the highest increase in ocean heat content in the top 700 meters from 1955 to 2024?", "options": { "A": "NOAA", "B": "IAP", "C": "MRI/JMA", "D": "CSIRO" }, "correct_answer": "A", "explanation": "The NOAA data shows the highest and steepest increase in ocean he…

**Figure 8.** Fine-tuning dynamics of MMCLIMA-70B-TXT. (a) Gradient norm stabilizes quickly, indicating well-conditioned optimization. (b) Training loss decreases sharply before plateauing near 0.4, suggesting effective domain adaptation.

Training configuration. The model was trained for 3 epochs with a batch size of 8 and 1 checkpoint saved per epoch. We performed 3 evaluation passes at uniform intervals. Optimization …

**Figure 9.** Fine-tuning dynamics for additional backbones. Both models exhibit stable optimization and monotonic loss reduction under the same settings as Section A.7.

… MMCLIMA supervision. Takeaways. Fine-tuning on MMCLIMA reliably lifts MCQ, Cloze, and Freeform across two distinct base models and under reduced-data regimes, indicating that (i) gains are not architecture-specific and (ii) even partial supervision is e…

Original abstract

Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet existing climate QA benchmarks are small, mostly textual, and cover a narrow range of models. We introduce MMClima, a large-scale multimodal climate question answering framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five core climate science domains. MMClima is constructed via automated claim extraction and QA synthesis with human-in-the-loop validation to ensure both scale and reliability. Using MMClima, we benchmark state-of-the-art multimodal language models on tasks requiring factual recall, visual interpretation, and cross-modal synthesis. We additionally fine-tune on the textual split to produce mmclima-70b-txt, a domain-adapted baseline that outperforms strong open- and closed-source models on textual QA. We release the dataset, evaluation pipeline, fine-tuned model weights, and data creation framework to support standardized multimodal evaluation for climate science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MMClima is mainly a data release of 104k multimodal climate QA pairs, and its value depends on whether the human validation step holds up under scrutiny.


The main thing to know is that the authors built and released MMClima, a dataset of over 104,000 question-answer pairs drawn from climate articles, video transcripts, and figures across five domains. They used automated claim extraction plus human review to reach that scale, then ran benchmarks on several multimodal models and fine-tuned a 70B text model that beats some baselines on the text portion.

What stands out is the multimodal coverage and the climate focus. Prior climate QA sets are smaller and mostly text-only, so this combination of sources at this size is new. Releasing the full dataset, the evaluation code, the data creation tools, and the fine-tuned weights is the practical part that matters most.

The soft spot is the validation process. The abstract calls the pairs expert-validated but supplies no numbers on rejection rates, inter-annotator agreement, or spot-check accuracy. Without those details the claim that the data is reliable at scale rests on an unshown step. The reported fine-tuning gain is also limited to text, so the multimodal side is mostly an off-the-shelf evaluation rather than a demonstration that adaptation improves cross-modal reasoning.

This is for groups working on multimodal models or domain-specific benchmarks who need a ready test set in climate science. A reader who wants to run or extend standardized evaluations would get direct use from the released materials.

I would send it to peer review. The dataset construction is the core claim, and referees can check the validation evidence and any error analysis in the full text. If those sections are thin the paper can still serve as a data resource once the numbers are clear.

Referee Report

2 major / 0 minor

Summary. The paper introduces MMClima, a large-scale multimodal climate QA framework with 104k+ expert-validated question-answer pairs spanning articles, video transcriptions, and figures across five climate science domains. The dataset is constructed via automated claim extraction and human-in-the-loop validation. The authors benchmark state-of-the-art multimodal models on factual recall, visual interpretation, and cross-modal synthesis, fine-tune a textual model (mmclima-70b-txt) claimed to outperform baselines, and release the dataset, evaluation pipeline, model weights, and data creation framework.

Significance. If the validation process is shown to produce accurate and representative pairs, MMClima would address a clear gap by supplying the first large-scale multimodal benchmark for climate science AI, where prior resources were small and text-only. The explicit release of the full dataset, pipeline, and fine-tuned weights is a concrete strength that directly supports reproducibility and standardized evaluation in the field.

major comments (2)

[Abstract] Abstract: the claim that mmclima-70b-txt 'outperforms strong open- and closed-source models on textual QA' is presented without any quantitative results, tables, or error analysis, so the performance assertion cannot be evaluated from the given text.
[Abstract] Abstract: the description of the 'human-in-the-loop validation' process provides no statistics on inter-annotator agreement, error rates, or sampling methodology, which is load-bearing for the central claim that the 104k+ pairs are both factually accurate and representative of genuine climate science reasoning.
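The inter-annotator agreement statistic requested in the second comment could be reported with a measure such as Cohen's kappa. The sketch below is illustrative only: it assumes two annotators labelling the same items with categorical verdicts, and the labels in the usage lines are invented for the example. Nothing here reflects how the authors actually measured agreement.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two reviewers on six QA pairs.
reviewer_1 = ["accept", "accept", "reject", "accept", "reject", "accept"]
reviewer_2 = ["accept", "reject", "reject", "accept", "reject", "accept"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most items are accepted and raw agreement is high by default.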

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We have revised the abstract to incorporate quantitative performance results and validation statistics, ensuring the central claims are supported and evaluable directly from the abstract while retaining full details in the main text.


Referee: [Abstract] Abstract: the claim that mmclima-70b-txt 'outperforms strong open- and closed-source models on textual QA' is presented without any quantitative results, tables, or error analysis, so the performance assertion cannot be evaluated from the given text.

Authors: We agree that the abstract should include quantitative results to allow direct evaluation of the performance claim. The main manuscript already contains the full results, tables, and error analysis in Section 5 and associated tables. We have revised the abstract to add a concise summary of the key metrics demonstrating the outperformance.

revision: yes

Referee: [Abstract] Abstract: the description of the 'human-in-the-loop validation' process provides no statistics on inter-annotator agreement, error rates, or sampling methodology, which is load-bearing for the central claim that the 104k+ pairs are both factually accurate and representative of genuine climate science reasoning.

Authors: We agree that the abstract should include these validation statistics to substantiate the dataset quality claim. The main manuscript provides the complete description and statistics in Section 3.2. We have revised the abstract to briefly report the inter-annotator agreement, error rates, and sampling methodology.

revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper presents a data-construction and benchmarking effort: it describes automated claim extraction plus human-in-the-loop validation to build a multimodal QA dataset, followed by model evaluation and fine-tuning. No equations, fitted parameters, derivations, or load-bearing self-citations appear in the abstract or stated methodology. The central claim (release of 104k+ validated pairs) rests on the success of the described process rather than reducing to any input by construction. This is a standard non-derivational dataset paper; the derivation chain is empty.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset construction relies on standard automated NLP pipelines and human validation; no free parameters, new axioms, or invented entities are introduced beyond the dataset itself.

pith-pipeline@v0.9.1-grok · 5711 in / 1030 out tokens · 14818 ms · 2026-06-27T17:14:25.382349+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 4 canonical work pages · 4 internal anchors

[1]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

arXiv:2005.11401. Li, X., Ding, J., and Elhoseiny, M. Vrsbench: A versatile vision–language benchmark dataset for remote sensing image understanding. In Advances in Neural Information Processing Systems, volume 37, 2024. Datasets and Benchmarks Track. arXiv:2406.12384. Liang, P., Bommasani, R., et al. Holistic evaluation of language models. Transactions o...

[2]

ChartQA: A benchmark for question answering about charts with visual and logical reasoning

ICLR 2025 Poster. Masry, A., Long, D. X., Tan, J. Q., Joty, S., and Hoque, E. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pp. 2263–2279, Dublin, Ireland, 2022. Ass...
[3]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

arXiv:2201.11903. Wikipedia contributors. Wikipedia, the free encyclopedia,

[4]

Zhang, T., Kishore, V., Wu, F., Weinberger, K

Accessed 2026-01-29. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations,
[5]

BERTScore: Evaluating Text Generation with BERT

arXiv:1904.09675. Zhao, Y., Luo, X., Luo, J., Zhang, W., Xiao, Z., Ju, W., Yu, P. S., and Zhang, M. Multifaceted evaluation of audio-visual capability for MLLMs: Effectiveness, efficiency, generalizability and robustness. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 1026–1041, Suzhou, China, 2025. Association for Comput...

[6]

arXiv:2306.05685. Zhu, H. and Tiwari, P. Climate change from large language models, 2023. arXiv:2312.11985. A. Appendix A.1. Chunk Distribution Across Themes. This section shows the number of chunks obtained from both sources, Wikipedia and YouTube transcripts. Table 5. Number of chu...

[7]

Standalone → The claim must be understandable by itself without extra context
[8]

Verifiable → The claim must be fact-based and checkable against reliable sources
[9]

Universally Accepted → The claim must be widely recognized in climate science, policy, or history
[10]

Atomic → Each claim must express exactly one fact
[11]

[Warning] Reject vague/general statements

Concrete → Prefer entities, events, dates, laws, or measurements. [Warning] Reject vague/general statements. Categories: factual, conceptual, causal, policy, statistical. Output each claim as JSON with: claim, category, explanation, title, url, theme, chunk id, pageid. Extract only a few high-quality, universally accepted claims per chunk. --- Title: chunk[...
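The output contract in the claim-extraction prompt above (a fixed set of fields and five categories) lends itself to a mechanical check before human review. The following is a minimal sketch, not the authors' code. The field list and category list come from the prompt text; the key spelling `chunk_id` is an assumption (the extracted text shows "chunk id"), and the function name and example record are hypothetical.

```python
import json

# Field names as listed in the prompt; "chunk_id" spelling is an assumption.
REQUIRED_FIELDS = {
    "claim", "category", "explanation", "title", "url", "theme", "chunk_id", "pageid",
}
ALLOWED_CATEGORIES = {"factual", "conceptual", "causal", "policy", "statistical"}


def validate_claim(raw: str) -> list[str]:
    """Return a list of problems found in one JSON-encoded claim record."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(record))]
    category = record.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    claim = record.get("claim")
    if not isinstance(claim, str) or not claim.strip():
        problems.append("claim must be non-empty text")
    return problems


# Hypothetical record, written for this example only.
example = json.dumps({
    "claim": "The Kyoto Protocol was adopted in 1997.",
    "category": "policy",
    "explanation": "States a single dated policy event.",
    "title": "Kyoto Protocol",
    "url": "https://example.org/kyoto-protocol",
    "theme": "policy",
    "chunk_id": 0,
    "pageid": 0,
})
print(validate_claim(example))
```

A check like this only enforces structure; whether a claim is standalone, verifiable, and atomic still depends on the model and the human validators.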
[13]

- The question must be complete by itself (not vague or partial)

Regardless of this flag: - Create one clear *standalone question* derived directly from the claim. - The question must be complete by itself (not vague or partial). - Provide a *short, precise answer* also derived directly from the claim. - The answer must be a short but complete sentence, factual, and to the point
[14]

whitelabel

Add a new field "whitelabel", which is an array of *keywords that answer the question in short way*. - Examples: ["Yes"], ["No"], ["1997"], ["CO2"], ["1 meter"], ["Paris Agreement"]. - Use the most important short tokens that answer the question. Return JSON only in this format (no commentary, no markdown fences): {{ "climate related": true/false, "quest...
[16]

Regardless of this flag, *always create one exam-style multiple-choice question (MCQ)* based on the claim
[17]

- The question must test knowledge of the claim

MCQ Requirements: - Question for MCQ must be complete and unambiguous. - The question must test knowledge of the claim. - Provide exactly 4 answer options labeled A, B, C, D. - Only ONE option must be correct. - Wrong answers must be plausible but clearly incorrect. - The correct answer must be directly supported by the claim
[18]

climate related

Return JSON only in this exact format (no explanation, no commentary, no markdown fences): {{ "climate related": true/false, "question": "...", "options":{{ "A": "...", "B": "...", "C": "...", "D": "..." }}, "correct answer": "A/B/C/D" }} Claim: "{claim text}" Context: Title = "{ti...
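The MCQ requirements and return format above can be checked structurally once the model output is parsed. The sketch below is an illustration under assumptions: the key spellings `climate_related` and `correct_answer` are guesses (the extracted prompt shows spaces, while the Figure 7 sample uses `correct_answer`), and the function name and example item are hypothetical. It does not check plausibility of distractors, which needs a human or model judge.

```python
def validate_mcq(item: dict) -> list[str]:
    """Return structural problems in one parsed MCQ record."""
    problems = []
    if not isinstance(item.get("climate_related"), bool):
        problems.append("climate_related must be true or false")
    question = item.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("question must be non-empty text")
    options = item.get("options")
    if not isinstance(options, dict) or set(options) != {"A", "B", "C", "D"}:
        problems.append("options must have exactly the keys A, B, C, D")
    elif len({str(text).strip().lower() for text in options.values()}) != 4:
        # Duplicate option texts would make more than one answer correct.
        problems.append("option texts must be distinct")
    if item.get("correct_answer") not in ("A", "B", "C", "D"):
        problems.append("correct_answer must be one of A, B, C, D")
    return problems


# Hypothetical item, written for this example only.
mcq = {
    "climate_related": True,
    "question": "In which year was the Kyoto Protocol adopted?",
    "options": {"A": "1992", "B": "1997", "C": "2005", "D": "2015"},
    "correct_answer": "B",
}
print(validate_mcq(mcq))
```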
[19]

climate related

Determine if the claim is *about climate science, climate change, environment, sustainability, or related topics*. - If yes → set "climate related": true. - If no → set "climate related": false
[20]

1997", "Paris Agreement

Regardless of this flag: - Create a *cloze question* (fill-in-the-blank statement) derived directly from the claim. - Replace the most important factual value in the claim with a blank: " ". - The cloze must be complete and understandable by itself (not vague or partial). - Provide the correct *answer* that fills the blank. - The answer must be a short b...
[21]

The same locked subset is used for all models

Deterministic proportional sampling. Within each stratum, sample 10% without replacement using a fixed random seed; concatenate strata to form the subset. The same locked subset is used for all models.
3. Calibration phase (using already-run models). Starting from the reviewer’s suggestion (20%), we swept candidate rates {20%, 15%, 10%, 5%}. For each rate, and...
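The deterministic proportional sampling step can be sketched as follows. This is an illustration under stated assumptions: the excerpt does not say how strata are defined, so the example stratifies by a hypothetical `type` field; the seed value, the function name, and the choice to keep at least one item per stratum are also assumptions.

```python
import random
from collections import defaultdict


def locked_subset(items, stratum_of, rate=0.10, seed=0):
    """Sample `rate` of each stratum without replacement, with a fixed seed."""
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    rng = random.Random(seed)  # fixed seed: the same subset on every run
    subset = []
    for key in sorted(strata):  # sorted keys keep the draw order stable
        members = strata[key]
        k = max(1, round(len(members) * rate))  # assumption: at least one item
        subset.extend(rng.sample(members, k))
    return subset


# Hypothetical pool: 100 MCQ, 50 cloze, and 30 free-form items.
pool = (
    [{"id": i, "type": "mcq"} for i in range(100)]
    + [{"id": i, "type": "cloze"} for i in range(100, 150)]
    + [{"id": i, "type": "freeform"} for i in range(150, 180)]
)
subset = locked_subset(pool, stratum_of=lambda item: item["type"])
print(len(subset))
```

Fixing the seed and iterating strata in sorted order is what makes the subset "locked": every model is then scored on exactly the same items.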
[22]

Free BERT

Evaluation phase. We then evaluated new higher-tier closed models strictly on the locked 10% subset with identical prompts and evaluation scripts. Table 11. Results on the locked 10% calibrated subset following the reviewer’s proposed strategy. “Free BERT” refers to BERTScore (F1) on free-form answers. Model MCQ Acc Cloze Weighted Free BERT google gemini-2....
