arxiv: 2505.20740 · v3 · submitted 2025-05-27 · 💻 cs.AI

Recognition: unknown

MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs

classification 💻 cs.AI
keywords: reasoning · scientific · benchmark · earth · mllms · multimodal · msearth · address

The rapid advancement of multimodal large language models (MLLMs) offers new opportunities for complex scientific challenges, yet their application in Earth science, especially at the graduate level, remains underexplored due to a lack of benchmarks reflecting the depth and complexity of geoscientific reasoning. Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. Covering the five major spheres of Earth science (atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere), MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers. The benchmark supports tasks such as scientific figure captioning, multiple-choice questions, and open-ended reasoning, providing a scalable, high-fidelity resource for developing and evaluating MLLMs in scientific reasoning.

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GeoR-Bench: Evaluating Geoscience Visual Reasoning

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.