MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Changquan Gu; Dawei Zhou; Jianpeng Chen; Jingru Gan; Junkai Zhang; Ling Li; Mingyu Derek Ma; Wei Wang; Xiaoxuan Wang; Yanqiao Zhu

arxiv: 2510.12171 · v2 · pith:Y77XZRWZnew · submitted 2025-10-14 · 💻 cs.AI


Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang


classification 💻 cs.AI

keywords reasoning, materials, matscibench, models, science, questions, current, difficulty


Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less studied. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 subfields, together with a three-tier difficulty classification based on the reasoning length needed to solve each problem. MatSciBench includes detailed reference solutions for 946 questions, supports process-level error analysis, and contains 315 questions with images for evaluating multimodal reasoning. We evaluate leading thinking and non-thinking LLMs on MatSciBench, and further test three reasoning methods for non-thinking models: basic chain-of-thought prompting, tool augmentation, and self-correction. The results show that current models still face clear limits in college-level materials science reasoning. DeepSeek-R1 achieves the highest score on text-only questions at 75.22% accuracy, and GPT-5 performs the best on questions with images at 53.02%. Our analysis shows that tool augmentation improves many non-thinking models in a token-efficient way, while self-correction often fails to provide reliable gains and can revise correct answers into incorrect ones. We further analyze performance across difficulty levels, reasoning efficiency, multimodal reasoning, and failure patterns, and find that current models are mainly limited by domain knowledge gaps, calculation errors, problem comprehension failures, and difficulty in extracting precise information from scientific figures. Overall, MatSciBench provides a clear testbed for measuring current LLM limitations and guiding future work on scientific reasoning in materials science.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI 2026-05 unverdicted novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechan...

PolyReal: A Benchmark for Real-World Polymer Science Workflows
cs.CV 2026-04 unverdicted novelty 7.0

PolyReal benchmark shows leading MLLMs perform well on polymer knowledge reasoning but drop sharply on practical tasks like lab safety analysis and raw data extraction.

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
cs.AI 2026-03 accept novelty 7.0

ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.