ChemPro: A Progressive Chemistry Benchmark for Large Language Models

Aaditya Baranwal; Shruti Vyas

arXiv: 2602.03108 · v4 · submitted 2026-02-03 · cs.CL

Pith reviewed 2026-05-16 08:37 UTC · model grok-4.3

keywords: chemistry benchmark, LLM evaluation, progressive difficulty, scientific reasoning, chemistry questions, model limitations

The pith

Large language models perform well on basic chemistry questions but their accuracy declines sharply with increasing complexity and reasoning demands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ChemPro, a benchmark of 4100 chemistry question-answer pairs divided into four progressive difficulty levels from basic to high-school standard. It covers multiple-choice and numerical questions across biochemistry, inorganic, organic, and physical chemistry, with a balanced mix of recall, multi-concept, and problem-solving items. Evaluation of more than fifty state-of-the-art LLMs shows strong results on straightforward questions yet consistent drops as problems require longer reasoning chains or integration of several concepts. The work treats this pattern as evidence of specific limitations in LLMs' general scientific reasoning rather than isolated knowledge gaps. The benchmark is constructed to mirror student academic testing so that performance trends can be tracked across difficulty tiers.

Core claim

ChemPro shows that LLMs maintain high accuracy on basic chemistry questions but experience declining performance as question difficulty rises through multi-concept integration, long-horizon reasoning, and nuanced problem-solving, exposing understudied limitations in scientific understanding.

What carries the argument

ChemPro, a progressive benchmark of 4100 natural-language questions organized in four coherent difficulty sections that mirror student academic evaluations in chemistry.

If this is right

LLM developers need training methods that specifically target multi-step scientific reasoning rather than isolated facts.
Current general-knowledge benchmarks likely overestimate LLM capabilities in science by under-sampling harder problem types.
The observed accuracy drop suggests LLMs will struggle with real laboratory or research tasks that chain many chemistry concepts.
Future model releases can be tracked on ChemPro to measure whether scaling or new architectures close the gap at higher difficulty levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A parallel progressive benchmark in physics or biology could reveal whether the same complexity-related decline appears across scientific domains.
If the pattern holds, hybrid systems that combine LLMs with external symbolic chemistry solvers or simulators may be required for reliable high-school-level performance.
The benchmark could serve as a diagnostic tool to test whether new reasoning techniques, such as chain-of-thought or tool-use, flatten the performance drop at higher difficulty tiers.

Load-bearing premise

The questions validly measure LLM proficiency because they were designed to match the difficulty progression found in basic to high-school chemistry curricula.

What would settle it

Administer the same ChemPro questions to a cohort of high-school students and compare their accuracy curve across the four difficulty sections to the LLM curve; if LLMs match or surpass student performance at the highest levels, the claimed decline would not hold.
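The comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names (`accuracy_by_section`, `declines_monotonically`), the record layout, and the toy outcomes are assumptions made for this example, not anything taken from the paper or its data.

```python
from collections import defaultdict


def accuracy_by_section(results):
    """Map each section to its accuracy, given (section, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for section, is_correct in results:
        totals[section] += 1
        correct[section] += int(is_correct)
    return {s: correct[s] / totals[s] for s in sorted(totals)}


def declines_monotonically(curve):
    """True if accuracy never rises from one section to the next."""
    values = [curve[s] for s in sorted(curve)]
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy, made-up outcomes used only to exercise the functions (NOT ChemPro results).
toy_llm = [(1, True), (1, True), (2, True), (2, False),
           (3, False), (3, True), (4, False), (4, False)]
toy_students = [(1, True), (1, False), (2, True), (2, False),
                (3, True), (3, False), (4, True), (4, False)]

llm_curve = accuracy_by_section(toy_llm)
student_curve = accuracy_by_section(toy_students)

# Gap at the hardest section: negative means the LLM trails the student cohort.
gap_at_top = llm_curve[4] - student_curve[4]
```

With real data, the two curves and the gap at the hardest section are the quantities the proposed student comparison would inspect.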

Original abstract

We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChemPro is a new chemistry benchmark with 4100 questions in four progressive sections, but the difficulty claims rest on design assertions without independent checks.

The main thing to know is that this paper releases ChemPro, a new set of 4100 chemistry questions split into four sections of increasing difficulty, and reports that LLMs do fine on the easy ones but lose accuracy as the questions get more involved. The questions cover the usual subfields (bio, inorganic, organic, physical) and mix recall, multi-concept, and numerical problems in a way that mirrors high-school tests. The evaluation runs across more than fifty models, which gives a decent snapshot of where current systems sit on this material. That dataset and the broad model sweep are the concrete additions here. The progressive structure is a reasonable organizing idea for seeing capability gaps in one domain.

The soft spot is the missing grounding for the difficulty progression itself. The abstract states that the questions were carefully designed with a gradual increase in difficulty, but there is no account of curriculum mapping, blind expert ratings, student pre-testing, or any statistical check that the sections actually differ in reasoning load rather than in length or format. Without those steps, the drop in scores could trace to correlated factors, such as more numerical items or longer contexts, instead of true complexity.

The work is aimed at people building or testing LLMs for science education and basic research tools. A reader who needs a chemistry-specific test set would get practical value once the questions are released and the construction details are filled in. The thinking is straightforward and the contribution is honest on its own terms. I would send it to peer review so the authors can add the validation steps; the benchmark idea is worth refining even if the current version is preliminary.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChemPro, a benchmark of 4100 chemistry QA pairs divided into four sections of increasing difficulty, spanning MCQ and numerical questions across bio-, inorganic, organic, and physical chemistry. The benchmark is designed to mimic basic-to-high-school student evaluations. The paper evaluates 52 LLMs (reported as 45+7) on it and finds that accuracy declines with rising complexity, underscoring limitations in LLMs' scientific reasoning.

Significance. If the progressive difficulty is independently validated, ChemPro would offer a useful diagnostic for LLM weaknesses in multi-concept reasoning and problem-solving, complementing existing science benchmarks and motivating targeted improvements in model training for domain-specific tasks.

major comments (2)

[Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that the four sections constitute a 'gradual increase in the question difficulty' and are 'carefully designed analogous to a student's academic evaluation' is unsupported by any described validation procedure (curriculum mapping, blind expert rating, student pre-testing, or inter-rater reliability). This is load-bearing for the main result, as the reported accuracy decline could stem from unmeasured confounds rather than reasoning depth.
[§4] §4 (Experiments) and associated tables: The analysis of performance decline across sections does not report controls or regressions for potential confounds such as average token length, balance of numerical vs. MCQ formats, or topic distribution per section. Without these, the attribution of the decline specifically to 'different types and levels of complexity' remains unconvincing.

minor comments (2)

[§4] The description of the 45+7 models evaluated could include a clearer breakdown by open-source vs. proprietary and parameter scale in a dedicated table.
[Figures] Figure captions and example questions would benefit from explicit difficulty-level labels to aid reader interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to strengthen the description of benchmark construction and to include controls for potential confounds in the experimental analysis.

Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that the four sections constitute a 'gradual increase in the question difficulty' and are 'carefully designed analogous to a student's academic evaluation' is unsupported by any described validation procedure (curriculum mapping, blind expert rating, student pre-testing, or inter-rater reliability). This is load-bearing for the main result, as the reported accuracy decline could stem from unmeasured confounds rather than reasoning depth.

Authors: We agree that the manuscript would benefit from a more explicit account of how the four sections were constructed. The sections were ordered according to standard high-school chemistry curriculum progression (factual recall and single-concept questions in Section 1, progressing to multi-concept integration and numerical problem-solving in Section 4), with topics drawn from typical bio-, inorganic, organic, and physical chemistry syllabi. However, no formal curriculum mapping, blind expert ratings, or student pre-testing was performed or reported. In the revision we will add a dedicated subsection in §3 that details the design rationale, provides representative examples from each section, and states the absence of formal inter-rater validation. We will also note this limitation explicitly when interpreting the accuracy trends.

revision: partial

Referee: [§4] §4 (Experiments) and associated tables: The analysis of performance decline across sections does not report controls or regressions for potential confounds such as average token length, balance of numerical vs. MCQ formats, or topic distribution per section. Without these, the attribution of the decline specifically to 'different types and levels of complexity' remains unconvincing.

Authors: We acknowledge that the current analysis does not include explicit controls for the listed confounds. In the revised manuscript we will add: (i) average token lengths per section, (ii) the proportion of MCQ versus numerical questions in each section, and (iii) topic distributions across the four chemistry sub-domains. We will further include a regression analysis (or stratified accuracy tables) that examines whether the observed performance drop remains significant after accounting for these factors. If the decline persists, we will strengthen the claim; if not, we will qualify the interpretation accordingly.

revision: yes
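A stratified accuracy table of the kind mentioned in this response might look like the following sketch. The record fields (`section`, `format`, `correct`), the function names, and the toy records are hypothetical and chosen only for illustration; they are not the authors' analysis code or data.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Accuracy per (section, format) cell, so format mix cannot hide in a pooled score."""
    cells = defaultdict(lambda: [0, 0])  # (section, format) -> [n_correct, n_total]
    for r in records:
        cell = cells[(r["section"], r["format"])]
        cell[0] += int(r["correct"])
        cell[1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in cells.items()}


def pooled_accuracy(records):
    """Accuracy per section, ignoring question format."""
    cells = defaultdict(lambda: [0, 0])  # section -> [n_correct, n_total]
    for r in records:
        cells[r["section"]][0] += int(r["correct"])
        cells[r["section"]][1] += 1
    return {s: n_correct / n_total for s, (n_correct, n_total) in cells.items()}


# Toy, made-up records (NOT ChemPro data): Section 4 simply has more numerical items.
toy = [
    {"section": 1, "format": "mcq", "correct": True},
    {"section": 1, "format": "mcq", "correct": True},
    {"section": 1, "format": "numerical", "correct": False},
    {"section": 4, "format": "mcq", "correct": True},
    {"section": 4, "format": "numerical", "correct": False},
    {"section": 4, "format": "numerical", "correct": False},
]

table = stratified_accuracy(toy)
pooled = pooled_accuracy(toy)
```

In this toy case the pooled score falls from Section 1 to Section 4, yet accuracy within each format is identical across the two sections, so the apparent decline comes entirely from the format mix. That is the pattern the referee's confound concern describes, and a stratified table is one way to rule it in or out.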

Circularity Check

0 steps flagged

No circularity: direct benchmark contribution with no derivations or self-referential reductions

The paper introduces ChemPro as a new dataset of 4100 questions partitioned into four difficulty sections by explicit design choice, with no equations, fitted parameters, predictions, or load-bearing self-citations. The claim of gradual difficulty increase is asserted via the authors' construction ('carefully designed analogous to a student's academic evaluation') rather than derived from prior results or reduced to inputs by construction. No steps match any enumerated circularity pattern; the work is a self-contained data and evaluation contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are used; the paper contributes a new dataset and empirical evaluation results.

pith-pipeline@v0.9.0 · 5507 in / 953 out tokens · 37480 ms · 2026-05-16T08:37:36.708131+00:00 · methodology
