ChemPro: A Progressive Chemistry Benchmark for Large Language Models
Pith reviewed 2026-05-16 08:37 UTC · model grok-4.3
The pith
Large language models perform well on basic chemistry questions but their accuracy declines sharply with increasing complexity and reasoning demands.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChemPro shows that LLMs maintain high accuracy on basic chemistry questions but experience declining performance as question difficulty rises through multi-concept integration, long-horizon reasoning, and nuanced problem-solving, exposing understudied limitations in scientific understanding.
What carries the argument
ChemPro, a progressive benchmark of 4100 natural-language questions organized in four coherent difficulty sections that mirror student academic evaluations in chemistry.
If this is right
- LLM developers need training methods that specifically target multi-step scientific reasoning rather than isolated facts.
- Current general-knowledge benchmarks likely overestimate LLM capabilities in science by under-sampling harder problem types.
- The observed accuracy drop suggests LLMs will struggle with real laboratory or research tasks that chain many chemistry concepts.
- Future model releases can be tracked on ChemPro to measure whether scaling or new architectures close the gap at higher difficulty levels.
Where Pith is reading between the lines
- A parallel progressive benchmark in physics or biology could reveal whether the same complexity-related decline appears across scientific domains.
- If the pattern holds, hybrid systems that combine LLMs with external symbolic chemistry solvers or simulators may be required for reliable high-school-level performance.
- The benchmark could serve as a diagnostic tool to test whether new reasoning techniques, such as chain-of-thought or tool-use, flatten the performance drop at higher difficulty tiers.
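One way to make "flatten the performance drop" concrete is to fit a straight line to accuracy against section index and compare slopes between prompting conditions. The sketch below assumes four sections numbered 1 to 4. The accuracy values are hypothetical placeholders, not results from the paper, and `decline_slope` is an illustrative helper name.

```python
from typing import Sequence


def decline_slope(sections: Sequence[float], accuracies: Sequence[float]) -> float:
    """Ordinary least-squares slope of accuracy against section index."""
    if len(sections) != len(accuracies) or len(sections) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(sections)
    mean_x = sum(sections) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in sections)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sections, accuracies))
    return sxy / sxx


sections = [1, 2, 3, 4]
# Hypothetical per-section accuracies, for illustration only.
baseline_acc = [0.9, 0.7, 0.5, 0.3]
cot_acc = [0.9, 0.8, 0.7, 0.6]

baseline_slope = decline_slope(sections, baseline_acc)
cot_slope = decline_slope(sections, cot_acc)

# A slope closer to zero means a flatter decline across difficulty tiers.
print(f"baseline slope: {baseline_slope:.3f}, chain-of-thought slope: {cot_slope:.3f}")
```

A single slope hides non-linear drops (for example, a cliff between two sections), so per-section gaps are worth inspecting alongside it.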
Load-bearing premise
The questions validly measure LLM proficiency because they were designed to match the difficulty progression found in basic to high-school chemistry curricula.
What would settle it
Administer the same ChemPro questions to a cohort of high-school students and compare their accuracy curve across the four difficulty sections to the LLM curve; if LLMs match or surpass student performance at the highest levels, the decline would not indicate an LLM-specific limitation.
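The comparison described here can be sketched as two per-section accuracy curves and their difference. The data below are toy placeholders (no student study is reported in the source), and the function names are illustrative assumptions.

```python
from typing import Dict, List


def accuracy_by_section(results: Dict[int, List[bool]]) -> Dict[int, float]:
    """Fraction of correct answers in each difficulty section."""
    return {section: sum(marks) / len(marks) for section, marks in sorted(results.items())}


def curve_gap(llm: Dict[int, float], students: Dict[int, float]) -> Dict[int, float]:
    """LLM accuracy minus student accuracy, section by section."""
    return {section: llm[section] - students[section] for section in sorted(llm)}


# Hypothetical graded answers (True = correct), keyed by section 1..4.
llm_results = {
    1: [True, True, True, True],
    2: [True, True, True, False],
    3: [True, True, False, False],
    4: [True, False, False, False],
}
student_results = {
    1: [True, True, True, False],
    2: [True, True, True, False],
    3: [True, True, True, False],
    4: [True, True, False, False],
}

llm_curve = accuracy_by_section(llm_results)
student_curve = accuracy_by_section(student_results)
gap = curve_gap(llm_curve, student_curve)

# A gap that turns negative only in later sections would be consistent with
# a decline that is steeper for the LLM than for students.
print(gap)
```

With real data, each section would need enough items and students for the gap to be distinguishable from sampling noise.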
Original abstract
We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChemPro, a benchmark of 4100 chemistry QA pairs divided into four sections of increasing difficulty, spanning multiple-choice (MCQ) and numerical questions across biochemistry, inorganic, organic, and physical chemistry. The benchmark is designed to mimic basic-to-high-school student evaluations. The authors evaluate 52 LLMs (given as 45+7 in the abstract) and report that accuracy declines with rising complexity, which they take to underscore limitations in LLMs' scientific reasoning.
Significance. If the progressive difficulty is independently validated, ChemPro would offer a useful diagnostic for LLM weaknesses in multi-concept reasoning and problem-solving, complementing existing science benchmarks and motivating targeted improvements in model training for domain-specific tasks.
major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that the four sections constitute a 'gradual increase in the question difficulty' and are 'carefully designed analogous to a student's academic evaluation' is unsupported by any described validation procedure (curriculum mapping, blind expert rating, student pre-testing, or inter-rater reliability). This is load-bearing for the main result, as the reported accuracy decline could stem from unmeasured confounds rather than reasoning depth.
- [§4] §4 (Experiments) and associated tables: The analysis of performance decline across sections does not report controls or regressions for potential confounds such as average token length, balance of numerical vs. MCQ formats, or topic distribution per section. Without these, the attribution of the decline specifically to 'different types and levels of complexity' remains unconvincing.
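As an illustration of the inter-rater reliability check named in the first major comment, the sketch below computes unweighted Cohen's kappa for two raters who each assign questions to one of the four sections. The ratings are hypothetical; the paper reports no such exercise.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical difficulty labels (sections 1..4) from two independent raters.
rater_a = [1, 1, 2, 2, 3, 3, 4, 4]
rater_b = [1, 1, 2, 3, 3, 3, 4, 4]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Because the sections are ordered, a weighted kappa that penalises a 1-versus-4 disagreement more than a 2-versus-3 disagreement may be the better fit; the unweighted form is used here only for brevity.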
minor comments (2)
- [§4] The description of the 45+7 models evaluated could include a clearer breakdown by open-source vs. proprietary and parameter scale in a dedicated table.
- [Figures] Figure captions and example questions would benefit from explicit difficulty-level labels to aid reader interpretation.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript to strengthen the description of benchmark construction and to include controls for potential confounds in the experimental analysis.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that the four sections constitute a 'gradual increase in the question difficulty' and are 'carefully designed analogous to a student's academic evaluation' is unsupported by any described validation procedure (curriculum mapping, blind expert rating, student pre-testing, or inter-rater reliability). This is load-bearing for the main result, as the reported accuracy decline could stem from unmeasured confounds rather than reasoning depth.
  Authors: We agree that the manuscript would benefit from a more explicit account of how the four sections were constructed. The sections were ordered according to standard high-school chemistry curriculum progression (factual recall and single-concept questions in Section 1, progressing to multi-concept integration and numerical problem-solving in Section 4), with topics drawn from typical bio-, inorganic, organic, and physical chemistry syllabi. However, no formal curriculum mapping, blind expert ratings, or student pre-testing was performed or reported. In the revision we will add a dedicated subsection in §3 that details the design rationale, provides representative examples from each section, and states the absence of formal inter-rater validation. We will also note this limitation explicitly when interpreting the accuracy trends.
  Revision: partial
- Referee: [§4] §4 (Experiments) and associated tables: The analysis of performance decline across sections does not report controls or regressions for potential confounds such as average token length, balance of numerical vs. MCQ formats, or topic distribution per section. Without these, the attribution of the decline specifically to 'different types and levels of complexity' remains unconvincing.
  Authors: We acknowledge that the current analysis does not include explicit controls for the listed confounds. In the revised manuscript we will add: (i) average token lengths per section, (ii) the proportion of MCQ versus numerical questions in each section, and (iii) topic distributions across the four chemistry sub-domains. We will further include a regression analysis (or stratified accuracy tables) that examines whether the observed performance drop remains significant after accounting for these factors. If the decline persists, we will strengthen the claim; if not, we will qualify the interpretation accordingly.
  Revision: yes
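The stratified accuracy tables mentioned in this response can be sketched as follows. The example is built so that accuracy within each question format is identical across sections, yet the pooled accuracy still falls because the format mix shifts. All counts are hypothetical, and `Item` and `accuracy_by` are illustrative names rather than anything from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, NamedTuple


class Item(NamedTuple):
    section: int   # difficulty section, 1..4
    fmt: str       # "mcq" or "numerical"
    correct: bool  # whether the model answered correctly


def accuracy_by(items: Iterable[Item], key: Callable[[Item], Hashable]) -> Dict[Hashable, float]:
    """Accuracy within each group defined by `key`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for item in items:
        cell = totals[key(item)]
        cell[0] += int(item.correct)
        cell[1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}


def make_items(section: int, fmt: str, n_correct: int, n_total: int) -> List[Item]:
    """Build toy graded items with the given number of correct answers."""
    return [Item(section, fmt, i < n_correct) for i in range(n_total)]


# Hypothetical data: section 1 is MCQ-heavy, section 4 is numerical-heavy.
items = (
    make_items(1, "mcq", 6, 8) + make_items(1, "numerical", 1, 4)
    + make_items(4, "mcq", 3, 4) + make_items(4, "numerical", 2, 8)
)

pooled = accuracy_by(items, key=lambda it: it.section)
stratified = accuracy_by(items, key=lambda it: (it.section, it.fmt))

print("pooled:", pooled)          # declines from section 1 to section 4
print("stratified:", stratified)  # flat within each format
```

If real stratified cells still show a decline within each format, the format-mix explanation is ruled out; token length and topic distribution would need the same treatment or a joint regression.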
Circularity Check
No circularity: direct benchmark contribution with no derivations or self-referential reductions
Full rationale
The paper introduces ChemPro as a new dataset of 4100 questions partitioned into four difficulty sections by explicit design choice, with no equations, fitted parameters, predictions, or load-bearing self-citations. The claim of gradual difficulty increase is asserted via the authors' construction ('carefully designed analogous to a student's academic evaluation') rather than derived from prior results or reduced to inputs by construction. No steps match any enumerated circularity pattern; the work is a self-contained data and evaluation contribution.