pith. machine review for the scientific record.

arxiv: 2307.10635 · v3 · submitted 2023-07-20 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords LLM benchmarking · scientific reasoning · college-level problems · mathematics · chemistry · physics · prompting strategies · error analysis

The pith

Large language models achieve at most 43.22 percent on a new benchmark of college-level math, chemistry, and physics problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SciBench, a dataset of collegiate scientific problems, to move beyond high-school benchmarks and test deeper reasoning in LLMs. It runs representative open-source and proprietary models with multiple prompting methods and reports a best overall score of just 43.22 percent. Error analysis splits failures into ten distinct problem-solving abilities, revealing that no prompting approach improves every skill at once. A reader would care because closing these gaps could let AI contribute meaningfully to scientific work rather than only handling simpler tasks. The central finding is that current models lack the integrated reasoning needed for genuine university-level science.
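To make the reported number concrete, here is a minimal sketch of the kind of scoring loop such a benchmark implies. It is an illustration under stated assumptions, not the paper's exact protocol: query_model, the answer-extraction regex, and the 5 percent relative tolerance are all hypothetical.

```python
# Hypothetical SciBench-style scoring loop. The parser and tolerance are
# illustrative assumptions, not the paper's published evaluation code.
import re

def parse_numeric_answer(text: str) -> float | None:
    """Pull the last number out of a model's free-form solution text."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", text)
    return float(matches[-1]) if matches else None

def is_correct(pred: float, truth: float, rel_tol: float = 0.05) -> bool:
    """Count an answer as correct if it falls within a relative tolerance."""
    return abs(pred - truth) <= rel_tol * abs(truth)

def accuracy(problems, query_model) -> float:
    """problems: (question, numeric answer) pairs; query_model: prompt -> text."""
    hits = sum(
        1
        for question, truth in problems
        if (pred := parse_numeric_answer(query_model(question))) is not None
        and is_correct(pred, truth)
    )
    return hits / len(problems)
```

Under this reading, the 43.22 percent headline is simply the best accuracy value observed across all model and prompting-strategy combinations.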

Core claim

SciBench shows that current LLMs fall short of satisfactory performance on college-level scientific problem-solving, with the best overall score reaching only 43.22 percent across mathematics, chemistry, and physics problems, and no single prompting strategy consistently outperforms the rest.

What carries the argument

SciBench, the curated benchmark of collegiate scientific problems that evaluates LLMs through multiple prompting strategies and error categorization into ten abilities.
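As a concrete instance of those prompting strategies, a minimal sketch of prompt construction follows. The two-stage "Let's think step by step" template paraphrases the zero-shot chain-of-thought prompt quoted in the paper's appendix material; the function names and the worked_examples placeholder are illustrative assumptions.

```python
# Sketches of three prompting strategies of the kind the study compares.
# Templates are paraphrased; names and placeholders are assumptions.

def zero_shot(question: str) -> str:
    # Direct answering with no reasoning scaffold.
    return f"{question}\nAnswer:"

def zero_shot_cot_stage1(question: str) -> str:
    # Stage 1 of two-stage zero-shot chain-of-thought: elicit the reasoning.
    return f"{question}\nLet's think step by step."

def zero_shot_cot_stage2(question: str, explanation: str) -> str:
    # Stage 2: feed the reasoning back and ask for the final answer.
    return (
        f"{question}\nLet's think step by step.\n"
        f"{explanation}\nTherefore, the answer is"
    )

def few_shot_cot(question: str, worked_examples: list[str]) -> str:
    # Prepend worked solutions so the model imitates their reasoning style.
    return "\n\n".join(worked_examples + [f"{question}\nLet's think step by step."])
```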

If this is right

  • LLMs need targeted advances in multiple reasoning skills to support scientific research.
  • Prompting methods that boost one ability often reduce performance in others.
  • Error breakdowns by the ten abilities can direct future model improvements; a sketch of such a breakdown follows this list.
  • Benchmarks limited to high-school problems miss the capabilities needed for university science.
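A minimal sketch of the error breakdown referenced above, assuming one failure label per incorrectly solved problem; the data layout and function name are hypothetical, though the ten ability names follow the paper's taxonomy.

```python
# Illustrative aggregation of per-problem failure labels into per-strategy,
# per-ability error shares; the label format is an assumption.
from collections import Counter, defaultdict

ABILITIES = [
    "logical decomposition", "identification of assumptions",
    "spatial perception", "causal reasoning", "problem deduction",
    "abstract reasoning", "scientific literacy", "code conversion",
    "logical reasoning", "calculation",
]

def error_profile(labels: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """labels: (strategy, failed_ability), one pair per incorrect solution.

    Returns each strategy's share of errors per ability, which makes
    skill trade-offs between prompting strategies directly visible.
    """
    by_strategy: dict[str, Counter] = defaultdict(Counter)
    for strategy, ability in labels:
        by_strategy[strategy][ability] += 1
    profile = {}
    for strategy, counts in by_strategy.items():
        total = sum(counts.values())
        profile[strategy] = {a: counts[a] / total for a in ABILITIES}
    return profile
```

Comparing two strategies' profiles row by row is exactly the check behind the second bullet: a strategy can shrink the calculation share of errors while growing the spatial-perception one.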

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained directly on similar collegiate datasets might narrow the observed gap.
  • Extending the benchmark to biology or engineering could reveal domain-specific weaknesses.
  • Human oversight will remain essential for scientific tasks until these reasoning shortfalls are addressed.

Load-bearing premise

The curated problems in SciBench represent the full range of reasoning abilities required for genuine collegiate scientific work.

What would settle it

A new model or prompting method that scores above 70 percent on the full SciBench set while preserving accuracy on unrelated benchmarks would directly challenge the reported shortfall.

read the original abstract

Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SciBench, a benchmark of carefully curated collegiate-level problems drawn from mathematics, chemistry, and physics. It reports an in-depth evaluation of open-source and proprietary LLMs under multiple prompting strategies, with the strongest model reaching only 43.22% overall accuracy, and presents a user-study-based categorization of LLM errors into ten distinct problem-solving abilities. The analysis concludes that no prompting strategy dominates across all abilities.

Significance. If SciBench is shown to be a representative proxy for genuine collegiate scientific reasoning, the work supplies a concrete, publicly useful benchmark that quantifies current LLM limitations beyond high-school algebra, together with a fine-grained error taxonomy that can guide future model development. The empirical design and the observation that prompting gains in one skill can produce losses in others are useful contributions.

major comments (2)
  1. [Dataset curation] Dataset curation section: the manuscript supplies no expert human performance baseline, inter-rater reliability statistics, or systematic coverage audit against standard college curricula. Because the headline claim (best LLM score of 43.22%) rests on the assumption that SciBench problems are representative of collegiate scientific reasoning, the absence of these validation metrics leaves open the possibility that low scores reflect benchmark artifacts rather than general reasoning deficits.
  2. [Error analysis] User-study and error-categorization section: the derivation of the ten problem-solving abilities and the protocol of the user study (number of evaluators, agreement rates, resolution of disagreements) are not described in sufficient detail to allow readers to assess the reliability of the error taxonomy.
minor comments (1)
  1. [Abstract] The abstract states that SciBench is 'expansive' yet gives no aggregate statistics (total problems, domain breakdown, average solution length). Adding these numbers would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the suggested additions will strengthen the manuscript, and we address both major comments below with planned revisions.

read point-by-point responses
  1. Referee: [Dataset curation] Dataset curation section: the manuscript supplies no expert human performance baseline, inter-rater reliability statistics, or systematic coverage audit against standard college curricula. Because the headline claim (best LLM score of 43.22%) rests on the assumption that SciBench problems are representative of collegiate scientific reasoning, the absence of these validation metrics leaves open the possibility that low scores reflect benchmark artifacts rather than general reasoning deficits.

    Authors: We acknowledge that these validation elements would better support the representativeness of SciBench. In the revised version we will add (1) expert human performance baselines on a sampled subset of problems, (2) inter-rater reliability statistics from the curation process, and (3) a topic-coverage mapping to standard college curricula. These constitute a partial revision because new data collection is required. revision: partial

  2. Referee: [Error analysis] User-study and error-categorization section: the derivation of the ten problem-solving abilities and the protocol of the user study (number of evaluators, agreement rates, resolution of disagreements) are not described in sufficient detail to allow readers to assess the reliability of the error taxonomy.

    Authors: We agree the protocol details were insufficient. The revised manuscript will expand this section to specify: the iterative derivation of the ten abilities from pilot error coding; the user-study setup with three evaluators; inter-annotator agreement rates (Fleiss' kappa); and the disagreement-resolution procedure (discussion to consensus). This will be incorporated as a full revision. revision: yes
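The Fleiss' kappa the authors promise is mechanical to compute once the annotation counts exist. A minimal sketch for the three-evaluator, ten-category setup described above; this is the standard formula, not code from the paper.

```python
# Standard Fleiss' kappa. counts[i][j] = number of the n raters who assigned
# category j to item i; every row must sum to the same rater count n.
def fleiss_kappa(counts: list[list[int]]) -> float:
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of rater pairs that chose the same category.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items  # mean observed agreement
    # Chance agreement from the marginal category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Example use: 3 evaluators, 10 error categories; each row sums to 3.
# kappa = fleiss_kappa(annotation_counts)
```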

Circularity Check

0 steps flagged

Empirical benchmark evaluation with direct measurement on held-out problems

full rationale

The paper introduces SciBench as a curated dataset of collegiate problems and reports LLM performance scores (max 43.22%) obtained by direct evaluation of model outputs against ground-truth answers. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the evaluation pipeline. The central result is a straightforward empirical measurement rather than a derivation that reduces to its own inputs by construction. The user study on error categorization is likewise an independent post-hoc analysis and does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the curated problems validly test the targeted reasoning skills; no free parameters, new physical entities, or ad-hoc constants are introduced.

axioms (1)
  • domain assumption: The selected collegiate problems accurately sample the reasoning abilities required for university-level science.
    Benchmark validity depends on this curation judgment stated in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1081 out tokens · 34108 ms · 2026-05-16T12:56:18.951338+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

    cs.AI 2026-05 unverdicted novelty 8.0

    PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

  2. FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

    cs.CL 2026-05 unverdicted novelty 7.0

    FinVQA is a new multilingual benchmark for Indic financial VQA with three difficulty levels and four formats, paired with the FIND framework for faithful numerical reasoning via fine-tuning and constrained decoding.

  3. Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

  4. The limits of bio-molecular modeling with large language models : a cross-scale evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

  5. ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

    cs.AI 2026-03 accept novelty 7.0

    ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.

  6. Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

    cs.HC 2026-03 conditional novelty 7.0

    A DIF-based statistical method identifies items where humans and LLMs show systematic performance differences on chemistry and entrance exams, supporting AI-aware assessment design.

  7. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  8. PAAC: Privacy-Aware Agentic Device-Cloud Collaboration

    cs.LG 2026-05 unverdicted novelty 6.0

    PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines o...

  9. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  10. TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering

    cs.AI 2026-04 unverdicted novelty 6.0

    TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid...

  11. PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

    cs.LG 2026-04 unverdicted novelty 6.0

    PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

  12. Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

    physics.comp-ph 2026-03 unverdicted novelty 6.0

    QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...

  13. AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

    cs.AI 2026-02 unverdicted novelty 6.0

    AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.

  14. FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

    cs.CV 2025-12 conditional novelty 6.0

    FPBench evaluates 20 MLLMs across 8 fingerprint tasks on 7 datasets and shows fine-tuning vision and language encoders improves performance by 7-39%.

  15. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  16. Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 5.0

    Adaptive MAS exhibit topological overfitting across domains and illusory coordination where surface accuracy masks non-ideal internal behaviors.

  17. Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

    stat.AP 2026-02 unverdicted novelty 5.0

    GPT-4o exhibits daily and weekly periodic fluctuations in performance on a fixed physics task, accounting for about 20% of observed variance.

  18. Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow

    cs.CL 2026-01 unverdicted novelty 5.0

    MDLMs lag autoregressive models in performance because parallel modeling weakens inter-token dependencies, yet they adapt generation order to task demands and show promise in a generate-then-edit paradigm.

  19. TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

    cs.CL 2026-05 unverdicted novelty 4.0

    TCMIIES is a zero-install browser platform with schema-guided LLM prompting that achieves over 94% structured output compliance for academic information extraction, including support for Chinese databases.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 19 Pith papers · 2 internal anchors

  1. [1]

    Thermodynamics, Statistical Thermodynamics, and Kinetics

    Engel, T. and Reid, P. J. Thermodynamics, Statistical Thermodynamics, and Kinetics. Prentice Hall, 2010.

  2. [2]

    Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions

    Kabir, S., Udo-Imeh, D. N., Kou, B., and Zhang, T. Who answers it better? An in-depth analysis of ChatGPT and Stack Overflow answers to software engineering questions. arXiv preprint arXiv:2308.02312, 2023.

  3. [3]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.

  4. [4]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

  5. [5]

    Logical Decomposition and Analysis Skills: This ability involves decomposing the problem into smaller, manageable parts, and understanding the relationships between these parts

  6. [6]

    Identification of Assumptions: This skill involves the AI’s ability to recognize relevant and necessary assumptions in the problem

  7. [7]

    Spatial Perception: This is important for understanding problems in areas such as physics and chemistry, where you need to visualize molecules, forces, fields, etc

  8. [8]

    Causal Reasoning: This is the ability to understand cause and effect relationships

  9. [9]

    Problem Deduction Skills: This pertains to the ability to infer and deduce potential solutions or underlying principles from the given information in a problem

  10. [10]

    Abstract Reasoning: This skill involves the ability to understand complex concepts that can’t be perceived physically, and to recognize patterns or relationships beyond concrete examples

  11. [11]

    Scientific Literacy: This skill involves a comprehensive understanding of key scientific principles, terminology, and methodologies across a range of disciplines

  12. [12]

    Code Conversion Skills: This denotes the ability to accurately translate solution steps into different programming languages, like Python or Wolfram, without syntax errors

  13. [13]

    Logical Reasoning: This is the ability to make a reasoned argument and to identify fallacies or inconsistencies in an argument or set of data

  14. [14]

    Calculation Skills: This involves the ability to accurately carry out mathematical operations and computations

  15. [15]

    Error Reason: the output from the LLM Verifier used in the classification of error causes, anchored at Figure S16, an example problem inaccurately solved due to error reason 2 (Identification of Assumptions)