COMPOSITE-STEM
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
COMPOSITE-STEM introduces 70 expert-written tasks in physics, biology, chemistry, and mathematics on which the strongest of four frontier AI agents tested scores only 21 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COMPOSITE-STEM is a collection of 70 expert-curated tasks spanning physics, biology, chemistry, and mathematics that require agents to produce scientifically useful outputs. The evaluation combines exact-match checks with criterion-based rubrics and an LLM-as-a-jury protocol so that open-ended but meaningful answers can be scored. When four frontier models were tested with an adapted multimodal Terminus-2 agent harness in the Harbor evaluation framework, the highest score reached 21 percent, which the paper interprets as evidence that the benchmark measures capabilities still outside the reach of present systems.
What carries the argument
The COMPOSITE-STEM benchmark itself, built from 70 doctoral-level tasks and a hybrid grading protocol that pairs exact matching with LLM jury evaluation of rubric criteria.
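To picture that machinery, here is a minimal sketch of what a hybrid grader of this kind could look like. The `Task` fields, the majority-vote rule, and the `ask_jury_model` helper are illustrative assumptions; the paper's actual protocol and jury prompts are not specified here.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    exact_answer: str | None   # set when a closed-form answer allows exact matching
    rubric: list[str]          # criterion-based rubric items for open-ended outputs

def ask_jury_model(model: str, criterion: str, output: str) -> bool:
    """Placeholder for a call to one jury LLM asking whether `output`
    satisfies `criterion`. Hypothetical helper, not a real API."""
    raise NotImplementedError

def grade(task: Task, output: str, jury_models: list[str]) -> float:
    # Exact-match path: normalize lightly and compare.
    if task.exact_answer is not None:
        return float(output.strip().lower() == task.exact_answer.strip().lower())

    # Rubric path: a criterion passes if a majority of jury models agree,
    # and the task score is the fraction of criteria passed.
    passed = 0
    for criterion in task.rubric:
        votes = [ask_jury_model(m, criterion, output) for m in jury_models]
        if sum(votes) * 2 > len(votes):
            passed += 1
    return passed / len(task.rubric)
```

Scoring each rubric criterion separately and then averaging, rather than asking the jury for a single holistic score, is what would let partially correct open-ended answers earn partial credit.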
If this is right
- Current agents need substantial advances in long-horizon planning and cross-domain integration before they can contribute reliably to scientific discovery.
- The open release of the 70 tasks allows direct comparison of new agents against the 21 percent baseline.
- Developers can use the rubric-based scoring to identify which specific scientific skills remain hardest for models.
- The benchmark supplies a concrete target for measuring whether agent improvements translate into usable scientific output rather than just higher scores on saturated tests.
Where Pith is reading between the lines
- Widespread adoption could redirect evaluation effort away from narrow math problems toward tasks that combine observation, hypothesis, and interpretation.
- Low scores may encourage training regimes that emphasize iterative experimental design rather than single-shot answers.
- The gap between 21 percent and perfect performance points to multimodal reasoning and sustained context management as likely bottlenecks worth targeted testing.
Load-bearing premise
That the 70 tasks curated by doctoral-level researchers combined with the LLM-as-a-jury grading protocol provide a valid and unbiased measure of AI capabilities for scientific discovery.
What would settle it
A future model that routinely scores above 50 percent on the released tasks would show the capability gap closing, while an independent panel of domain experts concluding that the tasks do not reflect typical open-ended scientific workflows would indicate that the benchmark does not capture the intended gap.
Original abstract
AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics curated by doctoral-level researchers. It evaluates four frontier models via an adapted multimodal Terminus-2 agent harness in the Harbor framework, combining exact-match grading, criterion-based rubrics, and an LLM-as-a-jury protocol for flexible outputs. The top model scores 21%, which the authors interpret as evidence that the benchmark captures capabilities beyond current agent reach. All tasks are open-sourced.
Significance. If the scoring protocol proves reliable, the benchmark would address saturation in existing expert-written STEM evaluations by enabling assessment of open-ended scientific reasoning. The open-sourcing of all tasks with contributor permission is a strength that supports reproducibility and further research on AI for scientific discovery.
major comments (2)
- [Evaluation section] LLM-as-a-jury protocol: No calibration, inter-rater agreement statistics, or human-expert comparison is reported for the LLM jury combined with rubric scoring. This is load-bearing for the central claim, as the 21% top score is presented as demonstrating capabilities beyond current agents; without validation, the aggregate may reflect grading inconsistencies rather than genuine performance gaps on scientifically meaningful tasks.
- [Benchmark Description] Benchmark curation (70 tasks): Details on task validation by the doctoral-level curators, inter-judge agreement during curation, or explicit exclusion criteria are absent. This undermines interpretation of the low scores as a trustworthy signal of agent limitations, since the tasks themselves form the basis for the claim that COMPOSITE-STEM measures beyond current reach.
minor comments (1)
- [Abstract] The abstract states four models were evaluated but does not name them or the specific Terminus-2 adaptations; adding these would improve clarity without altering the results.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript on COMPOSITE-STEM. We appreciate the emphasis on rigorous validation of both the grading protocol and task curation to support the benchmark's claims. We respond to each major comment below and will incorporate revisions to address the points raised.
Point-by-point responses
- Referee: [Evaluation section] LLM-as-a-jury protocol: No calibration, inter-rater agreement statistics, or human-expert comparison is reported for the LLM jury combined with rubric scoring. This is load-bearing for the central claim, as the 21% top score is presented as demonstrating capabilities beyond current agents; without validation, the aggregate may reflect grading inconsistencies rather than genuine performance gaps on scientifically meaningful tasks.
  Authors: We agree that the absence of calibration and agreement statistics for the LLM-as-a-jury protocol is a limitation that weakens confidence in the reported scores. In the revised manuscript, we will add a dedicated subsection under Evaluation describing a post-hoc calibration study: a random 20% sample of model outputs will be independently scored by two human experts using the identical rubrics, with inter-rater agreement (Cohen's kappa) and agreement with the LLM jury reported, as in the sketch following these responses. This will directly test whether grading inconsistencies could explain the 21% ceiling. revision: yes
- Referee: [Benchmark Description] Benchmark curation (70 tasks): Details on task validation by the doctoral-level curators, inter-judge agreement during curation, or explicit exclusion criteria are absent. This undermines interpretation of the low scores as a trustworthy signal of agent limitations, since the tasks themselves form the basis for the claim that COMPOSITE-STEM measures beyond current reach.
  Authors: The tasks were developed and initially validated by doctoral-level domain experts, with each task cross-reviewed by at least one additional curator for scientific accuracy and clarity. However, we did not include formal inter-curator agreement metrics or explicit exclusion criteria in the submitted manuscript. In revision, we will expand the Benchmark Description section to document the curation workflow, report inter-curator agreement on task inclusion (computed retrospectively where possible), and list exclusion criteria such as tasks solvable via rote recall or lacking open-ended scientific reasoning components. This will strengthen the basis for interpreting the performance gap. revision: yes
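A minimal sketch of the two proposed validation checks, assuming binary pass/fail labels per rubric criterion and per inclusion decision. The function names, the 20% sampling helper, and the use of scikit-learn's `cohen_kappa_score` are illustrative assumptions, not the authors' implementation.

```python
import random
from sklearn.metrics import cohen_kappa_score

def calibration_study(ids, human_a, human_b, llm_jury, sample_frac=0.20, seed=0):
    """Post-hoc calibration: sample ~20% of graded rubric criteria and compare
    two human experts' 0/1 grades with each other and with the LLM jury.
    human_a, human_b, llm_jury map a criterion id to a 0/1 grade."""
    rng = random.Random(seed)
    sample = rng.sample(list(ids), k=max(1, int(sample_frac * len(ids))))
    a = [human_a[i] for i in sample]
    b = [human_b[i] for i in sample]
    j = [llm_jury[i] for i in sample]
    return {
        "human_vs_human_kappa": cohen_kappa_score(a, b),
        "human_a_vs_jury_kappa": cohen_kappa_score(a, j),
        "human_b_vs_jury_kappa": cohen_kappa_score(b, j),
    }

def inclusion_agreement(author_votes, reviewer_votes):
    """Retrospective inter-curator agreement on 0/1 task-inclusion decisions,
    reported as raw agreement and as Cohen's kappa."""
    raw = sum(int(x == y) for x, y in zip(author_votes, reviewer_votes)) / len(author_votes)
    return {"raw_agreement": raw, "kappa": cohen_kappa_score(author_votes, reviewer_votes)}
```

Reporting human-human kappa alongside human-jury kappa on the same sample is what would make it possible to tell whether the LLM jury is at least as consistent as expert graders on these rubrics.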
Circularity Check
No circularity: empirical benchmark evaluation on new tasks
Full rationale
The paper introduces COMPOSITE-STEM as a set of 70 expert-curated tasks and reports direct empirical performance scores (top model at 21%) obtained via exact-match, rubric, and LLM-as-jury grading. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The result is a measurement on newly defined tasks rather than any derivation that reduces to its own inputs by construction, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-a-jury grading using criterion-based rubrics produces reliable assessments of scientific outputs.
Reference graph
Works this paper leans on
- [1] Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099): 1139–1146, 2026. doi:10.1038/s41586-025-09962-4. URL https://doi.org/10.1038/s41586-025-09962-4
- [2] Harbor Framework. Harbor. GitHub repository, 2026. URL https://github.com/harbor-framework/harbor
- [3] Tadhg Looram, Lucas Nuzzi, Kyle Waters, and Steven Dillmann. Asymmetryzero: A framework for operationalizing human expert preferences as semantic evals. SSRN preprint, March 2026. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6497799
- [4] Mike A. Merrill et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026. URL https://arxiv.org/abs/2601.11868
- [5] Tejal Patwardhan et al. GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025. URL https://arxiv.org/abs/2510.04374
- [6] David Rein et al. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023. URL https://arxiv.org/abs/2311.12022
- [7] Bertie Vidgen et al. The AI Productivity Index: APEX-v1-extended. arXiv preprint arXiv:2509.25721, 2025. URL https://arxiv.org/abs/2509.25721
- [8]
- [9] Miles Wang et al. Frontierscience: Evaluating AI's ability to perform expert-level scientific tasks. arXiv preprint arXiv:2601.21165, 2026. URL https://arxiv.org/abs/2601.21165
discussion (0)