COMPOSITE-STEM
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
COMPOSITE-STEM introduces 70 expert-written tasks in physics, biology, chemistry, and mathematics on which the strongest of four frontier AI agents tested scores only 21 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COMPOSITE-STEM is a collection of 70 expert-curated tasks spanning physics, biology, chemistry, and mathematics that require agents to produce scientifically useful outputs. The evaluation combines exact-match checks with criterion-based rubrics and an LLM-as-a-jury protocol so that open-ended but meaningful answers can be scored. When four frontier models were tested with an adapted multimodal Terminus-2 agent harness in the Harbor evaluation framework, the highest score reached 21 percent, which the paper interprets as evidence that the benchmark measures capabilities still outside the reach of present systems.
What carries the argument
The COMPOSITE-STEM benchmark itself, built from 70 doctoral-level tasks and a hybrid grading protocol that pairs exact matching with LLM jury evaluation of rubric criteria.
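To picture that machinery, here is a minimal sketch of what a hybrid grader of this kind could look like. The `Task` fields, the majority-vote rule, and the `ask_jury_model` helper are illustrative assumptions; the paper's actual protocol and jury prompts are not specified here.

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    exact_answer: str | None   # set when a closed-form answer allows exact matching
    rubric: list[str]          # criterion-based rubric items for open-ended outputs

def ask_jury_model(model: str, criterion: str, output: str) -> bool:
    """Placeholder for a call to one jury LLM asking whether `output`
    satisfies `criterion`. Hypothetical helper, not a real API."""
    raise NotImplementedError

def grade(task: Task, output: str, jury_models: list[str]) -> float:
    # Exact-match path: normalize lightly and compare.
    if task.exact_answer is not None:
        return float(output.strip().lower() == task.exact_answer.strip().lower())

    # Rubric path: a criterion passes if a majority of jury models agree,
    # and the task score is the fraction of criteria passed.
    passed = 0
    for criterion in task.rubric:
        votes = [ask_jury_model(m, criterion, output) for m in jury_models]
        if sum(votes) * 2 > len(votes):
            passed += 1
    return passed / len(task.rubric)
```

Scoring each rubric criterion separately and then averaging, rather than asking the jury for a single holistic score, is what would let partially correct open-ended answers earn partial credit.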
If this is right
- Current agents need substantial advances in long-horizon planning and cross-domain integration before they can contribute reliably to scientific discovery.
- The open release of the 70 tasks allows direct comparison of new agents against the 21 percent baseline.
- Developers can use the rubric-based scoring to identify which specific scientific skills remain hardest for models.
- The benchmark supplies a concrete target for measuring whether agent improvements translate into usable scientific output rather than just higher scores on saturated tests.
Where Pith is reading between the lines
- Widespread adoption could redirect evaluation effort away from narrow math problems toward tasks that combine observation, hypothesis, and interpretation.
- Low scores may encourage training regimes that emphasize iterative experimental design rather than single-shot answers.
- The gap between 21 percent and perfect performance points to multimodal reasoning and sustained context management as likely bottlenecks worth targeted testing.
Load-bearing premise
That the 70 tasks curated by doctoral-level researchers combined with the LLM-as-a-jury grading protocol provide a valid and unbiased measure of AI capabilities for scientific discovery.
What would settle it
A future model that routinely scores above 50 percent on the released tasks would show the capability gap closing, while an independent panel of domain experts concluding that the tasks do not reflect typical open-ended scientific workflows would indicate that the benchmark does not capture the intended gap.
Original abstract
AI agents hold growing promise for accelerating scientific discovery; yet, a lack of frontier evaluations hinders adoption into real workflows. Expert-written benchmarks have proven effective at measuring AI reasoning, but most at this stage have become saturated and only measure performance on constrained outputs. To help address this gap, we introduce COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics, curated by doctoral-level researchers. Our benchmark combines exact-match grading and criterion-based rubrics with an LLM-as-a-jury grading protocol, allowing more flexible assessment of scientifically meaningful outputs. Using an adapted multimodal Terminus-2 agent harness within the Harbor agentic evaluation framework, we evaluate four frontier models. The top-performing model achieves 21%, demonstrating that COMPOSITE-STEM captures capabilities beyond current agent reach. All tasks are open-sourced with contributor permission to support reproducibility and to promote additional research towards AI's acceleration of scientific progress in these domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces COMPOSITE-STEM, a benchmark of 70 expert-written tasks in physics, biology, chemistry, and mathematics curated by doctoral-level researchers. It evaluates four frontier models via an adapted multimodal Terminus-2 agent harness in the Harbor framework, combining exact-match grading, criterion-based rubrics, and an LLM-as-a-jury protocol for flexible outputs. The top model scores 21%, which the authors interpret as evidence that the benchmark captures capabilities beyond current agent reach. All tasks are open-sourced.
Significance. If the scoring protocol proves reliable, the benchmark would address saturation in existing expert-written STEM evaluations by enabling assessment of open-ended scientific reasoning. The open-sourcing of all tasks with contributor permission is a strength that supports reproducibility and further research on AI for scientific discovery.
major comments (2)
- [Evaluation section] LLM-as-a-jury protocol: No calibration, inter-rater agreement statistics, or human-expert comparison is reported for the LLM jury combined with rubric scoring. This is load-bearing for the central claim, as the 21% top score is presented as demonstrating capabilities beyond current agents; without validation, the aggregate may reflect grading inconsistencies rather than genuine performance gaps on scientifically meaningful tasks.
- [Benchmark Description] Benchmark curation (70 tasks): Details on task validation by the doctoral-level curators, inter-judge agreement during curation, or explicit exclusion criteria are absent. This undermines interpretation of the low scores as a trustworthy signal of agent limitations, since the tasks themselves form the basis for the claim that COMPOSITE-STEM measures beyond current reach.
minor comments (1)
- [Abstract] The abstract states four models were evaluated but does not name them or the specific Terminus-2 adaptations; adding these would improve clarity without altering the results.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review of our manuscript on COMPOSITE-STEM. We appreciate the emphasis on rigorous validation of both the grading protocol and task curation to support the benchmark's claims. We respond to each major comment below and will incorporate revisions to address the points raised.
Point-by-point responses
- Referee: [Evaluation section] LLM-as-a-jury protocol: No calibration, inter-rater agreement statistics, or human-expert comparison is reported for the LLM jury combined with rubric scoring. This is load-bearing for the central claim, as the 21% top score is presented as demonstrating capabilities beyond current agents; without validation, the aggregate may reflect grading inconsistencies rather than genuine performance gaps on scientifically meaningful tasks.
  Authors: We agree that the absence of calibration and agreement statistics for the LLM-as-a-jury protocol is a limitation that weakens confidence in the reported scores. In the revised manuscript, we will add a dedicated subsection under Evaluation describing a post-hoc calibration study: a random 20% sample of model outputs will be independently scored by two human experts using the identical rubrics, with inter-rater agreement (Cohen's kappa) and agreement with the LLM jury reported, as in the sketch following these responses. This will directly test whether grading inconsistencies could explain the 21% ceiling. revision: yes
- Referee: [Benchmark Description] Benchmark curation (70 tasks): Details on task validation by the doctoral-level curators, inter-judge agreement during curation, or explicit exclusion criteria are absent. This undermines interpretation of the low scores as a trustworthy signal of agent limitations, since the tasks themselves form the basis for the claim that COMPOSITE-STEM measures beyond current reach.
  Authors: The tasks were developed and initially validated by doctoral-level domain experts, with each task cross-reviewed by at least one additional curator for scientific accuracy and clarity. However, we did not include formal inter-curator agreement metrics or explicit exclusion criteria in the submitted manuscript. In revision, we will expand the Benchmark Description section to document the curation workflow, report inter-curator agreement on task inclusion (computed retrospectively where possible), and list exclusion criteria such as tasks solvable via rote recall or lacking open-ended scientific reasoning components. This will strengthen the basis for interpreting the performance gap. revision: yes
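A minimal sketch of the two proposed validation checks, assuming binary pass/fail labels per rubric criterion and per inclusion decision. The function names, the 20% sampling helper, and the use of scikit-learn's `cohen_kappa_score` are illustrative assumptions, not the authors' implementation.

```python
import random
from sklearn.metrics import cohen_kappa_score

def calibration_study(ids, human_a, human_b, llm_jury, sample_frac=0.20, seed=0):
    """Post-hoc calibration: sample ~20% of graded rubric criteria and compare
    two human experts' 0/1 grades with each other and with the LLM jury.
    human_a, human_b, llm_jury map a criterion id to a 0/1 grade."""
    rng = random.Random(seed)
    sample = rng.sample(list(ids), k=max(1, int(sample_frac * len(ids))))
    a = [human_a[i] for i in sample]
    b = [human_b[i] for i in sample]
    j = [llm_jury[i] for i in sample]
    return {
        "human_vs_human_kappa": cohen_kappa_score(a, b),
        "human_a_vs_jury_kappa": cohen_kappa_score(a, j),
        "human_b_vs_jury_kappa": cohen_kappa_score(b, j),
    }

def inclusion_agreement(author_votes, reviewer_votes):
    """Retrospective inter-curator agreement on 0/1 task-inclusion decisions,
    reported as raw agreement and as Cohen's kappa."""
    raw = sum(int(x == y) for x, y in zip(author_votes, reviewer_votes)) / len(author_votes)
    return {"raw_agreement": raw, "kappa": cohen_kappa_score(author_votes, reviewer_votes)}
```

Reporting human-human kappa alongside human-jury kappa on the same sample is what would make it possible to tell whether the LLM jury is at least as consistent as expert graders on these rubrics.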
Circularity Check
No circularity: empirical benchmark evaluation on new tasks
Full rationale
The paper introduces COMPOSITE-STEM as a set of 70 expert-curated tasks and reports direct empirical performance scores (top model at 21%) obtained via exact-match, rubric, and LLM-as-jury grading. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The result is a measurement on newly defined tasks rather than any derivation that reduces to its own inputs by construction, satisfying the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-a-jury grading using criterion-based rubrics produces reliable assessments of scientific outputs.
Reference graph
Works this paper leans on
- [1] Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099): 1139–1146, 2026. doi:10.1038/s41586-025-09962-4. URL https://doi.org/10.1038/s41586-025-09962-4
- [2] Harbor Framework. Harbor. GitHub repository, 2026. URL https://github.com/harbor-framework/harbor
- [3] Tadhg Looram, Lucas Nuzzi, Kyle Waters, and Steven Dillmann. Asymmetryzero: A framework for operationalizing human expert preferences as semantic evals. SSRN preprint, March 2026. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6497799
- [4] Mike A. Merrill et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026. URL https://arxiv.org/abs/2601.11868
- [5] Tejal Patwardhan et al. GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025. URL https://arxiv.org/abs/2510.04374
- [6] David Rein et al. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023. URL https://arxiv.org/abs/2311.12022
- [7] Bertie Vidgen et al. The AI Productivity Index: APEX-v1-extended. arXiv preprint arXiv:2509.25721, 2025. URL https://arxiv.org/abs/2509.25721
- [8]
- [9] Miles Wang et al. Frontierscience: Evaluating AI's ability to perform expert-level scientific tasks. arXiv preprint arXiv:2601.21165, 2026. URL https://arxiv.org/abs/2601.21165
discussion (0)