pith. machine review for the scientific record.

arxiv: 2604.10718 · v1 · submitted 2026-04-12 · 💻 cs.AI

Recognition: unknown

SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords: large language models · scientific experiment prediction · benchmark evaluation · AI reliability · physics/biology/chemistry tasks · model calibration · experimental guidance

The pith

LLMs achieve 14-26% accuracy predicting real scientific experiment outcomes, similar to human experts at 20%, but cannot judge when their predictions are reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SciPredict, a set of 405 tasks drawn from recent published studies across 33 subfields in physics, biology, and chemistry, to test whether large language models can forecast what an experiment will actually show. Current models reach 14-26% accuracy on these tasks while human experts reach about 20%. Some frontier models edge past the human baseline, yet overall performance stays far too low to guide real laboratory decisions. Models also show no useful self-awareness: their accuracy remains around 20% whether they express high confidence or judge an outcome as unpredictable without running the experiment. Human experts, by contrast, improve from roughly 5% to 80% accuracy as they rate outcomes more predictable from theory alone.
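To make the calibration contrast concrete, the sketch below shows the computation behind such a curve: group predictions by self-reported confidence and measure accuracy within each bin. The record fields and bin count are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a calibration curve: per-confidence-bin accuracy.
# Field names ('confidence', 'correct') are illustrative, not SciPredict's schema.
from collections import defaultdict

def calibration_curve(records, n_bins=5):
    """records: iterable of dicts with 'confidence' in [0, 1] and 'correct' (bool)."""
    bins = defaultdict(list)
    for r in records:
        # Clamp so confidence == 1.0 lands in the top bin.
        b = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[b].append(r["correct"])
    return {(b / n_bins, (b + 1) / n_bins): sum(v) / len(v)
            for b, v in sorted(bins.items())}

# A calibrated predictor shows accuracy rising with confidence, as the human
# experts do (~5% -> ~80%); the paper reports models staying near 20% in every bin.
```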

Core claim

SciPredict reveals that LLMs achieve accuracies between 14% and 26% on predicting outcomes of 405 real scientific experiments across physics, biology, and chemistry, compared to human experts at approximately 20%. Although some advanced models surpass human performance, the overall level falls short of enabling reliable experimental guidance. Models do not distinguish reliable from unreliable predictions, maintaining about 20% accuracy irrespective of their expressed confidence or judgments of predictability. Human experts, however, show strong calibration, with accuracy increasing from 5% to 80% as they rate outcomes more predictable without physical tests. This demonstrates that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability.

What carries the argument

The SciPredict benchmark of 405 tasks drawn directly from recent empirical papers in 33 specialized sub-fields, used to measure both raw prediction accuracy and calibration of expressed confidence.
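As a concrete illustration of what one such task carries, here is a minimal sketch of a record and of raw MCQ scoring. The field names follow those visible in the paper's evaluation prompts (experimental_setup, measurements_taken, background_knowledge); the released dataset's exact schema may differ.

```python
# Sketch of a SciPredict-style task record and raw MCQ accuracy.
# Schema is assumed from field names in the paper's prompts, not confirmed.
from dataclasses import dataclass

@dataclass
class PredictionTask:
    domain: str                 # "physics" | "biology" | "chemistry"
    experimental_setup: str
    measurements_taken: str
    background_knowledge: str   # expert-curated context (the BK condition)
    question: str
    answer_format: str          # "MCQ" | "NUM" | "FF"
    ground_truth_answer: str

def mcq_accuracy(tasks, predictions):
    """Raw accuracy over MCQ tasks; numerical and free-form answers
    need tolerance checks or a rubric-based judge instead."""
    pairs = [(t, p) for t, p in zip(tasks, predictions) if t.answer_format == "MCQ"]
    return sum(t.ground_truth_answer == p for t, p in pairs) / len(pairs)
```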

If this is right

  • AI systems cannot yet reliably select which experiments to run without physical validation.
  • Raw predictive accuracy alone is insufficient; models must also improve at knowing when their output is trustworthy.
  • Current prompting and training methods do not produce the kind of uncertainty awareness humans demonstrate.
  • Benchmarks focused solely on knowledge recall miss the calibration failures that limit practical use in research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A follow-up benchmark using only post-cutoff experiments would isolate whether models are retrieving memorized results or performing genuine forward prediction.
  • Training objectives that reward accurate self-assessment of uncertainty could close the calibration gap observed between models and humans.
  • The benchmark's structure, built from published studies, could be reused to track progress as new models are released.

Load-bearing premise

That the 405 tasks fairly represent the difficulty of real experimental prediction, free of leakage from model training data, and that the human expert baseline supplies an unbiased comparison.

What would settle it

A test of the same models on experiments published after every model's training cutoff date, measuring whether accuracy remains below 30% or rises substantially.
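Operationally, such a test is a filtering step before scoring: keep only tasks whose source paper appeared after the model's training cutoff. The sketch below assumes per-task publication dates and a known cutoff, neither of which is guaranteed by the released data.

```python
# Sketch: evaluate only on tasks published after a model's training cutoff.
# The 'published' field and the cutoff date are illustrative assumptions;
# Figure 3 describes a post-March 2025 collection cutoff for the benchmark.
from datetime import date

def post_cutoff_subset(tasks, model_cutoff):
    """Keep tasks whose source paper appeared strictly after the cutoff."""
    return [t for t in tasks if t["published"] > model_cutoff]

tasks = [{"id": 1, "published": date(2026, 1, 15)},
         {"id": 2, "published": date(2024, 6, 1)}]
fresh = post_cutoff_subset(tasks, model_cutoff=date(2025, 3, 31))
# If accuracy on `fresh` stays below ~30%, memorization is an unlikely
# explanation for the headline numbers.
```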

Figures

Figures reproduced from arXiv: 2604.10718 by Andrea Toledo, Bing Liu, Chen Bo Calvin Zhang, Ed-Yeremai Hernandez Cardona, Elaine Lau, Erich Liang, Ernesto Gabriel Hernandez Montoya, Furong Huang, Guillermo Mangialardi, Haniyeh Ehsani Oskouie, Nicholas Johnson, Paula Vergara, Pavi Bhatter, Sergio Fonrouge, Shayan Shabihi, Udari Madhushani Sehwag, Utkarsh Tyagi.

Figure 1: Key findings of SciPredict. Frontier models exhibit fundamental gaps in accuracy and calibration robustness in scientific experiment outcome prediction. We highlight four key failure modes using a representative subset of SOTA models: Claude O4.5 (Claude Opus 4.5), OpenAI GPT-5.2, Gemini 3P (Gemini 3 Pro), Llama 3.3 (Meta Llama 3.3 70B), and Qwen 3 235B. (a) Providing expert-curated background knowledge (B… view at source ↗

Figure 2: LLM-enhanced efficient scientific research workflow. The figure illustrates how LLM-powered experimental outcome prediction can be integrated into the scientific research process. Phase 1 involves ideation and experimental design through literature review and hypothesis formulation. Phase 2 represents a fast, low-cost prediction loop where LLMs predict experimental outcomes and identify high-potential expe… view at source ↗

Figure 3: SciPredict curation pipeline. The benchmark construction involves four integrated stages: (Top-Left) Data Collection from preprint repositories ensuring a post-March 2025 cutoff to prevent data leakage; (Top-Middle) Expert Annotation where domain specialists convert raw papers into MCQ, numerical, and free-form prediction tasks; (Bottom-Left) Task Structure enforcement, ensuring every sample includes granu… view at source ↗

Figure 4: Accuracy with and without background knowledge. Accuracy (%) of each evaluated model under two input conditions: (a) W/o background knowledge: the model receives only the experimental setup, measurements, and the question; (b) W/ background knowledge: the same information as the previous case with the addition of expert-curated background knowledge. Finding #1: Human performance is close to the average model… view at source ↗

Figure 6: Human vs. self-generated background knowledge. Evaluated accuracy (%) for the models under the four prediction conditions defined in §4.3. BK generally yields the highest accuracy, while SBK frequently degrades accuracy relative to NBK, indicating that models fail to reliably generate useful predictive context. Furthermore, SABK rarely improves upon BK, suggesting that adding synthetic information lik… view at source ↗

Figure 7: Models are poorly calibrated in self-reported confidence, difficulty, and feasibility, whereas human calibration correlates with accuracy. The top row plots empirical accuracy (y-axis) against model-provided confidence/difficulty/feasibility metrics. Each circle represents a model at a particular confidence/difficulty/feasibility level, and its color corresponds to the percentage of the number of questions… view at source ↗

Figure 8: Analysis of model errors. We classify errors into a hierarchical taxonomy spanning five top-level (in black background) categories and 16 specific error types. The heatmap shows the percentage of incorrect responses containing each error type for each evaluated model. Error categories (as defined in Tab. 8) progress from surface-level issues (e.g., Comprehension & Scope) to deeper reasoning failures (e.g., Logical … view at source ↗

Figure 9: Accuracy is highly sensitive to the question format. Answer format drives large swings in model accuracy under identical prediction tasks. We evaluate each model in the NBK setting across four response formats: multiple-choice questions (MCQ), free-form (FF), numerical value (NUM), and MCQs rewritten into matched free-form prompts (MCQ→FF). Accuracy is generally the highest for MCQs, lower for free-form, a… view at source ↗

Figure 10: Domain-specific accuracy. Heatmap of model accuracy (%) on benchmark questions, broken down by scientific domain (Biology, Physics, Chemistry). Results are provided for the evaluated models and human baseline. Overall, frontier models achieve the highest accuracies, but performance varies by domain; Chemistry tends to be the most challenging subset, and several models (including the human baseline) exhib… view at source ↗

Figure 11: Model accuracy on SciPredict correlates with performance on the HLE benchmark. Benchmark performance correlates with general hard-reasoning performance. This is a scatter plot of each evaluated model's accuracy on SciPredict in the no-background-knowledge (NBK) setting (y-axis) versus its HLE text-only accuracy (x-axis). The solid line shows a linear fit and the shaded region indicates the corresponding … view at source ↗

Figure 12: Diversity of the experts recruited for benchmark construction and human baseline. Top left: A plot of the highest degree distribution of experts recruited for benchmark construction. Top center: A plot of the domain expertise of experts recruited for benchmark construction. Top right: A plot of the highest degree distribution of experts recruited for human baseline. Bottom left: A heatmap of the countries… view at source ↗

Figure 13: Analysis of model errors for highly feasible questions. We employ an LLM judge to systematically classify errors in model predictions according to a hierarchical taxonomy spanning five top-level (in black background) categories and 16 specific error types. The heatmap shows the percentage of incorrect responses containing each error type for each evaluated model. Error categories progress from surface-level… view at source ↗
original abstract

Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is $\approx$20%. Although some frontier models exceed human performance model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only $\approx$20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from $\approx$5% to $\approx$80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SciPredict, a benchmark of 405 tasks derived from recent empirical studies in 33 sub-fields of physics, biology, and chemistry. It evaluates whether LLMs can predict experimental outcomes, reporting accuracies of 14-26% for models versus approximately 20% for human experts. The work further examines calibration, showing that models maintain roughly 20% accuracy regardless of confidence level or self-assessed predictability, while human experts calibrate strongly (accuracy rising from ~5% to ~80% as predictability increases). Data and code are released for reproducibility.

Significance. If the benchmark construction and human-model comparison prove robust, the results would provide concrete evidence of current LLMs' limitations in experimental prediction and uncertainty awareness, underscoring that superhuman performance in science requires both accurate forecasts and awareness of when those forecasts can be trusted. The public release of the 405 tasks and evaluation code is a clear strength that supports follow-on work.

major comments (3)
  1. [§3] §3 (Task Construction): The claim that the 405 tasks are free of data leakage from model training corpora is central to interpreting the 14-26% accuracies as genuine generalization rather than memorization, yet the manuscript provides no explicit verification procedure (e.g., n-gram overlap checks, date-cutoff analysis, or contamination audits against the evaluated models' training data). This directly affects the soundness of the central claim. (An illustrative overlap check is sketched after these comments.)
  2. [§4.3] §4.3 (Human Baseline): The human expert performance of ≈20% and the calibration curve (5% to 80%) are load-bearing for the contrast with models, but details on expert selection criteria, domain expertise matching to the 33 sub-fields, and prompt/information parity with the LLM setup are insufficient to rule out bias in the comparison.
  3. [§5] §5 (Results): The statement that 'some frontier models exceed human performance' requires accompanying statistical tests (e.g., McNemar or bootstrap confidence intervals on the 405-task set) to establish whether the small margins are significant or attributable to sampling variability.
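On the first major point, a crude contamination screen is possible even without access to proprietary training data: check verbatim n-gram overlap between task text and a public corpus proxy. The sketch below uses 13-grams, a common heuristic; it is an approximation, not the audit the report asks for.

```python
# Sketch: n-gram overlap screen between benchmark text and a corpus proxy.
# A coarse contamination heuristic (13-grams), not a full training-data audit.
def ngrams(text, n=13):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(task_text, corpus_docs, n=13):
    """Fraction of the task's n-grams appearing verbatim in the corpus proxy."""
    task_grams = ngrams(task_text, n)
    if not task_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(d, n) for d in corpus_docs))
    return len(task_grams & corpus_grams) / len(task_grams)
```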
minor comments (2)
  1. [Abstract] Abstract: The sentence 'Although some frontier models exceed human performance model accuracy is still far below...' is missing punctuation and should read 'performance, model accuracy is still far below...'.
  2. [Results] Figure 2 or equivalent calibration plot: Axis labels and legend should explicitly state the number of tasks per bin to allow readers to assess the reliability of the 5%-80% human calibration curve.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and commit to revisions that enhance the rigor of our claims without altering the core findings.

point-by-point responses
  1. Referee: §3 (Task Construction): The claim that the 405 tasks are free of data leakage from model training corpora is central to interpreting the 14-26% accuracies as genuine generalization rather than memorization, yet the manuscript provides no explicit verification procedure (e.g., n-gram overlap checks, date-cutoff analysis, or contamination audits against the evaluated models' training data). This directly affects the soundness of the central claim.

    Authors: We selected all 405 tasks exclusively from papers released after March 2025, beyond the documented training cutoffs of the evaluated models. While this temporal filtering substantially reduces the risk of leakage, we agree that an explicit verification procedure strengthens the interpretation. In the revised manuscript we will add a dedicated subsection in §3 that reports (i) n-gram overlap analysis against publicly available pre-training corpora proxies and (ii) a date-cutoff audit confirming zero overlap with model training windows. We will also note the practical limits of full contamination audits given proprietary training data. revision: yes

  2. Referee: §4.3 (Human Baseline): The human expert performance of ≈20% and the calibration curve (5% to 80%) are load-bearing for the contrast with models, but details on expert selection criteria, domain expertise matching to the 33 sub-fields, and prompt/information parity with the LLM setup are insufficient to rule out bias in the comparison.

    Authors: We concur that greater transparency is required. The revised §4.3 will explicitly state: (1) recruitment criteria (PhD-level researchers and postdocs with at least two first-author publications in the target sub-field), (2) domain-matching protocol (experts were assigned tasks only within their documented publication areas across the 33 sub-fields), and (3) information parity (humans received identical task statements, background paragraphs, and output format instructions as the LLMs, with no additional literature access). These additions will allow readers to assess potential bias directly. revision: yes

  3. Referee: §5 (Results): The statement that 'some frontier models exceed human performance' requires accompanying statistical tests (e.g., McNemar or bootstrap confidence intervals on the 405-task set) to establish whether the small margins are significant or attributable to sampling variability.

    Authors: We accept this recommendation. The revised §5 will report 95% bootstrap confidence intervals for all accuracy figures and will include McNemar’s test (with continuity correction) for paired comparisons between each frontier model and the human baseline across the 405 tasks. These tests will quantify whether observed differences (e.g., 26% vs. 20%) are statistically distinguishable from sampling variability. revision: yes
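The statistical machinery the rebuttal commits to is simple to sketch. The implementation below is our illustration, not the authors' code: a percentile bootstrap CI on accuracy and a continuity-corrected McNemar statistic over paired per-task outcomes.

```python
# Sketch: bootstrap CI on accuracy and McNemar's test (continuity-corrected)
# for paired model-vs-human comparisons over the 405 tasks. Illustrative only.
import random

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile CI for accuracy; `correct` is a list of 0/1 per task."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return accs[int(alpha / 2 * n_boot)], accs[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_chi2(model_correct, human_correct):
    """Continuity-corrected McNemar statistic on paired correctness vectors."""
    b = sum(m and not h for m, h in zip(model_correct, human_correct))  # model only
    c = sum(h and not m for m, h in zip(model_correct, human_correct))  # human only
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)  # compare to chi2(1): 3.84 at p = 0.05
```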

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with external validation

full rationale

The paper introduces SciPredict as an empirical benchmark consisting of 405 tasks drawn from recent published studies across 33 sub-fields. Model and human accuracies are measured directly against the actual experimental outcomes reported in those studies, with no mathematical derivation, equations, fitted parameters, or ansatz involved. Human expert baselines and model prompting setups are compared to these external ground-truth results rather than reducing to any internal definition or self-citation chain. No load-bearing steps rely on self-citations for uniqueness theorems or renamings of known results. The evaluation is self-contained against external benchmarks (real study outcomes and independent human experts), satisfying the criteria for a non-circular empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions in LLM benchmarking, such as the tasks being representative of real science and the absence of training-data contamination; no free parameters, axioms beyond standard evaluation practice, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5650 in / 1161 out tokens · 35938 ms · 2026-05-10T15:52:35.754253+00:00 · methodology

