pith. machine review for the scientific record.

arxiv: 2512.15567 · v2 · submitted 2025-12-17 · 💻 cs.AI · cond-mat.mtrl-sci · cs.LG · physics.chem-ph

Recognition: no theorem link

Evaluating Large Language Models in Scientific Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:42 UTC · model grok-4.3

classification 💻 cs.AI · cond-mat.mtrl-sci · cs.LG · physics.chem-ph
keywords large language models · scientific discovery · benchmark evaluation · hypothesis generation · experiment design · biology · chemistry · physics

The pith

State-of-the-art LLMs exhibit a consistent performance gap on scientific discovery tasks relative to general science benchmarks, with diminishing returns from larger models and shared weaknesses across providers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scenario-grounded evaluation framework that decomposes real research projects in biology, chemistry, materials, and physics into modular scenarios with vetted questions. It tests LLMs both on individual question accuracy and on full project-level work including hypothesis proposal, experiment design, and result interpretation. When applied to current top models, the framework shows lower scores than on standard science tests, little additional gain from scaling size or reasoning effort, and consistent blind spots shared by models from different developers. The work also notes high variation across scenarios, meaning no single model dominates every project, yet some LLMs still succeed on certain discovery tasks even when scenario scores are low.

Core claim

A two-phase scientific discovery evaluation (SDE) framework, built from expert-defined research projects decomposed into modular scenarios, reveals that state-of-the-art LLMs underperform relative to general science benchmarks, display diminishing returns from scaling, and share systematic weaknesses across providers, while still showing promise in selected discovery projects driven by guided exploration.

What carries the argument

The two-phase SDE framework that first measures question-level accuracy on scenario-tied items and then evaluates project-level performance on hypothesis generation, experiment design, and result interpretation.
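
A minimal sketch of how that two-phase scoring could be organized, assuming a simple data model; the names here (Scenario, Project, the agent methods, rubric.grade) are illustrative placeholders, not the paper's released harness:

    from dataclasses import dataclass

    # Hypothetical data model for the two-phase SDE evaluation (illustrative).
    @dataclass
    class Scenario:
        name: str
        questions: list  # vetted (prompt, reference_answer) pairs

    @dataclass
    class Project:
        name: str
        scenarios: list  # expert-defined modular decomposition

    def phase1_question_accuracy(model, project):
        """Phase 1: question-level accuracy on scenario-tied items."""
        return {
            sc.name: sum(model.answer(q) == a for q, a in sc.questions)
            / len(sc.questions)
            for sc in project.scenarios
        }

    def phase2_project_score(model, project, rubric):
        """Phase 2: project-level performance on hypothesis proposal,
        experiment design, and result interpretation, graded by a rubric."""
        hypothesis = model.propose_hypothesis(project.name)
        design = model.design_experiment(hypothesis)
        interpretation = model.interpret_results(design)
        return rubric.grade(hypothesis, design, interpretation)

The split is what lets the paper observe that the two levels can dissociate: a model may score well on Phase 1 items yet fail the project, or contribute to a project despite low scenario scores.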

If this is right

  • No current LLM approaches general scientific "superintelligence": performance varies sharply across scenarios, and no single model leads every project.
  • Guided exploration and serendipity remain important even when individual scenario accuracy is low, allowing LLMs to contribute to some discovery projects today.
  • Scaling model size and adding more reasoning steps will not close the gap on discovery tasks without targeted improvements in hypothesis and experiment design.
  • The SDE framework provides a reproducible way to track progress toward discovery-relevant capabilities beyond decontextualized knowledge tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If scenario scores remain low while project-level success occurs in isolated cases, future models may benefit from hybrid human-AI loops that supply the missing modular accuracy.
  • The shared weaknesses across providers suggest the bottleneck lies in the training data or objective rather than any single architecture choice.
  • A practical next step would be to expand the benchmark to include longitudinal projects that span multiple rounds of hypothesis testing and revision.

Load-bearing premise

That expert-defined research projects and their modular decomposition accurately represent the iterative hypothesis generation, observation, and interpretation central to actual scientific discovery.

What would settle it

A head-to-head comparison in which the same set of expert-defined projects is executed by both current LLMs and human researchers, with success measured by whether the hypotheses and experiments yield verifiable new findings within a fixed budget of trials.
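
A hedged sketch of that protocol, assuming an external verify oracle (for example, a simulation or wet-lab check that confirms a finding is new and correct) and a shared agent interface; everything here is hypothetical scaffolding rather than a described implementation:

    def run_project(agent, project, verify, budget=10):
        # Attempt one project within a fixed budget of trials; `verify` is
        # an external oracle confirming a verifiable new finding.
        for _ in range(budget):
            hypothesis = agent.propose_hypothesis(project)
            result = agent.run_experiment(hypothesis)
            if verify(project, hypothesis, result):
                return True
        return False

    def head_to_head(llm, human, projects, verify, budget=10):
        # Success rates of the LLM and the human researcher on the
        # same expert-defined projects under the same trial budget.
        llm_rate = sum(run_project(llm, p, verify, budget)
                       for p in projects) / len(projects)
        human_rate = sum(run_project(human, p, verify, budget)
                         for p in projects) / len(projects)
        return llm_rate, human_rate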

read the original abstract

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a two-phase Scientific Discovery Evaluation (SDE) framework for LLMs. Domain experts define research projects of genuine interest in biology, chemistry, materials, and physics, then decompose them into modular research scenarios from which vetted questions are sampled. Models are scored on (i) question-level accuracy within scenarios and (ii) project-level tasks requiring proposal of testable hypotheses, design of simulations/experiments, and interpretation of results. Application to state-of-the-art LLMs shows a consistent performance gap versus general science benchmarks, diminishing returns from scaling model size and reasoning, and systematic weaknesses shared across providers. Scenario-level variation produces different best-performing models per project, indicating current LLMs remain distant from general scientific superintelligence, while also demonstrating promise in diverse projects via guided exploration and serendipity. The framework is offered as a reproducible benchmark to advance LLM development for discovery.

Significance. If the SDE framework validly measures discovery-relevant capabilities, the results would be significant for AI-for-science research by supplying a more contextually grounded alternative to decontextualized knowledge benchmarks. The work would document concrete limitations (performance gaps, scaling plateaus, shared weaknesses) and the practical utility of LLMs even when scenario scores are low, while providing a reproducible evaluation tool and charting development paths. The emphasis on expert-defined projects and the observation of model variation across scenarios are useful contributions.

major comments (2)
  1. [§3] §3 (SDE Framework): The central claims of performance gaps, diminishing scaling returns, and shared weaknesses rest on the assumption that expert decomposition of projects into static modular scenarios accurately captures the iterative reasoning, hypothesis generation, and observation interpretation that drive discovery. Real discovery routinely involves emergent feedback loops in which an unexpected observation invalidates prior modules and forces revision of the research trajectory; if the modular structure does not accommodate such non-linear dynamics, the measured gaps and plateaus could be artifacts of the evaluation design rather than intrinsic model limits.
  2. [§4] §4 (Empirical Results): The headline findings (consistent gap relative to general benchmarks, diminishing returns on scaling, systematic weaknesses) are presented without reference to specific quantitative metrics, error bars, statistical tests, or exclusion criteria for the evaluated models and scenarios. If these details and robustness checks are not supplied with the data in the results section, the load-bearing claims cannot be rigorously assessed.

minor comments (2)
  1. [Abstract] Abstract: The phrase 'diminishing return of scaling up model sizes and reasoning' is imprecise; clarify whether this refers to parameter count, inference-time compute, or specific reasoning techniques, and state the models and scaling dimensions examined.
  2. [§3] The manuscript would benefit from an explicit statement of the number of projects, scenarios, and questions per domain, along with inter-expert agreement metrics for the decomposition and question vetting process.
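
For the second minor comment, one standard way to report vetting agreement is Cohen's kappa; a minimal sketch with hypothetical accept/reject labels from two experts (the data below is invented for illustration):

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical accept(1)/reject(0) vetting labels from two experts
    # over the same candidate questions.
    expert_a = [1, 1, 0, 1, 0, 1, 1, 0]
    expert_b = [1, 0, 0, 1, 0, 1, 1, 1]

    print(f"Cohen's kappa: {cohen_kappa_score(expert_a, expert_b):.2f}")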

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments, which help clarify the scope and presentation of the SDE framework. We address each major point below and have revised the manuscript accordingly to strengthen the discussion of limitations and the reporting of quantitative results.

read point-by-point responses
  1. Referee: §3 (SDE Framework): The central claims of performance gaps, diminishing scaling returns, and shared weaknesses rest on the assumption that expert decomposition of projects into static modular scenarios accurately captures the iterative reasoning, hypothesis generation, and observation interpretation that drive discovery. Real discovery routinely involves emergent feedback loops in which an unexpected observation invalidates prior modules and forces revision of the research trajectory; if the modular structure does not accommodate such non-linear dynamics, the measured gaps and plateaus could be artifacts of the evaluation design rather than intrinsic model limits.

    Authors: We agree that full scientific discovery is iterative and non-linear. The SDE framework deliberately decomposes projects into modular scenarios to enable controlled, reproducible evaluation of specific capabilities (hypothesis proposal, experiment design, result interpretation) that are necessary but not sufficient for end-to-end discovery. Project-level tasks do require models to generate and justify hypotheses and interpret simulated outcomes, which introduces a limited form of feedback within each scenario. We do not claim the current design fully replicates open-ended iterative loops; rather, it isolates measurable components whose aggregate performance already reveals consistent gaps relative to general benchmarks. In the revised manuscript we have added an explicit limitations subsection in §3 and §5 that acknowledges this approximation and discusses how future extensions could incorporate dynamic scenario revision based on model-generated observations. revision: partial

  2. Referee: §4 (Empirical Results): The headline findings (consistent gap relative to general benchmarks, diminishing returns on scaling, systematic weaknesses) are presented without reference to specific quantitative metrics, error bars, statistical tests, or exclusion criteria for the evaluated models and scenarios. If these details and robustness checks are not supplied with the data in the results section, the load-bearing claims cannot be rigorously assessed.

    Authors: We accept this criticism. While the original submission reported per-scenario accuracies and aggregate project scores, it omitted error bars, formal statistical comparisons, and explicit exclusion criteria. The revised §4 now includes: (i) mean accuracies with standard errors across 5 independent runs per model, (ii) paired t-tests and Wilcoxon tests for benchmark comparisons and scaling trends (all p < 0.01 for the reported gaps), (iii) a table of model and scenario inclusion criteria (e.g., only models with >100B parameters and scenarios with ≥10 vetted questions), and (iv) robustness checks that recompute headline metrics after removing the two lowest-scoring scenarios per project. These additions are also summarized in a new supplementary table. revision: yes
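
The comparisons described above are reproducible with standard tooling; a minimal sketch with synthetic stand-ins for the per-scenario accuracies (the actual numbers belong to the revised §4):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic stand-ins: accuracy over 5 runs x 20 scenarios for one model
    # on the SDE benchmark and on a general science benchmark.
    sde = rng.normal(0.55, 0.05, size=(5, 20))
    general = rng.normal(0.80, 0.05, size=(5, 20))

    # Mean accuracy with standard error across the 5 independent runs.
    mean = sde.mean()
    se = sde.mean(axis=1).std(ddof=1) / np.sqrt(sde.shape[0])

    # Paired comparisons over run-averaged per-scenario accuracies.
    t_stat, t_p = stats.ttest_rel(sde.mean(axis=0), general.mean(axis=0))
    w_stat, w_p = stats.wilcoxon(sde.mean(axis=0), general.mean(axis=0))
    print(f"SDE accuracy {mean:.3f} ± {se:.3f} (SE); "
          f"paired t p={t_p:.2g}, Wilcoxon p={w_p:.2g}")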

Circularity Check

0 steps flagged

No significant circularity in the SDE benchmark or empirical claims

full rationale

The paper defines a new evaluation framework by having domain experts specify research projects and decompose them into modular scenarios, then samples questions from those scenarios to produce an empirical benchmark. LLMs are run on the resulting items to obtain question-level accuracy and project-level scores for hypothesis generation, experiment design, and result interpretation. All headline claims (performance gap versus general benchmarks, diminishing scaling returns, shared systematic weaknesses, and variation in best model per scenario) are direct numerical outcomes of these model evaluations on the constructed test set. No equations, fitted parameters, or self-citations serve as load-bearing premises, and the reported results do not follow by construction from the framework's own definitions. The derivation chain is therefore self-contained and externally falsifiable by replication on the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that the chosen expert scenarios serve as faithful proxies for real scientific discovery processes.

axioms (1)
  • domain assumption: Expert-defined research projects and their modular decomposition accurately represent the iterative reasoning and hypothesis generation central to scientific discovery.
    The benchmark's validity rests on this premise, stated in the abstract's description of how scenarios are created.

pith-pipeline@v0.9.0 · 5802 in / 1226 out tokens · 35976 ms · 2026-05-16T21:42:30.288631+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

cs.AI · 2026-05 · unverdicted · novelty 7.0

    LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

  2. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05 · unverdicted · novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  3. Human Cognition in Machines: A Unified Perspective of World Models

cs.RO · 2026-04 · unverdicted · novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  4. Knowledge without Wisdom: Measuring Misalignment between LLMs and Intended Impact

cs.LG · 2026-03 · unverdicted · novelty 6.0

    Shared biases across LLMs from common pretraining misalign with teaching quality and negatively correlate with intended student learning outcomes, with model ensembles amplifying the misalignment.

  5. Heterogeneous Scientific Foundation Model Collaboration

cs.AI · 2026-04 · unverdicted · novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 5 Pith papers · 18 internal anchors
