arxiv: 2605.22681 · v1 · pith:OF3FRH5Qnew · submitted 2026-05-21 · 💻 cs.AI

Forecasting Scientific Progress with Artificial Intelligence

Pith reviewed 2026-05-22 05:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords scientific forecasting · AI evaluation · benchmark · knowledge cutoff · temporal prediction · scientific progress · uncertainty estimation

The pith

Current AI systems cannot reliably predict whether or when scientific advances will occur.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests current AI models on their ability to forecast real scientific events using only knowledge available before those events. It finds that models can often select plausible research directions but fail at judging whether advances will happen or when they will arrive. Performance improves when models are given information after the events have occurred, but pre-event knowledge alone does not close the gap. This holds across disciplines and suggests that access to historical data does not produce reliable forward-looking predictions.

Core claim

We introduce the CUSP benchmark covering 4,760 scientific events and show that frontier models exhibit systematic limitations in forecasting progress: they misestimate timing, overstate feasibility, and gain more from post-event information than from pre-cutoff knowledge. Performance varies by domain but is largely insensitive to whether an event falls before or after a model's training cutoff.

What carries the argument

The CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, which measures AI forecasting ability through four tasks under controlled pre- and post-event knowledge access.

If this is right

  • AI cannot yet serve as a standalone tool for prioritizing research investments or setting scientific timelines.
  • Domain-specific differences imply that forecasting methods may need tailoring rather than one-size-fits-all approaches.
  • The large gap between pre- and post-event performance indicates that current models rely more on hindsight than on causal anticipation.
  • Systematic overconfidence and response biases mean AI-generated forecasts require external calibration before use.
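The overconfidence point can be made concrete with a simple calibration check. The sketch below is illustrative and not taken from the paper: the function names, the ten-bin default, and the toy inputs are all assumptions. It computes the gap between mean stated confidence and observed accuracy, plus a standard binned expected calibration error (ECE).

```python
from typing import List, Sequence, Tuple


def _check(confidences: Sequence[float], outcomes: Sequence[int]) -> None:
    """Validate inputs shared by both metrics (hypothetical helper)."""
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("confidences and outcomes must be non-empty and equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")


def overconfidence_gap(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    _check(confidences, outcomes)
    return sum(confidences) / len(confidences) - sum(outcomes) / len(outcomes)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    _check(confidences, outcomes)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, outcomes):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy forecasts (made up): stated confidence and whether the forecast was correct.
    confs = [0.95, 0.85, 0.95, 0.75]
    hits = [1, 0, 0, 1]
    print(overconfidence_gap(confs, hits))
    print(expected_calibration_error(confs, hits))
```

On the toy inputs the gap is roughly 0.375: the forecaster states about 88% confidence on average but is right half the time. A forecast consumer could apply a check like this on held-out, already-resolved events before trusting model-stated probabilities.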

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that AI science tools may need hybrid human-AI loops specifically for temporal and uncertainty judgments.
  • Extending the benchmark to include negative results or failed projects could reveal whether models also underpredict dead ends.
  • If the pattern holds for newer events, it would imply that scaling alone is unlikely to solve scientific forecasting.

Load-bearing premise

That the chosen scientific events represent an unbiased sample of progress and that the four tasks measure genuine forecasting skill rather than surface pattern matching.

What would settle it

Finding that models achieve similar accuracy on timing and feasibility predictions when restricted to pre-event information as they do with full post-event details would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22681 by David Clifton, James Zou, Jonathan Bragg, Junchi Yu, Pan Lu, Peter Clark, Philip Torr, Sean Wu, YuPeng Chen, Yutaro Yamada.

Figure 1: We construct CUSP by aggregating scientific breakthroughs from top-tier journals and community-driven sources across multiple domains. The benchmark is continuously updated with newly published discoveries, enabling an event-level, dynamic, and temporally grounded evaluation of AI systems’ ability to forecast scientific progress beyond a knowledge cutoff. We develop CUSP using a temporally stratified corpu…
Figure 2: A) Source Distribution: Breakdown of the 4,760 scientific milestones by publication venue. B) Task Density by Domain: Distribution of the 17,429 validated tasks across nine top-level domains. C) Temporal Information: Longitudinal count of entries from January 2024 to March 2026. D) Multi-Disciplinary Taxonomy: Sunburst visualization of distinct subcategories. E) Human vs. AI Keep Rates: Calibration of the …
Figure 3: Radar plots of LLM MCQ performance across six models across the main areas of …
Figure 4: Visualization of model bias in binary prediction. [Plot: response rate (0% to 100%) of "Yes" and "No" answers for GPT-OSS 20B, GPT-4o, Claude S4.5, DeepSeek R1, GPT-5.4, and LLaMA 3.3. Panel A: Binary (ground truth = "Yes"). Panel B: Binary perturbed (ground truth = "No").] Calibration v…
Figure 5: Forecasts of global CO2 emissions. [Plot: global CO2 emissions (Gt CO2) by year, 2015 to 2027, showing historical CO2 emissions and forecasts from Claude S4.5, DeepSeek R1, GPT-4o, GPT-5.4, GPT-OSS, and LLaMA 3.3.]
Figure 6: CUSP validation and evaluation pipeline. (A) Benchmark construction: scientific findings are curated and filtered via LLM-based criteria, validated by an independent model, and verified by human experts to produce a high-quality benchmark. (B) Two-track evaluation: model outputs are assessed for outcome correctness (across binary, MCQ, free-response, and date prediction tasks) and reasoning quality (viabil…
Figure 7: A) Visualization of FRQ evaluation on four criteria on 6 LLMs B) FRQ score distribution …
Figure 8: A) Visualization of passing rates across six LLMs B) Visualization on LLM performance …
Figure 9: Visualization of aggregated date predictions across models. Importantly, many models …
Figure 10: Visualization of confidence calibration across six LLMs.
Figure 11: Saturation plot of CUSP compared to other commonly used LLM benchmarks.
Figure 12: Our human evaluation interface is designed to assess alignment, novelty, feasibility, and …
Figure 13: Results on human evaluation vs AI Judge on 60 AI questions. Each human evaluates 20 …
Original abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, a multi-disciplinary, event-level evaluation framework for testing AI systems' ability to forecast scientific progress under controlled knowledge constraints. Across 4,760 scientific events, it assesses frontier models on four tasks: feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Key observations include systematic limitations in predicting realization and timing of advances, domain heterogeneity (e.g., AI progress more predictable than biology/chemistry/physics), insensitivity to training cutoffs, greater benefit from post-event information than prior knowledge, and issues with overconfidence and response biases. The central conclusion is that current AI systems fall short as predictive tools for scientific progress.

Significance. If the event sample proves representative, this benchmark offers a valuable, temporally grounded tool for quantifying AI forecasting gaps in science, with implications for improving uncertainty calibration and long-horizon prediction. The scale (4,760 events), controlled cutoff design, and multi-task structure are strengths that could guide development of more reliable scientific AI. The finding that pre-cutoff knowledge does not close the performance gap to full-information settings points to limitations beyond data exposure, though this hinges on validation of the sampling process.

major comments (2)
  1. [Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.
  2. [Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.
minor comments (2)
  1. [Abstract] Abstract: claims of 'systematic and domain-dependent limitations' and 'strong response biases' would be strengthened by including at least one quantitative example (e.g., accuracy delta or bias rate) rather than qualitative summary only.
  2. [Results] Results presentation: tables or figures reporting cross-domain performance should include statistical significance tests or confidence intervals to support assertions of heterogeneity.
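The second minor comment asks for confidence intervals on cross-domain results. As a rough illustration of what that could look like, the sketch below computes a percentile bootstrap interval for per-domain accuracy. It is not the authors' procedure: the function name, the resample count, the fixed seed, and the toy data are assumptions made for this example.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds 1 for a correct forecast and 0 otherwise.
    """
    if len(correct) == 0:
        raise ValueError("need at least one scored item")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    items = list(correct)
    n = len(items)
    resampled = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the resampled accuracy.
        sample = rng.choices(items, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(items) / n, resampled[lo_idx], resampled[hi_idx]


if __name__ == "__main__":
    # Toy per-domain results (made up): 60 of 100 correct vs. 45 of 100 correct.
    domain_a = [1] * 60 + [0] * 40
    domain_b = [1] * 45 + [0] * 55
    print("domain A:", bootstrap_accuracy_ci(domain_a))
    print("domain B:", bootstrap_accuracy_ci(domain_b))
```

Overlapping intervals between two domains would suggest that an apparent difference is weak evidence of heterogeneity; a paired or permutation test would be a stronger check when the same events are scored across models.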

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification regarding benchmark construction and the evaluation framework. We address each major comment below, indicating where revisions have been made to strengthen the paper while preserving its core contributions.

Point-by-point responses
  1. Referee: [Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.

    Authors: We agree that the original manuscript would benefit from expanded methodological detail on benchmark construction to support the central claims. In the revised version, we have added a dedicated subsection to the Methods that specifies the sampling frame (drawn from curated scientific databases and announcements across disciplines), selection criteria (focusing on non-routine, temporally bounded events with verifiable outcomes), stratification by domain and outcome where possible, exclusion rules for incremental or negative results, and inter-annotator agreement (Cohen's kappa reported for event validation). We further include a supplementary analysis demonstrating that performance patterns hold across citation-impact strata, reducing the likelihood that findings are driven solely by high-visibility successes. These additions directly address concerns about potential selection artifacts. revision: yes

  2. Referee: [Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.

    Authors: We acknowledge the need for greater explicitness in describing how the tasks isolate forward-looking forecasting. The revised manuscript expands the evaluation section to operationalize each task with strict cutoff conditioning that limits models to pre-event information only. Temporal prediction is now explicitly defined using mean absolute error in predicted realization year and accuracy within a ±2-year window, with results reported separately for high- and low-citation events to control for visibility bias. We have also added a discussion of how the multi-task design, combined with full-information baselines, helps distinguish pattern matching from genuine prediction. These clarifications and added controls substantiate the assumptions underlying our claims about misestimation of timing. revision: yes
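The simulated rebuttal names two temporal metrics: mean absolute error in predicted realization year and accuracy within a ±2-year window. A minimal sketch of how those two quantities could be computed follows. The function name, the dictionary keys, and the example years are hypothetical, and the paper's actual scoring code may differ.

```python
from typing import Dict, Sequence


def timing_metrics(
    predicted_years: Sequence[float],
    actual_years: Sequence[float],
    window: float = 2.0,
) -> Dict[str, float]:
    """Mean absolute error in years and share of predictions within +/- window."""
    if len(predicted_years) != len(actual_years) or len(predicted_years) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) for p, a in zip(predicted_years, actual_years)]
    mae = sum(errors) / len(errors)
    # A prediction counts as a hit when it lands within the window, inclusive.
    within = sum(1 for e in errors if e <= window) / len(errors)
    return {"mae_years": mae, "within_window_accuracy": within}


if __name__ == "__main__":
    # Toy example (made up): three events with predicted vs. actual realization year.
    predicted = [2025, 2027, 2030]
    actual = [2025, 2024, 2026]
    print(timing_metrics(predicted, actual))
```

In the toy example the absolute errors are 0, 3, and 4 years, so the mean absolute error is 7/3 years and one of three predictions falls inside the window. Reporting the same two numbers separately for high- and low-citation events, as the rebuttal describes, would only require calling the function once per stratum.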

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

full rationale

The paper introduces the CUSP benchmark as an explicit evaluation framework and reports measured performance differences across feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction tasks on 4,760 events. All central claims (insensitivity to training cutoffs, superiority of post-event information, domain heterogeneity) rest on direct empirical observations and controlled comparisons rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citation chain or ansatz is invoked to justify uniqueness or force results; the work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation framework depends on the representativeness of the chosen scientific events and the validity of the four task definitions as proxies for forecasting; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 11 internal anchors

  1. [1]

    Atypical combinations and scientific impact.Science, 342(6157):468–472, 2013

    Brian Uzzi, Satyam Mukherjee, Michael Stringer, and Ben Jones. Atypical combinations and scientific impact.Science, 342(6157):468–472, 2013

  2. [2]

    The structure of scientific revolutions.The Philosophical Review, 73(3):383– 394, 1964

    Dudley Shapere. The structure of scientific revolutions.The Philosophical Review, 73(3):383– 394, 1964

  3. [3]

    Moore’s law.Electronics Magazine, 38(8):114, 1965

    Gordon Moore. Moore’s law.Electronics Magazine, 38(8):114, 1965

  4. [4]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  5. [5]

    Science of science

    Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. Science of science. Science, 359(6379):eaao0185, 2018

  6. [6]

    Papers and patents are becoming less disruptive over time.Nature, 613(7942):138–144, 2023

    Michael Park, Erin Leahey, and Russell J Funk. Papers and patents are becoming less disruptive over time.Nature, 613(7942):138–144, 2023

  7. [7]

    The development of technology foresight: A review.Technological forecasting and social change, 77(9):1448–1456, 2010

    Ian Miles. The development of technology foresight: A review.Technological forecasting and social change, 77(9):1448–1456, 2010

  8. [8]

    Highly accurate protein structure prediction with alphafold.Nature, 596(7873):583–589, 2021

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ron- neberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.Nature, 596(7873):583–589, 2021

  9. [9]

    Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

    Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

  10. [10]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehra- bian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  11. [11]

    Scaling deep learning for materials discovery.Nature, 624(7990):80–85, 2023

    Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery.Nature, 624(7990):80–85, 2023

  12. [12]

    The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

    Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

  13. [13]

    Accelerating scientific discovery with co-scientist.Nature, 2026

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, et al. Accelerating scientific discovery with co-scientist.Nature, 2026

  14. [14]

    Szostkiewicz, Dmytro Shved, Gavin J

    Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Dmytro Shved, Gavin J. Gyimesi, Jon M. Laurent, Samantha M. Wright, Muhammad T. Razzak, Andrew D. White, Silvia C. Finnemann, Michael M. Hinks, and Samuel G. Rodrigues. A multi-agent system for automating scientific discovery.Nature, 2026. 18

  15. [15]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InAdvances in neural information processing systems, 2022

  16. [16]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst conference on language modeling, 2024

  17. [17]

    Astabench: Rigorous benchmarking of ai agents with a scientific research suite

    Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D Hwang, Peter Jansen, Varsha Kishore, et al. Astabench: Rigorous benchmarking of ai agents with a scientific research suite. InInternational conference on learning representations, 2026

  18. [18]

    A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026

    Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026

  19. [19]

    Prescience: A benchmark for forecasting scientific contributions.arXiv preprint arXiv:2602.20459, 2026

    Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C Kozlowski, Oyvind Tafjord, James Evans, Daniel S Weld, Tom Hope, and Doug Downey. Prescience: A benchmark for forecasting scientific contributions.arXiv preprint arXiv:2602.20459, 2026

  20. [20]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in neural information processing systems, 2024

  21. [21]

    Forecastbench: A dynamic benchmark of AI forecasting capabilities

    Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip Tetlock. Forecastbench: A dynamic benchmark of AI forecasting capabilities. In International conference on learning representations, 2025

  22. [22]

    Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025

    Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, et al. Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025

  23. [23]

    Introducing FOReCAst: The future outcome reasoning and confidence assessment benchmark

    Moy Yuan, Zifeng Ding, and Andreas Vlachos. Introducing FOReCAst: The future outcome reasoning and confidence assessment benchmark. InAdvances in neural information processing systems datasets and benchmarks track, 2025

  24. [24]

    Prophet: An inferable future forecasting benchmark with causal intervened likelihood estimation.arXiv preprint arXiv:2504.01509, 2025

    Zhengwei Tao, Pu Wu, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao, et al. Prophet: An inferable future forecasting benchmark with causal intervened likelihood estimation.arXiv preprint arXiv:2504.01509, 2025

  25. [25]

    ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

    Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, and Dongzhan Zhou. Researchbench: Benchmarking llms in scientific discovery via inspiration-based task decomposition.arXiv preprint arXiv:2503.21248, 2025

  26. [26]

    Matter-of-fact: A benchmark for verifying the feasibility of literature-supported claims in materials science

    Peter Jansen, Samiah Hassan, and Ruoyao Wang. Matter-of-fact: A benchmark for verifying the feasibility of literature-supported claims in materials science. InEmpirical methods in natural language processing, 2025

  27. [27]

    Solving 19 inequality proofs with large language models

    Jiayi Sheng, Luna Lyu, Jikai Jin, Tanglin Xia, Alex Gu, James Zou, and Pan Lu. Solving 19 inequality proofs with large language models. InAdvances in neural information processing systems, 2025

  28. [28]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International conference on learning representations, 2023

  29. [29]

    Walk the talk? measuring the faithfulness of large language model explanations

    Katie Matton, Robert Ness, John Guttag, and Emre Kiciman. Walk the talk? measuring the faithfulness of large language model explanations. InInternational conference on learning representations, 2025

  30. [30]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026

  31. [31]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  32. [32]

    Introducing claude sonnet 4.5

    Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/ claude-sonnet-4-5, September 2025

  33. [33]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  34. [34]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  35. [35]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  36. [36]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational conference on learning representations, 2021

  37. [37]

    Protein data bank.Nature New Biol, 233(223):10–1038, 1971

    Protein Data Bank. Protein data bank.Nature New Biol, 233(223):10–1038, 1971

  38. [38]

    Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

    Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, and James Zou. Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

  39. [39]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

  40. [40]

    Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

    Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and JeffClune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

  41. [41]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 20

  42. [42]

    Kosmos: An AI Scientist for Autonomous Discovery

    Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An ai scientist for autonomous discovery.arXiv preprint arXiv:2511.02824, 2025

  43. [43]

    Ai for scientific discovery is a social problem.arXiv preprint arXiv:2509.06580, 2025

    Georgia Channing and Avijit Ghosh. Ai for scientific discovery is a social problem.arXiv preprint arXiv:2509.06580, 2025

  44. [44]

    When will ai exceed human performance? evidence from ai experts.Journal of Artificial Intelligence Research, 62:729–754, 2018

    Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will ai exceed human performance? evidence from ai experts.Journal of Artificial Intelligence Research, 62:729–754, 2018

  45. [45]

    International sci- entific report on the safety of advanced ai (interim report).arXiv preprint arXiv:2412.05282,

    Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Danielle Goldfarb, Hoda Heidari, Leila Khalatbari, et al. In- ternational scientific report on the safety of advanced ai (interim report).arXiv preprint arXiv:2412.05282, 2024

  46. [46]

    Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks.arXiv preprint arXiv:2601.21165, 2026

    Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks.arXiv preprint arXiv:2601.21165, 2026

  47. [47]

    FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

    Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Car- oline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024

  48. [48]

    Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K. Reddy. LLM-SRBench: A new benchmark for scientific equation discovery with large language models. InInternational Conference on Machine Learning, 2025

  49. [49]

    Forecasting future world events with neural networks

    Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, and Dan Hendrycks. Forecasting future world events with neural networks. InAdvances in Neural Information Processing Systems, 2022

  50. [50]

    Approaching human-level forecasting with language models

    Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. Approaching human-level forecasting with language models. InAdvances in Neural Information Processing Systems, 2024

  51. [51]

    Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation

    Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuan-Jing Huang, and Zhongyu Wei. Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation. In International Conference on Computational Linguistics, 2025

  52. [52]

    Truthtensor: Evaluating llms through human imitation on prediction market under drift and holistic reasoning

    Shirin Shahabi, Spencer Graham, and Haruna Isah. Truthtensor: Evaluating llms through human imitation on prediction market under drift and holistic reasoning. arXiv preprint arXiv:2601.13545, 2026

  53. [53]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. Livebench: A challenging, contamination-limited llm benchmark. arXiv preprint arXiv:2406.19314, 2024

  54. [54]

    Alexander Krauss. Debunking revolutionary paradigm shifts: evidence of cumulative scientific progress across science. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 480(2302), 2024

  55. [55]

    Paradigm shifts as portals to threshold concepts and epistemic transformation

    Kambiz N Alavian. Paradigm shifts as portals to threshold concepts and epistemic transformation. Educational Philosophy and Theory, pages 1–12, 2025

  56. [56]

    Scientific novelty beyond the experiment

    John E Hallsworth, Zulema Udaondo, Carlos Pedrós-Alió, Juan Höfer, Kathleen C Benison, Karen G Lloyd, Radamés JB Cordero, Claudia BL de Campos, Michail M Yakimov, and Ricardo Amils. Scientific novelty beyond the experiment. Microbial Biotechnology, 16(6):1131–1173, 2023

  57. [57]

    Nature of metal-support interaction for metal catalysts on oxide supports

    Tairan Wang, Jianyu Hu, Runhai Ouyang, Yutao Wang, Yi Huang, Sulei Hu, and Wei-Xue Li. Nature of metal-support interaction for metal catalysts on oxide supports. Science, 386(6724):915–920, 2024

  58. [58]

    Functional gradients facilitate tactile sensing in elephant whiskers

    Andrew K Schulz, Lena V Kaufmann, Lawrence T Smith, Deepti S Philip, Hilda David, Jelena Lazovic, Michael Brecht, Gunther Richter, and Katherine J Kuchenbecker. Functional gradients facilitate tactile sensing in elephant whiskers. Science, 391(6786):712–718, 2026

  59. [59]

    Electromagnetic interference shielding using metal and mxene thin films

    Geosan Kang, Guhyeon Kwon, Jiwoon Jeon, Jisung Kwon, Myung-Ki Kim, Junpyo Hong, Albert S Lee, Seongi Lee, Binhyung Lee, Yujin Kim, et al. Electromagnetic interference shielding using metal and mxene thin films. Nature, pages 1–8, 2025

  60. [60]

    Rpg: A repository planning graph for unified and scalable codebase generation

    Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Jianfeng Liu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Yuefeng Zhan, et al. Rpg: A repository planning graph for unified and scalable codebase generation. arXiv preprint arXiv:2509.16198, 2025

  61. [61]

    A smart mask for exhaled breath condensate harvesting and analysis

    Wenzheng Heng, Shukun Yin, Jihong Min, Canran Wang, Hong Han, Ehsan Shirzaei Sani, Jiahong Li, Yu Song, Harry B Rossiter, and Wei Gao. A smart mask for exhaled breath condensate harvesting and analysis. Science, 385(6712):954–961, 2024

  62. [62]

    Comprehensive echocardiogram evaluation with view primed vision language ai

    Milos Vukadinovic, I-Min Chiu, Xiu Tang, Neal Yuan, Tien-Yu Chen, Paul Cheng, Debiao Li, Susan Cheng, Bryan He, and David Ouyang. Comprehensive echocardiogram evaluation with view primed vision language ai. Nature, 650(8103):970–977, 2026

  63. [63]

    Infinidepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields

    Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, and Sida Peng. Infinidepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields. arXiv preprint arXiv:2601.03252, 2026

    DO NOT include any narrative about a paper or discovery. Return JSON with key ‘prompt’. Return JSON only. 67 J Example FRQ Responses GPT-5.4 High-Scoring Response Source Abstract Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands c...