Forecasting Scientific Progress with Artificial Intelligence

David Clifton; James Zou; Jonathan Bragg; Junchi Yu; Pan Lu; Peter Clark; Philip Torr; Sean Wu; YuPeng Chen; Yutaro Yamada

arxiv: 2605.22681 · v1 · pith:OF3FRH5Qnew · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 05:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords: scientific forecasting, AI evaluation, benchmark, knowledge cutoff, temporal prediction, scientific progress, uncertainty estimation

The pith

AI systems cannot reliably predict whether or when scientific advances will occur.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests current AI models on their ability to forecast real scientific events using only knowledge available before those events. It finds that models can often select plausible research directions but fail at judging whether advances will happen or when they will arrive. Performance improves when models are given information after the events have occurred, but pre-event knowledge alone does not close the gap. This holds across disciplines and suggests that access to historical data does not produce reliable forward-looking predictions.

Core claim

We introduce the CUSP benchmark covering 4,760 scientific events and show that frontier models exhibit systematic limitations in forecasting progress: they misestimate timing, overstate feasibility, and gain more from post-event information than from pre-cutoff knowledge, with performance varying by domain but remaining insensitive to training cutoffs.

What carries the argument

The CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, which measures AI forecasting ability through four tasks under controlled pre- and post-event knowledge access.
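
One way to picture the cutoff conditioning is to split events at a model's training cutoff and score each side separately. The sketch below is illustrative only: the `accuracy_by_cutoff` helper, the `(event_date, is_correct)` tuple layout, and the toy events are assumptions made for this example, not the paper's data or code.

```python
from datetime import date

def accuracy_by_cutoff(events, cutoff):
    """Split (event_date, is_correct) pairs at `cutoff`.

    Returns (accuracy on events dated on or before the cutoff,
             accuracy on events dated after the cutoff).
    A side with no events yields None.
    """
    before = [ok for day, ok in events if day <= cutoff]
    after = [ok for day, ok in events if day > cutoff]

    def accuracy(flags):
        return sum(flags) / len(flags) if flags else None

    return accuracy(before), accuracy(after)

# Hypothetical toy events, not taken from the benchmark.
events = [
    (date(2024, 3, 1), True),
    (date(2024, 9, 15), False),
    (date(2025, 6, 1), True),
    (date(2025, 11, 20), False),
]
pre, post = accuracy_by_cutoff(events, cutoff=date(2024, 12, 31))
print(f"pre-cutoff accuracy: {pre}, post-cutoff accuracy: {post}")
```

In this toy case both sides score the same, which is the shape of result the paper describes as insensitivity to the training cutoff.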

If this is right

AI cannot yet serve as a standalone tool for prioritizing research investments or setting scientific timelines.
Domain-specific differences imply that forecasting methods may need tailoring rather than one-size-fits-all approaches.
The large gap between pre- and post-event performance indicates that current models rely more on hindsight than on causal anticipation.
Systematic overconfidence and response biases mean AI-generated forecasts require external calibration before use.
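
The overconfidence point can be made concrete with a standard calibration summary. The sketch below computes expected calibration error (ECE) over equal-width confidence bins. ECE is used here as a common choice; the abstract does not say which calibration metric the paper reports. The function name and the toy numbers are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy per bin.

    `confidences` are probabilities in [0, 1]; `correct` are 0/1 flags.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: claims 90% confidence, is right half the time.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 1])
print(f"ECE: {ece:.2f}")
```

A well-calibrated forecaster would drive this gap toward zero; a large value signals that stated confidence needs rescaling before it is used for decisions.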

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The results suggest that AI science tools may need hybrid human-AI loops specifically for temporal and uncertainty judgments.
Extending the benchmark to include negative results or failed projects could reveal whether models also underpredict dead ends.
If the pattern holds for newer events, it would imply that scaling alone is unlikely to solve scientific forecasting.

Load-bearing premise

That the chosen scientific events represent an unbiased sample of progress and that the four tasks measure genuine forecasting skill rather than surface pattern matching.

What would settle it

Finding that models achieve similar accuracy on timing and feasibility predictions when restricted to pre-event information as they do with full post-event details would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22681 by David Clifton, James Zou, Jonathan Bragg, Junchi Yu, Pan Lu, Peter Clark, Philip Torr, Sean Wu, YuPeng Chen, Yutaro Yamada.

**Figure 1.** We construct CUSP by aggregating scientific breakthroughs from top-tier journals and community-driven sources across multiple domains. The benchmark is continuously updated with newly published discoveries, enabling an event-level, dynamic, and temporally grounded evaluation of AI systems’ ability to forecast scientific progress beyond a knowledge cutoff. We develop CUSP using a temporally stratified corpu…

**Figure 2.** A) Source Distribution: Breakdown of the 4,760 scientific milestones by publication venue. B) Task Density by Domain: Distribution of the 17,429 validated tasks across nine top-level domains. C) Temporal Information: Longitudinal count of entries from January 2024 to March 2026. D) Multi-Disciplinary Taxonomy: Sunburst visualization of distinct subcategories. E) Human vs. AI Keep Rates: Calibration of the …

**Figure 3.** Radar plots of LLM MCQ performance across six models across the main areas of …

**Figure 4.** Visualization of model bias in binary prediction. [Plot omitted. Panel A: Binary (ground truth = "Yes"), where "Yes" is correct and "No" is incorrect. Panel B: Binary perturbed (ground truth = "No"), where "Yes" is incorrect and "No" is correct. Horizontal axis: response rate (0% to 100%). Models: GPT-OSS 20B, GPT-4o, Claude S4.5, DeepSeek R1, GPT-5.4, LLaMA 3.3.] Calibration v…

**Figure 5.** Forecasts of global CO2 emissions. [Plot omitted. Horizontal axis: year (2015 to 2027). Vertical axis: global CO2 emissions (Gt CO2). Series: Claude S4.5, DeepSeek R1, GPT-4o, GPT-5.4, GPT-OSS, LLaMA 3.3, and historical CO2 emissions.]

**Figure 6.** CUSP validation and evaluation pipeline. (A) Benchmark construction: scientific findings are curated and filtered via LLM-based criteria, validated by an independent model, and verified by human experts to produce a high-quality benchmark. (B) Two-track evaluation: model outputs are assessed for outcome correctness (across binary, MCQ, free-response, and date prediction tasks) and reasoning quality (viabil…

**Figure 7.** A) Visualization of FRQ evaluation on four criteria on 6 LLMs B) FRQ score distribution

**Figure 8.** A) Visualization of passing rates across six LLMs B) Visualization on LLM performance

**Figure 9.** Visualization of aggregated date predictions across models. Importantly, many models …

**Figure 10.** Visualization of confidence calibration across six LLMs.

**Figure 11.** Saturation plot of CUSP compared to other commonly used LLM benchmarks.

**Figure 12.** Our human evaluation interface is designed to assess alignment, novelty, feasibility, and …

**Figure 13.** Results on human evaluation vs AI Judge on 60 AI questions. Each human evaluates 20 …

Original abstract

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new CUSP benchmark tests AI forecasting on real scientific events under knowledge cutoffs, but details of event selection are thin.

The main thing to know is that this paper builds a benchmark called CUSP to check whether current AI models can forecast scientific advances under realistic knowledge limits, and it finds they fall short on timing and realization predictions while showing domain gaps and overconfidence. The setup uses 4,760 events and four tasks, with explicit training cutoffs to separate knowledge exposure from true forecasting ability. Performance stays similar before and after cutoffs, and extra pre-cutoff info helps only modestly compared to full post-event access, especially on high-citation cases. Timing of AI-related progress comes out more predictable than biology or physics advances. That pattern is the clearest new signal here.

The work does a solid job laying out controlled information access and reporting systematic biases like overconfidence and response tendencies across tasks. It also ties the results to practical risks for using AI in research planning.

The soft spot sits in the event sample itself. The abstract gives no sampling frame, stratification by outcome success or failure, or rules for excluding routine or negative results, so the observed limitations could partly reflect a tilt toward visible, high-impact cases rather than the broader distribution of scientific progress. Without those details or checks for citation bias and inter-annotator reliability, the domain differences and insensitivity claims rest on unverified ground.

This is worth attention for groups working on AI evaluation benchmarks or long-horizon prediction in science. Readers who care about how models handle uncertainty or domain-specific forecasting will get concrete numbers to think about. It deserves a serious referee to verify the selection protocol and run robustness checks on the patterns. I would send it to review rather than desk reject, with a note to expand the methods section on curation and add controls for selection effects.

Referee Report

2 major / 2 minor

Summary. The paper introduces the CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, a multi-disciplinary, event-level evaluation framework for testing AI systems' ability to forecast scientific progress under controlled knowledge constraints. Across 4,760 scientific events, it assesses frontier models on four tasks: feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Key observations include systematic limitations in predicting realization and timing of advances, domain heterogeneity (e.g., AI progress more predictable than biology/chemistry/physics), insensitivity to training cutoffs, greater benefit from post-event information than prior knowledge, and issues with overconfidence and response biases. The central conclusion is that current AI systems fall short as predictive tools for scientific progress.

Significance. If the event sample proves representative, this benchmark offers a valuable, temporally grounded tool for quantifying AI forecasting gaps in science, with implications for improving uncertainty calibration and long-horizon prediction. The scale (4,760 events), controlled cutoff design, and multi-task structure are strengths that could guide development of more reliable scientific AI. The finding that pre-cutoff knowledge does not close the performance gap to full-information settings points to limitations beyond data exposure, though this hinges on validation of the sampling process.

major comments (2)

[Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.
[Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.

minor comments (2)

[Abstract] Abstract: claims of 'systematic and domain-dependent limitations' and 'strong response biases' would be strengthened by including at least one quantitative example (e.g., accuracy delta or bias rate) rather than qualitative summary only.
[Results] Results presentation: tables or figures reporting cross-domain performance should include statistical significance tests or confidence intervals to support assertions of heterogeneity.
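
The confidence intervals requested in the last comment could be produced in several ways. A minimal sketch, assuming per-event 0/1 correctness flags are available for each domain, is a percentile bootstrap. The helper name, the resample count, and the toy outcomes are all hypothetical and are not drawn from the paper.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `outcomes` is a list of 0/1 flags (1 = model answered correctly).
    Returns (accuracy, lower bound, upper bound).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample with replacement and record the accuracy of each resample.
    resampled = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical per-domain outcomes, not results from the paper.
domain_outcomes = {
    "AI": [1] * 70 + [0] * 30,
    "Biology": [1] * 55 + [0] * 45,
}
for domain, outcomes in domain_outcomes.items():
    acc, lower, upper = bootstrap_accuracy_ci(outcomes)
    print(f"{domain}: accuracy {acc:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Intervals that barely overlap are suggestive of a real domain difference, but they are not a substitute for a formal test on the difference itself.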

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification regarding benchmark construction and the evaluation framework. We address each major comment below, indicating where revisions have been made to strengthen the paper while preserving its core contributions.

Referee: [Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.

Authors: We agree that the original manuscript would benefit from expanded methodological detail on benchmark construction to support the central claims. In the revised version, we have added a dedicated subsection to the Methods that specifies the sampling frame (drawn from curated scientific databases and announcements across disciplines), selection criteria (focusing on non-routine, temporally bounded events with verifiable outcomes), stratification by domain and outcome where possible, exclusion rules for incremental or negative results, and inter-annotator agreement (Cohen's kappa reported for event validation). We further include a supplementary analysis demonstrating that performance patterns hold across citation-impact strata, reducing the likelihood that findings are driven solely by high-visibility successes. These additions directly address concerns about potential selection artifacts.

Revision: yes
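
For readers unfamiliar with the agreement statistic named in the response above, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. A minimal sketch follows; the keep/drop labels are hypothetical and are not taken from the paper's annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical event-validation decisions from two annotators.
annotator_1 = ["keep", "keep", "drop", "keep", "drop", "drop"]
annotator_2 = ["keep", "drop", "drop", "keep", "drop", "keep"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the annotators agree on four of six items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.
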
Referee: [Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.

Authors: We acknowledge the need for greater explicitness in describing how the tasks isolate forward-looking forecasting. The revised manuscript expands the evaluation section to operationalize each task with strict cutoff conditioning that limits models to pre-event information only. Temporal prediction is now explicitly defined using mean absolute error in predicted realization year and accuracy within a ±2-year window, with results reported separately for high- and low-citation events to control for visibility bias. We have also added a discussion of how the multi-task design, combined with full-information baselines, helps distinguish pattern matching from genuine prediction. These clarifications and added controls substantiate the assumptions underlying our claims about misestimation of timing.

Revision: yes
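
The two timing metrics named in this response are simple to state in code. The sketch below is an illustration under stated assumptions: the `timing_metrics` helper and the toy years are hypothetical, and the simulated rebuttal does not specify further details such as how ties or missing predictions are handled.

```python
from statistics import mean

def timing_metrics(predicted_years, actual_years, window=2):
    """Return (mean absolute error in years, fraction within +/- `window` years)."""
    if len(predicted_years) != len(actual_years) or not predicted_years:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) for p, a in zip(predicted_years, actual_years)]
    mae = mean(errors)
    within_window = sum(err <= window for err in errors) / len(errors)
    return mae, within_window

# Hypothetical predicted vs. realized years, not taken from the paper.
predicted = [2025, 2027, 2030, 2024]
actual = [2025, 2024, 2026, 2025]
mae, within_window = timing_metrics(predicted, actual)
print(f"MAE: {mae} years, within ±2 years: {within_window:.0%}")
```

Reporting both numbers matters because a few large misses can inflate the mean error even when most predictions fall inside the window.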

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

The paper introduces the CUSP benchmark as an explicit evaluation framework and reports measured performance differences across feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction tasks on 4,760 events. All central claims (insensitivity to training cutoffs, superiority of post-event information, domain heterogeneity) rest on direct empirical observations and controlled comparisons rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citation chain or ansatz is invoked to justify uniqueness or force results; the work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation framework depends on the representativeness of the chosen scientific events and the validity of the four task definitions as proxies for forecasting; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 11 internal anchors

[1]

Atypical combinations and scientific impact.Science, 342(6157):468–472, 2013

Brian Uzzi, Satyam Mukherjee, Michael Stringer, and Ben Jones. Atypical combinations and scientific impact.Science, 342(6157):468–472, 2013

work page 2013
[2]

The structure of scientific revolutions.The Philosophical Review, 73(3):383– 394, 1964

Dudley Shapere. The structure of scientific revolutions.The Philosophical Review, 73(3):383– 394, 1964

work page 1964
[3]

Moore’s law.Electronics Magazine, 38(8):114, 1965

Gordon Moore. Moore’s law.Electronics Magazine, 38(8):114, 1965

work page 1965
[4]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[5]

Science of science

Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. Science of science. Science, 359(6379):eaao0185, 2018

work page 2018
[6]

Papers and patents are becoming less disruptive over time.Nature, 613(7942):138–144, 2023

Michael Park, Erin Leahey, and Russell J Funk. Papers and patents are becoming less disruptive over time.Nature, 613(7942):138–144, 2023

work page 2023
[7]

The development of technology foresight: A review.Technological forecasting and social change, 77(9):1448–1456, 2010

Ian Miles. The development of technology foresight: A review.Technological forecasting and social change, 77(9):1448–1456, 2010

work page 2010
[8]

Highly accurate protein structure prediction with alphafold.Nature, 596(7873):583–589, 2021

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ron- neberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.Nature, 596(7873):583–589, 2021

work page 2021
[9]

Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

work page 2024
[10]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V ˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehra- bian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Scaling deep learning for materials discovery.Nature, 624(7990):80–85, 2023

Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery.Nature, 624(7990):80–85, 2023

work page 2023
[12]

The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

Kyle Swanson, Wesley Wu, Nash L Bulaong, John E Pak, and James Zou. The virtual lab of ai agents designs new sars-cov-2 nanobodies.Nature, 646(8085):716–723, 2025

work page 2025
[13]

Accelerating scientific discovery with co-scientist.Nature, 2026

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, et al. Accelerating scientific discovery with co-scientist.Nature, 2026

work page 2026
[14]

Szostkiewicz, Dmytro Shved, Gavin J

Ali Essam Ghareeb, Benjamin Chang, Ludovico Mitchener, Angela Yiu, Caralyn J. Szostkiewicz, Dmytro Shved, Gavin J. Gyimesi, Jon M. Laurent, Samantha M. Wright, Muhammad T. Razzak, Andrew D. White, Silvia C. Finnemann, Michael M. Hinks, and Samuel G. Rodrigues. A multi-agent system for automating scientific discovery.Nature, 2026. 18

work page 2026
[15]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InAdvances in neural information processing systems, 2022

work page 2022
[16]

Gpqa: A graduate-level google-proof q&a benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst conference on language modeling, 2024

work page 2024
[17]

Astabench: Rigorous benchmarking of ai agents with a scientific research suite

Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D Hwang, Peter Jansen, Varsha Kishore, et al. Astabench: Rigorous benchmarking of ai agents with a scientific research suite. InInternational conference on learning representations, 2026

work page 2026
[18]

A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026

Center for AI Safety, Scale AI, and HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities.Nature, 649:1139–1146, 2026

work page 2026
[19]

Prescience: A benchmark for forecasting scientific contributions.arXiv preprint arXiv:2602.20459, 2026

Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C Kozlowski, Oyvind Tafjord, James Evans, Daniel S Weld, Tom Hope, and Doug Downey. Prescience: A benchmark for forecasting scientific contributions.arXiv preprint arXiv:2602.20459, 2026

work page arXiv 2026
[20]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in neural information processing systems, 2024

work page 2024
[21]

Forecastbench: A dynamic benchmark of AI forecasting capabilities

Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip Tetlock. Forecastbench: A dynamic benchmark of AI forecasting capabilities. In International conference on learning representations, 2025

work page 2025
[22]

Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025

Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, et al. Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025

work page arXiv 2025
[23]

Introducing FOReCAst: The future outcome reasoning and confidence assessment benchmark

Moy Yuan, Zifeng Ding, and Andreas Vlachos. Introducing FOReCAst: The future outcome reasoning and confidence assessment benchmark. InAdvances in neural information processing systems datasets and benchmarks track, 2025

work page 2025
[24]

Prophet: An inferable future forecasting benchmark with causal intervened likelihood estimation.arXiv preprint arXiv:2504.01509, 2025

Zhengwei Tao, Pu Wu, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao, et al. Prophet: An inferable future forecasting benchmark with causal intervened likelihood estimation.arXiv preprint arXiv:2504.01509, 2025

work page arXiv 2025
[25]

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, and Dongzhan Zhou. Researchbench: Benchmarking llms in scientific discovery via inspiration-based task decomposition.arXiv preprint arXiv:2503.21248, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Matter-of-fact: A benchmark for verifying the feasibility of literature-supported claims in materials science

Peter Jansen, Samiah Hassan, and Ruoyao Wang. Matter-of-fact: A benchmark for verifying the feasibility of literature-supported claims in materials science. InEmpirical methods in natural language processing, 2025

work page 2025
[27]

Solving 19 inequality proofs with large language models

Jiayi Sheng, Luna Lyu, Jikai Jin, Tanglin Xia, Alex Gu, James Zou, and Pan Lu. Solving 19 inequality proofs with large language models. InAdvances in neural information processing systems, 2025

work page 2025
[28]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International conference on learning representations, 2023

work page 2023
[29]

Walk the talk? measuring the faithfulness of large language model explanations

Katie Matton, Robert Ness, John Guttag, and Emre Kiciman. Walk the talk? measuring the faithfulness of large language model explanations. InInternational conference on learning representations, 2025

work page 2025
[30]

Introducing GPT-5.4

OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026

work page 2026
[31]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Introducing claude sonnet 4.5

Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/news/ claude-sonnet-4-5, September 2025

work page 2025
[33]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[36]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational conference on learning representations, 2021

work page 2021
[37]

Protein data bank.Nature New Biol, 233(223):10–1038, 1971

Protein Data Bank. Protein data bank.Nature New Biol, 233(223):10–1038, 1971

work page 1971
[38]

Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, and James Zou. Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

work page 2025
[39]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and JeffClune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

work page 2026
[41]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 20

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42] Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An ai scientist for autonomous discovery. arXiv preprint arXiv:2511.02824, 2025.
[43] Georgia Channing and Avijit Ghosh. Ai for scientific discovery is a social problem. arXiv preprint arXiv:2509.06580, 2025.
[44] Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will ai exceed human performance? evidence from ai experts. Journal of Artificial Intelligence Research, 62:729–754, 2018.
[45] Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Danielle Goldfarb, Hoda Heidari, Leila Khalatbari, et al. International scientific report on the safety of advanced ai (interim report). arXiv preprint arXiv:2412.05282, 2024.
[46] Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks. arXiv preprint arXiv:2601.21165, 2026.
[47] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai. arXiv preprint arXiv:2411.04872, 2024.
[48] Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K. Reddy. LLM-SRBench: A new benchmark for scientific equation discovery with large language models. In International Conference on Machine Learning, 2025.
[49] Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, and Dan Hendrycks. Forecasting future world events with neural networks. In Advances in Neural Information Processing Systems, 2022.
[50] Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. Approaching human-level forecasting with language models. In Advances in Neural Information Processing Systems, 2024.
[51] Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuan-Jing Huang, and Zhongyu Wei. Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation. In International conference on computational linguistics, 2025.
[52] Shirin Shahabi, Spencer Graham, and Haruna Isah. Truthtensor: Evaluating llms through human imitation on prediction market under drift and holistic reasoning. arXiv preprint arXiv:2601.13545, 2026.
[53] Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. Livebench: A challenging, contamination-limited llm benchmark. arXiv preprint arXiv:2406.19314, 2024.
[54] Alexander Krauss. Debunking revolutionary paradigm shifts: evidence of cumulative scientific progress across science. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 480(2302), 2024.
[55] Kambiz N Alavian. Paradigm shifts as portals to threshold concepts and epistemic transformation. Educational Philosophy and Theory, pages 1–12, 2025.
[56] John E Hallsworth, Zulema Udaondo, Carlos Pedrós-Alió, Juan Höfer, Kathleen C Benison, Karen G Lloyd, Radamés JB Cordero, Claudia BL de Campos, Michail M Yakimov, and Ricardo Amils. Scientific novelty beyond the experiment. Microbial Biotechnology, 16(6):1131–1173, 2023.
[57] Tairan Wang, Jianyu Hu, Runhai Ouyang, Yutao Wang, Yi Huang, Sulei Hu, and Wei-Xue Li. Nature of metal-support interaction for metal catalysts on oxide supports. Science, 386(6724):915–920, 2024.
[58] Andrew K Schulz, Lena V Kaufmann, Lawrence T Smith, Deepti S Philip, Hilda David, Jelena Lazovic, Michael Brecht, Gunther Richter, and Katherine J Kuchenbecker. Functional gradients facilitate tactile sensing in elephant whiskers. Science, 391(6786):712–718, 2026.
[59] Geosan Kang, Guhyeon Kwon, Jiwoon Jeon, Jisung Kwon, Myung-Ki Kim, Junpyo Hong, Albert S Lee, Seongi Lee, Binhyung Lee, Yujin Kim, et al. Electromagnetic interference shielding using metal and mxene thin films. Nature, pages 1–8, 2025.
[60] Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Jianfeng Liu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Yuefeng Zhan, et al. Rpg: A repository planning graph for unified and scalable codebase generation. arXiv preprint arXiv:2509.16198, 2025.
[61] Wenzheng Heng, Shukun Yin, Jihong Min, Canran Wang, Hong Han, Ehsan Shirzaei Sani, Jiahong Li, Yu Song, Harry B Rossiter, and Wei Gao. A smart mask for exhaled breath condensate harvesting and analysis. Science, 385(6712):954–961, 2024.
[62] Milos Vukadinovic, I-Min Chiu, Xiu Tang, Neal Yuan, Tien-Yu Chen, Paul Cheng, Debiao Li, Susan Cheng, Bryan He, and David Ouyang. Comprehensive echocardiogram evaluation with view primed vision language ai. Nature, 650(8103):970–977, 2026.
[63] Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, and Sida Peng. Infinidepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields. arXiv preprint arXiv:2601.03252, 2026.

A Benchmark Construction Details

A.1 Data acquisition and source construction

Natural Science Dat...
[64] plausible enough to require real reasoning,
[65] not directly supported by the abstract,
[66] not trivially wrong or obviously eliminated. Important rules: • Judge the distractors as a set; do not judge the stem or the correct answer in this call • Pass only if the distractors are non-trivial and sufficiently plausible • Fail if the distractors are too easy, too obviously wrong, or directly supported by the abstract • Fail if the distractors are not me...
[67] ‘json code block. Write 1-2 sentences of reasoning with inline citations ([title](url)) before the block. {
alignment (0–10): Does the LLM RESPONSE describe the specific approach used in the paper? Use web search to find the actual paper method. 0–2: completely wrong direction or no meaningful content 3–4: roughly right area but missing key specifics of the actual method 5–6: captures the main idea but lacks important details or misstates them 7–8: matches the core ...
[68] Keep ALL benchmark names, dataset names, and task names EXACTLY the same. Do NOT change which benchmark or dataset is referenced
[69] ONLY modify an EXISTING numeric score/threshold, or add a credible unmet constraint
[70] IF modifying an existing numeric score, RAISE it enough so the original result definitively does NOT satisfy the perturbed claim (e.g., if original is 94.2%, change to 95.8%; if 51.7%, change to 54.5%). Make the increase a clear shift so there is no ambiguity, but still physically plausible
[71] IF the original claim has no specific numbers, you MUST add a highly specific, definitive unmet constraint (e.g., ‘while using 50% fewer parameters’, ‘but fails completely on zero-shot tasks’, or ‘but requires 3x the memory’). Make this constraint significant enough that it’s noticeably harder to satisfy than the original
[72] The perturbed claim must be plausible and not absurd
[73] {problem_statement}
Keep the same length, style, and level of specificity. Return JSON with: • ‘perturbed_result’: The counterfactual alternative result claim • ‘changed_detail’: Which aspect of the result was modified

Create MCQ Distractors

You are a technical forecasting analyst who designs extraordinarily difficult, graduate-level evaluations. Your task is to create a mul...
[74] The problem description must come ONLY from the Problem Statement
[75] DO NOT mention any specific method, architecture, technique, or approach
[76] DO NOT include any narrative about a paper or discovery. Return JSON with key ‘prompt’. Return JSON only.

J Example FRQ Responses

GPT-5.4 High-Scoring Response

Source Abstract

Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands c...

[36] [36]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational conference on learning representations, 2021

work page 2021

[37] [37]

Protein data bank.Nature New Biol, 233(223):10–1038, 1971

Protein Data Bank. Protein data bank.Nature New Biol, 233(223):10–1038, 1971

work page 1971

[38] [38]

Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

Federico Bianchi, Owen Queen, Nitya Thakkar, Eric Sun, and James Zou. Exploring the use of ai authors and reviewers at agents4science.Nature Biotechnology, pages 1–4, 2025

work page 2025

[39] [39]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and JeffClune. Towards end-to-end automation of ai research.Nature, 651(8107):914–919, 2026

work page 2026

[41] [41]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist.arXiv preprint arXiv:2502.18864, 2025. 20

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Kosmos: An AI Scientist for Autonomous Discovery

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An ai scientist for autonomous discovery.arXiv preprint arXiv:2511.02824, 2025

work page internal anchor Pith review arXiv 2025

[43] [43]

Ai for scientific discovery is a social problem.arXiv preprint arXiv:2509.06580, 2025

Georgia Channing and Avijit Ghosh. Ai for scientific discovery is a social problem.arXiv preprint arXiv:2509.06580, 2025

work page arXiv 2025

[44] [44]

When will ai exceed human performance? evidence from ai experts.Journal of Artificial Intelligence Research, 62:729–754, 2018

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. When will ai exceed human performance? evidence from ai experts.Journal of Artificial Intelligence Research, 62:729–754, 2018

work page 2018

[45] [45]

International sci- entific report on the safety of advanced ai (interim report).arXiv preprint arXiv:2412.05282,

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Danielle Goldfarb, Hoda Heidari, Leila Khalatbari, et al. In- ternational scientific report on the safety of advanced ai (interim report).arXiv preprint arXiv:2412.05282, 2024

work page arXiv 2024

[46] [46]

Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks.arXiv preprint arXiv:2601.21165, 2026

Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks.arXiv preprint arXiv:2601.21165, 2026

work page arXiv 2026

[47] [47]

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Car- oline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai.arXiv preprint arXiv:2411.04872, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K. Reddy. LLM-SRBench: A new benchmark for scientific equation discovery with large language models. InInternational Conference on Machine Learning, 2025

work page 2025

[49] [49]

Forecasting future world events with neural networks

Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, and Dan Hendrycks. Forecasting future world events with neural networks. InAdvances in Neural Information Processing Systems, 2022

work page 2022

[50] [50]

Approaching human-level forecasting with language models

Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. Approaching human-level forecasting with language models. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[51] [51]

Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation

Siyuan Wang, Zhuohan Long, Zhihao Fan, Xuan-Jing Huang, and Zhongyu Wei. Benchmark self-evolving: A multi-agent framework for dynamic llm evaluation. InInternational conference on computational linguistics, 2025

work page 2025

[52] [52]

Truthtensor: Evaluating llms through human imitation on prediction market under drift and holistic reasoning.arXiv preprint arXiv:2601.13545, 2026

Shirin Shahabi, Spencer Graham, and Haruna Isah. Truthtensor: Evaluating llms through human imitation on prediction market under drift and holistic reasoning.arXiv preprint arXiv:2601.13545, 2026

work page arXiv 2026

[53] [53]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, et al. Livebench: A challenging, contamination-limited llm benchmark.arXiv preprint arXiv:2406.19314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Alexander Krauss. Debunking revolutionary paradigm shifts: evidence of cumulative sci- entific progress across science.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 480(2302), 2024. 21

work page 2024

[55] [55]

Paradigm shifts as portals to threshold concepts and epistemic transfor- mation.Educational Philosophy and Theory, pages 1–12, 2025

Kambiz N Alavian. Paradigm shifts as portals to threshold concepts and epistemic transfor- mation.Educational Philosophy and Theory, pages 1–12, 2025

work page 2025

[56] [56]

Scientific novelty beyond the experiment.Microbial Biotechnology, 16(6):1131–1173, 2023

John E Hallsworth, Zulema Udaondo, Carlos Pedrós-Alió, Juan Höfer, Kathleen C Benison, Karen G Lloyd, Radamés JB Cordero, Claudia BL de Campos, Michail M Yakimov, and Ricardo Amils. Scientific novelty beyond the experiment.Microbial Biotechnology, 16(6):1131–1173, 2023

work page 2023

[57] [57]

Nature of metal-support interaction for metal catalysts on oxide supports.Science, 386(6724):915–920, 2024

Tairan Wang, Jianyu Hu, Runhai Ouyang, Yutao Wang, Yi Huang, Sulei Hu, and Wei-Xue Li. Nature of metal-support interaction for metal catalysts on oxide supports.Science, 386(6724):915–920, 2024

work page 2024

[58] [58]

Functional gradients facilitate tactile sensing in elephant whiskers.Science, 391(6786):712–718, 2026

Andrew K Schulz, Lena V Kaufmann, Lawrence T Smith, Deepti S Philip, Hilda David, Jelena Lazovic, Michael Brecht, Gunther Richter, and Katherine J Kuchenbecker. Functional gradients facilitate tactile sensing in elephant whiskers.Science, 391(6786):712–718, 2026

work page 2026

[59] [59]

Electromagnetic interference shielding using metal and mxene thin films.Nature, pages 1–8, 2025

Geosan Kang, Guhyeon Kwon, Jiwoon Jeon, Jisung Kwon, Myung-Ki Kim, Junpyo Hong, Albert S Lee, Seongi Lee, Binhyung Lee, Yujin Kim, et al. Electromagnetic interference shielding using metal and mxene thin films.Nature, pages 1–8, 2025

work page 2025

[60] [60]

Rpg: A repository planning graph for unified and scalable codebase generation.arXiv preprint arXiv:2509.16198, 2025

Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Jianfeng Liu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Yuefeng Zhan, et al. Rpg: A repository planning graph for unified and scalable codebase generation.arXiv preprint arXiv:2509.16198, 2025

work page arXiv 2025

[61] [61]

A smart mask for exhaled breath condensate harvesting and analysis.Science, 385(6712):954–961, 2024

Wenzheng Heng, Shukun Yin, Jihong Min, Canran Wang, Hong Han, Ehsan Shirzaei Sani, Jiahong Li, Yu Song, Harry B Rossiter, and Wei Gao. A smart mask for exhaled breath condensate harvesting and analysis.Science, 385(6712):954–961, 2024

work page 2024

[62] [62]

Comprehensive echocardiogram evaluation with view primed vision language ai.Nature, 650(8103):970–977, 2026

Milos Vukadinovic, I-Min Chiu, Xiu Tang, Neal Yuan, Tien-Yu Chen, Paul Cheng, Debiao Li, Susan Cheng, Bryan He, and David Ouyang. Comprehensive echocardiogram evaluation with view primed vision language ai.Nature, 650(8103):970–977, 2026

work page 2026

[63] [63]

Top 10 AI Papers of the Week

Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, and Sida Peng. Infinidepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields.arXiv preprint arXiv:2601.03252, 2026. 22 A Benchmark Construction Details A.1 Data acquisition and source construction Natural Science Dat...

work page doi:10.1126/science.adx8981 2026

[64] [64]

plausible enough to require real reasoning,

work page

[65] [65]

not directly supported by the abstract,

work page

[66] [66]

verdict":

not trivially wrong or obviously eliminated. Important rules: • Judge the distractors as a set — do not judge the stem or the correct answer in this call •Passonly if the distractors are non-trivial and sufficiently plausible •Failif the distractors are too easy, too obviously wrong, or directly supported by the abstract •Failif the distractors are not me...

work page 2026

[67] [67]

‘json code block. Write 1-2 sentences of reasoning with inline citations ( [title](url)) before the block. {

alignment(0–10): Does the LLM RESPONSE describe the specific approach used in the paper? Use web search to find the actual paper method. 0–2:completely wrong direction or no meaningful content 3–4:roughly right area but missing key specifics of the actual method 5–6:captures the main idea but lacks important details or misstates them 7–8:matches the core ...

work page doi:10.1016/j.cell.2026.01.023 2025

[68]

Keep ALL benchmark names, dataset names, and task names EXACTLY the same. Do NOT change which benchmark or dataset is referenced

[69]

ONLY modify an EXISTING numeric score/threshold, or add a credible unmet constraint

[70]

IF modifying an existing numeric score, RAISE it enough so the original result definitively does NOT satisfy the perturbed claim (e.g., if original is 94.2%, change to 95.8%; if 51.7%, change to 54.5%). Make the increase a clear shift so there is no ambiguity, but still physically plausible

[71]

IF the original claim has no specific numbers, you MUST add a highly specific, definitive unmet constraint (e.g., ‘while using 50% fewer parameters’, ‘but fails completely on zero-shot tasks’, or ‘but requires 3x the memory’). Make this constraint significant enough that it’s noticeably harder to satisfy than the original

[72]

The perturbed claim must be plausible and not absurd

[73]

{problem_statement}

Keep the same length, style, and level of specificity. Return JSON with:
• ‘perturbed_result’: The counterfactual alternative result claim
• ‘changed_detail’: Which aspect of the result was modified

Create MCQ Distractors

You are a technical forecasting analyst who designs extraordinarily difficult, graduate-level evaluations. Your task is to create a mul...

[74]

The problem description must come ONLY from the Problem Statement

[75]

DO NOT mention any specific method, architecture, technique, or approach

[76]

DO NOT include any narrative about a paper or discovery. Return JSON with key ‘prompt’. Return JSON only.

J Example FRQ Responses

GPT-5.4 High-Scoring Response

Source Abstract

Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands c...
