Forecasting Scientific Progress with Artificial Intelligence
Pith reviewed 2026-05-22 05:13 UTC · model grok-4.3
The pith
AI systems cannot reliably predict whether or when scientific advances will occur.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the CUSP benchmark covering 4,760 scientific events and show that frontier models exhibit systematic limitations in forecasting progress: they misestimate timing, overstate feasibility, and gain more from post-event information than from pre-cutoff knowledge, with performance varying by domain but remaining insensitive to training cutoffs.
What carries the argument
The CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, which measures AI forecasting ability through four tasks under controlled pre- and post-event knowledge access.
If this is right
- AI cannot yet serve as a standalone tool for prioritizing research investments or setting scientific timelines.
- Domain-specific differences imply that forecasting methods may need tailoring rather than one-size-fits-all approaches.
- The large gap between pre- and post-event performance indicates that current models rely more on hindsight than on causal anticipation.
- Systematic overconfidence and response biases mean AI-generated forecasts require external calibration before use.
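The calibration point in the last bullet can be made concrete with a small sketch. It assumes each forecast is a stated probability that an event is realized, paired with a 0/1 outcome. The function names, the ten-bin layout, and the equal-width binning are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between stated probability and observed frequency.

    Assumption: `confidences` are probabilities in [0, 1] that an event is
    realized, and `outcomes` are 1 (realized) or 0 (not realized).
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - observed)
    return ece


def overconfidence_gap(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean stated probability minus realized rate; positive means overconfident."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(confidences) / len(confidences) - sum(outcomes) / len(outcomes)
```

A positive `overconfidence_gap` together with a large `expected_calibration_error` would be one way to quantify the kind of overconfidence described above before applying any external recalibration.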
Where Pith is reading between the lines
- The results suggest that AI science tools may need hybrid human-AI loops specifically for temporal and uncertainty judgments.
- Extending the benchmark to include negative results or failed projects could reveal whether models also underpredict dead ends.
- If the pattern holds for newer events, it would imply that scaling alone is unlikely to solve scientific forecasting.
Load-bearing premise
That the chosen scientific events represent an unbiased sample of progress and that the four tasks measure genuine forecasting skill rather than surface pattern matching.
What would settle it
The central claim would be falsified if models restricted to pre-event information matched the timing and feasibility accuracy they achieve with full post-event details.
Original abstract
Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the CUSP (Cutoff-conditioned Unseen Scientific Progress) benchmark, a multi-disciplinary, event-level evaluation framework for testing AI systems' ability to forecast scientific progress under controlled knowledge constraints. Across 4,760 scientific events, it assesses frontier models on four tasks: feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Key observations include systematic limitations in predicting realization and timing of advances, domain heterogeneity (e.g., AI progress more predictable than biology/chemistry/physics), insensitivity to training cutoffs, greater benefit from post-event information than prior knowledge, and issues with overconfidence and response biases. The central conclusion is that current AI systems fall short as predictive tools for scientific progress.
Significance. If the event sample proves representative, this benchmark offers a valuable, temporally grounded tool for quantifying AI forecasting gaps in science, with implications for improving uncertainty calibration and long-horizon prediction. The scale (4,760 events), controlled cutoff design, and multi-task structure are strengths that could guide development of more reliable scientific AI. The finding that pre-cutoff knowledge does not close the performance gap to full-information settings points to limitations beyond data exposure, though this hinges on validation of the sampling process.
major comments (2)
- [Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.
- [Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.
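The inter-annotator agreement requested in the first major comment is often reported as Cohen's kappa, the statistic the authors name in their rebuttal. The following is a minimal sketch for two annotators assigning one categorical label per event. The function name and the example labels are hypothetical, and nothing here reflects how CUSP annotation was actually carried out.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical validation labels for four candidate events.
annotator_1 = ["keep", "keep", "drop", "drop"]
annotator_2 = ["keep", "drop", "drop", "drop"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on three of four items, chance agreement is 0.5, and the resulting kappa is 0.5.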
minor comments (2)
- [Abstract] Abstract: claims of 'systematic and domain-dependent limitations' and 'strong response biases' would be strengthened by including at least one quantitative example (e.g., accuracy delta or bias rate) rather than qualitative summary only.
- [Results] Results presentation: tables or figures reporting cross-domain performance should include statistical significance tests or confidence intervals to support assertions of heterogeneity.
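The confidence intervals requested in the last minor comment could be obtained in several ways. One simple option is a percentile bootstrap over per-event correctness within a domain. The sketch below assumes a list of 0/1 correctness flags for one domain; the function name, resample count, and seed are illustrative and not drawn from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Assumption: `correct` holds 1 for a correct forecast and 0 otherwise,
    for the events of a single domain.
    """
    if not correct:
        raise ValueError("need at least one event")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        # Resample events with replacement and record the accuracy.
        sample_sum = sum(correct[rng.randrange(n)] for _ in range(n))
        resampled.append(sample_sum / n)
    resampled.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(correct) / n, resampled[lo_idx], resampled[hi_idx]
```

Non-overlapping intervals across domains would give informal support to a heterogeneity claim; a formal test would still be needed to make the statement rigorous.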
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. Their comments highlight important areas for clarification regarding benchmark construction and the evaluation framework. We address each major comment below, indicating where revisions have been made to strengthen the paper while preserving its core contributions.
- Referee: [Benchmark construction / Methods] Benchmark construction (methods section describing CUSP): the manuscript provides no details on the sampling frame, selection criteria, stratification by outcome (success/failure), exclusion rules for routine or negative results, or inter-annotator agreement for the 4,760 events. This is load-bearing for the central claim, as the reported insensitivity to cutoffs, domain differences, and superiority of post-event information could be artifacts of over-representing high-visibility or high-citation successes rather than intrinsic model limitations.
  Authors: We agree that the original manuscript would benefit from expanded methodological detail on benchmark construction to support the central claims. In the revised version, we have added a dedicated subsection to the Methods that specifies the sampling frame (drawn from curated scientific databases and announcements across disciplines), selection criteria (focusing on non-routine, temporally bounded events with verifiable outcomes), stratification by domain and outcome where possible, exclusion rules for incremental or negative results, and inter-annotator agreement (Cohen's kappa reported for event validation). We further include a supplementary analysis demonstrating that performance patterns hold across citation-impact strata, reducing the likelihood that findings are driven solely by high-visibility successes. These additions directly address concerns about potential selection artifacts.
  Revision: yes
- Referee: [Evaluation Tasks] Evaluation framework (section on the four tasks): it is not shown how the combination of feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction isolates genuine forward-looking forecasting from surface-level pattern matching on post-event literature. Without explicit operationalization of temporal prediction metrics or controls for citation bias, the claim that models 'systematically misestimate when [advances] will occur' rests on unverified assumptions about what the tasks measure.
  Authors: We acknowledge the need for greater explicitness in describing how the tasks isolate forward-looking forecasting. The revised manuscript expands the evaluation section to operationalize each task with strict cutoff conditioning that limits models to pre-event information only. Temporal prediction is now explicitly defined using mean absolute error in predicted realization year and accuracy within a ±2-year window, with results reported separately for high- and low-citation events to control for visibility bias. We have also added a discussion of how the multi-task design, combined with full-information baselines, helps distinguish pattern matching from genuine prediction. These clarifications and added controls substantiate the assumptions underlying our claims about misestimation of timing.
  Revision: yes
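The two temporal metrics named in the second author response, mean absolute error in predicted realization year and accuracy within a ±2-year window, can be sketched as follows, reported per citation stratum. The record layout, the stratum labels, and the function name are assumptions made for illustration; the paper's actual evaluation code is not shown on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (predicted_year, actual_year, citation_stratum).
Record = Tuple[int, int, str]


def temporal_metrics(records: Iterable[Record], window: int = 2) -> Dict[str, Dict[str, float]]:
    """Per-stratum MAE in years and share of predictions within +/- `window` years."""
    errors_by_stratum = defaultdict(list)
    for predicted, actual, stratum in records:
        errors_by_stratum[stratum].append(abs(predicted - actual))
    results = {}
    for stratum, errors in errors_by_stratum.items():
        results[stratum] = {
            "mae_years": sum(errors) / len(errors),
            "within_window": sum(e <= window for e in errors) / len(errors),
            "n": len(errors),
        }
    return results


# Toy data, not from the benchmark.
example = [
    (2024, 2022, "high"),
    (2025, 2025, "high"),
    (2020, 2025, "low"),
]
summary = temporal_metrics(example)
```

On the toy data, the "high" stratum has an MAE of 1.0 year with every prediction inside the window, while the "low" stratum has an MAE of 5.0 years with none inside it.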
Circularity Check
No circularity in empirical benchmark evaluation
The paper introduces the CUSP benchmark as an explicit evaluation framework and reports measured performance differences across feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction tasks on 4,760 events. All central claims (insensitivity to training cutoffs, superiority of post-event information, domain heterogeneity) rest on direct empirical observations and controlled comparisons rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citation chain or ansatz is invoked to justify uniqueness or force results; the work is self-contained as a benchmark study.