pith. machine review for the scientific record.

arxiv: 2604.19341 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI

Recognition: unknown

Evaluation-driven Scaling for Scientific Discovery

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM scientific discovery · test-time scaling · evaluation-driven refinement · SimpleTES · algorithm optimization · quantum circuit routing · Erdős constructions · trajectory supervision

The pith

Simple test-time scaling of evaluation loops lets open models discover better scientific solutions than frontier systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that evaluation-driven discovery loops can be scaled effectively by combining parallel exploration of candidate solutions, iterative refinement guided by feedback, and local selection among high-scoring options. This matters because it reframes the limits of LLM use in science as a question of how evaluation is orchestrated at inference time, rather than solely a matter of model size or pre-training. SimpleTES implements this scaling in a lightweight way and shows consistent gains across 21 problems in six domains, including concrete advances such as faster LASSO implementations, lower-overhead quantum routing, and improved Erdős constructions. The same process also yields trajectory data that can supervise post-training for improved efficiency and generalization to new problems.

Core claim

SimpleTES scales evaluation-driven discovery by strategically combining parallel exploration, feedback-driven refinement, and local selection; when applied to gpt-oss models it produces state-of-the-art solutions on 21 scientific problems spanning six domains, outperforming both frontier-model baselines and sophisticated optimization pipelines while also generating successful trajectories that improve subsequent model performance through post-training.

What carries the argument

Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that amplifies the impact of verifiers, simulators, or scoring functions by running many candidates in parallel, refining them based on evaluation feedback, and selecting locally optimal trajectories.
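The three ingredients can be sketched as a toy loop. Everything below is illustrative: the propose/refine/evaluate functions are stand-ins for the LLM and verifier calls, not the paper's actual interface, and the one-dimensional objective is invented.

```python
import random

# Toy sketch of an evaluation-driven discovery loop in the SimpleTES
# spirit: parallel exploration, feedback-driven refinement, and local
# selection. All names here are assumptions, not the paper's API.

def propose(rng):
    # Stand-in for an LLM proposing a candidate solution.
    return rng.uniform(-10, 10)

def refine(candidate, feedback, rng):
    # Stand-in for an LLM revising a candidate given evaluator feedback.
    return candidate - 0.5 * feedback + rng.gauss(0, 0.1)

def evaluate(candidate):
    # Stand-in verifier / scoring function: lower loss is better.
    return (candidate - 3.0) ** 2

def simple_tes(n_parallel=8, n_rounds=10, keep=4, seed=0):
    rng = random.Random(seed)
    pool = [propose(rng) for _ in range(n_parallel)]    # parallel exploration
    for _ in range(n_rounds):
        scored = sorted(pool, key=evaluate)[:keep]      # local selection
        pool = [refine(c, 2 * (c - 3.0), rng)           # feedback-driven
                for c in scored                          # refinement
                for _ in range(n_parallel // keep)]
    return min(pool, key=evaluate)

best = simple_tes()
print(round(best, 2), round(evaluate(best), 4))
```

The point of the sketch is only the loop shape: a population explored in parallel, pruned by the evaluator, and refined from evaluator feedback, exactly the three dimensions the framework claims are worth scaling.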

If this is right

  • Evaluation scaling becomes a practical axis for advancing LLM-driven discovery independent of further pre-training gains.
  • Trajectory histories collected during successful discoveries can be reused to post-train models that solve both seen and unseen problems more efficiently.
  • Open-weight models equipped with this loop can surpass closed frontier models on concrete algorithmic and combinatorial tasks.
  • Specific domains see measurable gains: a more than 2x speedup for LASSO, a 24.5% reduction in quantum gate overhead, and new record constructions for the Erdős minimum overlap problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If reliable verifiers can be engineered for additional domains, the same scaling approach may accelerate discovery in fields currently limited by weak feedback signals.
  • The emphasis on test-time loops suggests that future model development could prioritize training for effective use of external evaluation rather than solely increasing raw capability.
  • Post-training on discovery trajectories may create a feedback cycle where each round of scaling produces data that makes the next round more effective.
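The third point, trajectories becoming training data, might look like the filtering step below. The trajectory schema, field names, and threshold are hypothetical, not taken from the paper.

```python
# Hypothetical sketch: turning discovery-loop histories into supervision
# data for post-training. The trajectory schema below is an assumption,
# not the paper's actual format.

def to_sft_examples(trajectories, score_threshold):
    """Keep only steps from trajectories whose final score clears the
    threshold, pairing (problem + feedback) with the accepted revision."""
    examples = []
    for traj in trajectories:
        if traj["final_score"] < score_threshold:
            continue  # discard unsuccessful discovery attempts
        for step in traj["steps"]:
            examples.append({
                "prompt": f"{traj['problem']}\nFeedback: {step['feedback']}",
                "completion": step["revision"],
            })
    return examples

trajs = [
    {"problem": "speed up solver", "final_score": 0.9,
     "steps": [{"feedback": "too slow", "revision": "use caching"}]},
    {"problem": "route qubits", "final_score": 0.2,
     "steps": [{"feedback": "high overhead", "revision": "swap order"}]},
]
data = to_sft_examples(trajs, score_threshold=0.5)
print(len(data))  # only the successful trajectory contributes
```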

Load-bearing premise

Reliable, unbiased verifiers or task-specific scoring functions exist for the problems and can steer refinement without systematic errors or hidden constraints on the search space.

What would settle it

Run SimpleTES on a fresh scientific problem whose verifier is known to be noisy or biased. If it produces no improvement over direct prompting, or yields solutions that fail independent verification while a non-scaled baseline succeeds, the claim that evaluation scaling itself drives discovery would be undermined.
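A toy simulation makes the failure mode concrete: when the in-loop verifier is systematically biased, scaled selection optimizes the bias, and the "best" candidate fails an independent check. The verifier, ground truth, and threshold here are all invented for illustration.

```python
import random

# Toy simulation of the falsification test: a biased in-loop verifier
# steers selection away from the true optimum. Everything here is an
# illustrative assumption, not an experiment from the paper.

def true_quality(x):
    return -(x - 3.0) ** 2          # ground truth: best at x = 3

def biased_verifier(x):
    return -(x - 7.0) ** 2          # systematic bias: rewards x near 7

def select_best(verifier, n=200, seed=0):
    # Scaled-up selection: many candidates, keep the verifier's favorite.
    rng = random.Random(seed)
    candidates = [rng.uniform(0, 10) for _ in range(n)]
    return max(candidates, key=verifier)

best = select_best(biased_verifier)
passes_independent_check = true_quality(best) > -1.0  # within 1 of optimum
print(round(best, 2), passes_independent_check)
```

More sampling only makes the selected candidate hug the biased optimum more tightly, which is why the load-bearing premise above concerns verifier reliability rather than compute.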

read the original abstract

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Simple Test-time Evaluation-driven Scaling (SimpleTES), a framework that combines parallel exploration, feedback-driven refinement, and local selection to scale evaluation-driven discovery loops using LLMs. It reports that SimpleTES, applied with gpt-oss models, achieves state-of-the-art results on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines. Specific claims include a >2x speedup on the LASSO algorithm, a 24.5% reduction in gate overhead for quantum circuit routing, and new Erdős minimum-overlap constructions that surpass prior best-known results. The work also shows that successful discovery trajectories can supervise post-training, improving efficiency on seen problems and enabling generalization to unseen ones.

Significance. If the empirical results hold under properly validated evaluators, the work would be significant for LLM-driven scientific discovery by framing evaluation scaling as a central, actionable axis and offering a simple, practical framework. The trajectory-based post-training component is a notable strength, as it converts discovery histories into reusable supervision signals that demonstrably improve both in-domain efficiency and out-of-domain generalization. These elements could influence how future systems integrate verifiers and simulators into iterative loops.

major comments (2)
  1. [§4 Experimental Setup; §5 Results] The SOTA claims, including the new Erdős minimum-overlap constructions, the >2x LASSO speedup, and the 24.5% quantum-routing improvement, all rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims: any systematic bias or incompleteness in the scorers would turn the reported improvements into artifacts.
  2. [§4 Experimental Setup] No details are provided on experimental controls, statistical significance tests, variance across LLM sampling runs, or exact baseline implementations (e.g., how the frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to tell whether the consistent outperformance is robust and attributable to SimpleTES rather than to implementation choices or stochastic effects.
minor comments (2)
  1. [Abstract; §3] The term 'gpt-oss models' is used without an explicit definition or a list of the specific models and versions employed; this should be clarified for reproducibility.
  2. [§5] Trajectory-level histories are described as naturally supervising feedback-driven learning, but the post-training protocol (data filtering, loss formulation, training hyperparameters) receives only high-level treatment; a dedicated subsection or appendix would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4 Experimental Setup; §5 Results] The SOTA claims, including the new Erdős minimum-overlap constructions, the >2x LASSO speedup, and the 24.5% quantum-routing improvement, all rest on problem-specific verifiers and scoring functions. No independent validation, cross-checks against external oracles or full literature baselines, or ablations on scorer fidelity/approximation error are reported. This is load-bearing for the central empirical claims: any systematic bias or incompleteness in the scorers would turn the reported improvements into artifacts.

    Authors: We agree that the reliability of the reported improvements depends on the verifiers. For each of the 21 problems, the scoring functions are drawn from established, objective metrics in the respective literatures (e.g., wall-clock runtime via standard solvers for LASSO, exact gate-count simulation for quantum routing, and direct mathematical verification for Erdős constructions). In the revised manuscript we have added a new subsection to §4 that explicitly documents every verifier, its implementation, known limitations, and any approximation error bounds. For the new Erdős constructions we now include the explicit solutions together with a verification script in the supplementary material. We have also added a targeted ablation on scorer fidelity for the subset of problems that use approximate evaluators, confirming that the reported gains remain stable under reasonable perturbations. While we cannot perform external oracle validation within the scope of this work, the added documentation and code enable independent reproduction and checking. We maintain that the improvements are not artifacts because all methods (SimpleTES, frontier models, and optimization baselines) were evaluated under identical verifiers. revision: partial
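A scorer-fidelity ablation of the kind the rebuttal describes could be as simple as re-comparing the two solutions under perturbed scores and checking how often the reported gain survives. The scores and noise model below are invented for illustration, not the paper's numbers.

```python
import random

# Illustrative scorer-fidelity ablation: perturb an approximate scorer
# and check whether the reported improvement is stable. The scores and
# Gaussian noise model are assumptions, not values from the paper.

baseline_score = 1.00      # e.g., normalized runtime of the baseline
discovered_score = 0.45    # the discovered solution under the scorer

def gain_under_noise(noise_sd, trials=1000, seed=0):
    """Fraction of perturbed evaluations in which the discovered
    solution still beats the baseline (lower score is better)."""
    rng = random.Random(seed)
    wins = sum(
        discovered_score + rng.gauss(0, noise_sd)
        < baseline_score + rng.gauss(0, noise_sd)
        for _ in range(trials)
    )
    return wins / trials

for sd in (0.05, 0.2, 0.5):
    print(sd, gain_under_noise(sd))
```

A gain that holds only at the smallest noise level would be the warning sign the referee is asking the authors to rule out.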

  2. Referee: [§4 Experimental Setup] No details are provided on experimental controls, statistical significance tests, variance across LLM sampling runs, or exact baseline implementations (e.g., how the frontier-model and optimization-pipeline comparisons were configured). Without these, it is impossible to tell whether the consistent outperformance is robust and attributable to SimpleTES rather than to implementation choices or stochastic effects.

    Authors: We acknowledge that the original §4 lacked sufficient experimental detail. In the revised version we have expanded §4 with the following additions: (1) precise configurations and prompt templates for all frontier-model and optimization-pipeline baselines; (2) the number of independent sampling runs (five per problem, different random seeds) together with reported standard deviations; (3) statistical significance testing via paired t-tests with p-values now shown in the result tables; and (4) fixed sampling hyperparameters (temperature, top-p, etc.) across all compared methods. These controls demonstrate that the observed gains are robust and attributable to the SimpleTES framework rather than implementation or stochastic variation. revision: yes
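The paired t-test over five matched runs per problem can be sketched without a statistics library by computing the paired t statistic directly and comparing it to the two-sided critical value for df = 4 (2.776 at alpha = 0.05). The run scores below are invented for illustration.

```python
import math
import statistics

# Sketch of the rebuttal's significance protocol: a paired t-test over
# five matched runs. The scores are hypothetical, not from the paper.

def paired_t(xs, ys):
    """Paired t statistic for matched samples xs, ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)        # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n))

simple_tes_runs = [0.91, 0.88, 0.93, 0.90, 0.89]   # hypothetical scores
baseline_runs   = [0.80, 0.82, 0.79, 0.84, 0.81]

t = paired_t(simple_tes_runs, baseline_runs)
# Two-sided critical value for df = n - 1 = 4 at alpha = 0.05 is 2.776;
# here t ≈ 5.81, comfortably above it.
print(round(t, 2), t > 2.776)
```

Pairing by seed is the right design here because each seed's baseline and SimpleTES runs share sampling randomness, so the test operates on per-seed differences rather than pooled variance.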

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

full rationale

The paper presents SimpleTES as an empirical framework combining parallel exploration, feedback-driven refinement, and local selection, then reports performance gains on 21 problems against external baselines (LASSO runtime, quantum gate counts, Erdős overlap records). No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. Post-training on trajectories is described as an additional observed outcome rather than a definitional step. All central claims rest on comparisons to independent SOTA and baselines, making the work self-contained against external metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities can be identified. The framework relies on standard LLM generation and problem-specific evaluators whose details are not provided.

pith-pipeline@v0.9.0 · 5669 in / 1208 out tokens · 34336 ms · 2026-05-10T03:36:48.275377+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Harnessing Agentic Evolution

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Reference graph

Works this paper leans on

188 extracted references · 85 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Accurate structure predic- tion of biomolecular interactions with alphafold 3

    Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ron- neberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure predic- tion of biomolecular interactions with alphafold 3. Nature, 630(8016):493–500, 2024

  2. [2]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv...

  3. [3]

    Natarajan, and K

    Nasir Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE T ransactions on Computers, 23(1):90–93, 1974

  4. [4]

    FARS: Fully automated research system

    Analemma AI. FARS: Fully automated research system. https://analemma.ai/fars/, 2026

  5. [5]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Con- crete problems in ai safety , 2016. URL https://arxiv.org/abs/1606.06565

  6. [6]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025. URL https: //github.com/algorithmicsuperintelligence/openevolve. GitHub repository

  7. [7]

    arXiv preprint arXiv:2510.14150 , year =

    Henrique Assumpção, Diego Ferreira, Leandro Campos, and Fabricio Murai. Codeevolve: an open source evolutionary coding agent for algorithmic discovery and optimization. arXiv preprint arXiv:2510.14150, 2025. doi: 10.48550/arXiv.2510.14150. URL https://arxiv.org/abs/2510.14150

  8. [8]

    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. pages 6709–6738. Association for Computational Linguistics, 2025. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025. naacl-long.342. URL https://aclanthology.org/2025.naacl-long.342/

  9. [9]

    Barnard and Stefan Steinerberger

    Richard C. Barnard and Stefan Steinerberger. Three convolution inequalities on the real line with connections to additive combinatorics. Journal of Number Theory , 207:42–55, 2020. ISSN 0022-314X. doi: https://doi.org/10.1016/j.jnt.2019.07.001. URL https://www.sciencedirect.com/ science/article/pii/S0022314X19302549

  10. [10]

    Molecular cross-validation for single-cell rna-seq

    Joshua Batson, Loïc Royer, and James Webber. Molecular cross-validation for single-cell rna-seq. BioRxiv, page 786269, 2019

  11. [11]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, T omasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

  12. [12]

    Aster: Autonomous scientific discovery over 20x faster than existing methods

    Emmett Bicker. Aster: Autonomous scientific discovery over 20x faster than existing methods. arXiv preprint arXiv:2602.07040, 2026

  13. [13]

    A quantum processor based on coherent transport of entangled atom arrays

    Dolev Bluvstein, Harry Levine, Giulia Semeghini, T out T Wang, Sepehr Ebadi, Marcin Kalinowski, Alexander Keesling, Nishad Maskara, Hannes Pichler, Markus Greiner, et al. A quantum processor based on coherent transport of entangled atom arrays. Nature, 604(7906):451–456, 2022

  14. [14]

    Logical quantum processor based on reconfigurable atom arrays

    Dolev Bluvstein, Simon J Evered, Alexandra A Geim, Sophie H Li, Hengyun Zhou, T om Manovitz, Sepehr Ebadi, Madelyn Cain, Marcin Kalinowski, Dominik Hangleiter, et al. Logical quantum processor based on reconfigurable atom arrays. Nature, 626(7997):58–65, 2024

  15. [15]

    An improved example for an autoconvolution inequality

    Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality. Ex- perimental Mathematics, 2026. doi: 10.1080/10586458.2025.2607423. Published online 2026-02-15

  16. [16]

    Brent, William Orrick, Judy anne Osborn, and Paul Zimmermann

    Richard P . Brent, William Orrick, Judy anne Osborn, and Paul Zimmermann. Maximal determi- nants and saturated D-optimal designs of orders 19 and 37, 2011. URL https://arxiv.org/abs/ 1112.4160

  17. [17]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky , Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  18. [18]

    Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu

    Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing , 16(5):1190–1208, 1995

  19. [19]

    K-search: Llm kernel generation via co-evolving intrinsic world model.arXiv preprint arXiv:2602.19128,

    Shiyi Cao, Ziming Mao, Joseph E Gonzalez, and Ion Stoica. K-search: Llm kernel generation via Page 62 of 110 Evaluation-driven Scaling for Scientific Discovery co-evolving intrinsic world model. arXiv preprint arXiv:2602.19128, 2026

  20. [20]

    Cawley and Nicola L

    Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selec- tion bias in performance evaluation. Journal of Machine Learning Research , 11(70):2079–2107, 2010. URL https://www.jmlr.org/papers/v11/cawley10a.html

  21. [21]

    2602.20133 , archivePrefix =

    Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica. Adaevolve: Adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133, 2026. doi: 10.48550/ arXiv.2602.20133. URL https://arxiv.org/abs/2602.20133

  22. [22]

    On the role of feedback in test-time scaling of agentic ai workflows, 2025

    Souradip Chakraborty , Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, and T omas Pfister. On the role of feedback in test-time scaling of agentic ai workflows, 2025. URL https://arxiv.org/abs/ 2504.01931

  23. [23]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury , Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

  24. [24]

    Parallel scaling law for language models, 2025

    Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, and Zhongxin Liu. Parallel scaling law for language models. arXiv preprint arXiv:2505.10475, 2025

  25. [25]

    T umix: Multi-agent test-time scaling with tool-use mixture, 2025

    Yongchao Chen, Jiefeng Chen, Rui Meng, Ji Yin, Na Li, Chuchu Fan, Chi Wang, T omas Pfister, and Jinsung Yoon. T umix: Multi-agent test-time scaling with tool-use mixture, 2025. URL https:// arxiv.org/abs/2510.01279

  26. [26]

    Cooley and John W

    James W. Cooley and John W. T ukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965

  27. [27]

    On the qubit routing problem

    Alexander Cowtan, Silas Dilkes, Ross Duncan, Alexandre Krajenbrink, Will Simmons, and Seyon Sivarajah. On the qubit routing problem. In 14th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2019) , volume 135 of Leibniz International Proceedings in In- formatics (LIPIcs) , pages 5:1–5:32. Schloss Dagstuhl – Leibniz-Zent...

  28. [28]

    Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation.arXiv preprint arXiv:2602.24286, 2026

    Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yu- fan Song, Hongli Yu, and et al. Chen, Jiaze. Cuda agent: Large-scale agentic rl for high-performance cuda kernel generation. arXiv preprint arXiv:2602.24286, 2026

  29. [29]

    Sandboxes, 2026

    Daytona. Sandboxes, 2026. URL https://www.daytona.io/docs/en/sandboxes/. Documentation

  30. [30]

    net/forum?id=nZeVKeeFYf9

    Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, and Siheng Chen. Swe-dev: Evaluating and training autonomous feature-driven software development. arXiv preprint arXiv:2505.16975, 2025

  31. [31]

    E2b documentation, 2026

    E2B. E2b documentation, 2026. URL https://e2b.dev/docs. Cloud sandboxing and code- interpreting documentation for AI agents

  32. [32]

    Alphazero-like tree-search can guide large lan- guage model decoding and training.arXiv preprint arXiv:2309.17179, 2023

    Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023

  33. [33]

    Richard P . Feynman. Simulating physics with computers. International Journal of Theoretical Physics , 21(6):467–488, 1982. ISSN 1572-9575. doi: 10.1007/BF02650179. URL https://doi.org/10.1007/ BF02650179

  34. [34]

    Regularization paths for generalized linear models via coordinate descent

    Jerome H Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33:1–22, 2010

  35. [35]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 1, 2025

  36. [36]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Ma- chine Learning Research , pages 10835–10866. PMLR, 2023. URL https://proceedings.mlr.press/ v202/gao23h.html

  37. [38]

    Georgiev, J

    Bogdan Georgiev , Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale. arXiv preprint arXiv:2511.02864, 2025. URL https://arxiv.org/ abs/2511.02864

  38. [39]

    Quantum error correction below the surface code threshold

    Google Quantum AI et al. Quantum error correction below the surface code threshold. Nature, 638 (8052):920–926, 2025

  39. [40]

    Gpu mode reference kernels, 2026

    GPU Mode. Gpu mode reference kernels, 2026. URL https://github.com/gpu-mode/ reference-kernels

  40. [41]

    Trimul competition, 2026

    GPU Mode. Trimul competition, 2026. URL https://www.gpumode.com/leaderboard/496

  41. [42]

    Gpu mode, 2026

    GPU Mode. Gpu mode, 2026. URL https://www.gpumode.com/

  42. [43]

    A fast quantum mechanical algorithm for database search

    Lov K Grover. A fast quantum mechanical algorithm for database search. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing , pages 212–219, 1996

  43. [44]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  44. [45]

    Katalin Gyarmati, François Hennecart, and Imre Z. Ruzsa. Sums and differences of finite sets. Func- tiones et Approximatio Commentarii Mathematici , 37(1):175–186, 2007

  45. [46]

    The minimum overlap problem revisited, 2016

    Jan Kristian Haugland. The minimum overlap problem revisited, 2016. URL https://arxiv.org/ abs/1609.08000

  46. [47]

    Peter V . Hegarty. Some explicit constructions of sets with more sums than differences. Acta Arith- metica, 130(1):61–77, 2007. doi: 10.4064/aa130-1-4

  47. [48]

    A literature review on circle and sphere packing problems: Models and methodologies

    Mhand Hifi and Rym M’Hallah. A literature review on circle and sphere packing problems: Models and methodologies. Advances in Operations Research, 2009:150624, 2009. doi: 10.1155/2009/150624

  48. [49]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. arXiv preprint arXiv:2508.05004, 2025

  49. [50]

    Olympiad-level formal mathematical reasoning with reinforcement learning

    Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pages 1–3, 2025

  50. [51]

    Ibm quantum computing: Hardware and roadmap, 2026

    IBM Quantum. Ibm quantum computing: Hardware and roadmap, 2026. URL https://www.ibm. com/quantum/hardware

  51. [52]

    Autonomous LLM-driven research – from data to human-verifiable research papers

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay , and Roy Kishony. Autonomous LLM-driven research – from data to human-verifiable research papers. NEJM AI , 2(1), 2025. doi: 10.1056/ AIoa2400555. URL https://doi.org/10.1056/AIoa2400555

  52. [53]

    ALE-Bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050,

    Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. ALE-Bench: A benchmark for long-horizon objective-driven algorithm engineering, 2025. URL https://arxiv.org/abs/2506.09050

  53. [54]

    Wider or deeper? scaling llm inference-time compute with adaptive branching tree search.arXiv preprint arXiv:2503.04412, 2025

    Yuichi Inoue, Kou Misaki, Yuki Imajuku, So Kuroki, Taishi Nakamura, and Takuya Akiba. Wider or deeper? scaling llm inference-time compute with adaptive branching tree search, 2025. URL https://arxiv.org/abs/2503.04412

  54. [55]

    Algorithmic theory of qubit routing

    Takehiro Ito, Naonori Kakimura, Naoyuki Kamiyama, Yusuke Kobayashi, and Yoshio Okamoto. Algorithmic theory of qubit routing. In Algorithms and Data Structures Symposium , pages 533–546. Springer, 2023

  55. [56]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky , Aiden Low , Alec Helyar, Aleksander Madry , Alex Beutel, Alex Carney , et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  56. [57]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Tim- othée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://...

  57. [58]

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Associ- ation for Computational Linguistics (V olume 1: Long Papers), pages 14165–14178, 2023

  58. [59]

    SWE-bench: Can language models resolve real-world github issues? In The T welfth International Conference on Learning Representations, 2024

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Page 64 of 110 Evaluation-driven Scaling for Scientific Discovery Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The T welfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id= VTF8yNQM66

  59. [60]

    autoresearch: A simple and efficient AI agent for autonomous ML research

    Andrej Karpathy. autoresearch: A simple and efficient AI agent for autonomous ML research. https://github.com/karpathy/autoresearch, 2026

  60. [61]

    Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, and Xin Liu. T owards a science of scaling agent systems, 2025. URL https://arxiv.org/abs/2512.08296

    [62] Morten Kjaergaard, Mollie E Schwartz, Jochen Braumüller, Philip Krantz, Joel I-J Wang, Simon Gustavsson, and William D Oliver. Superconducting qubits: Current state of play. Annual Review of Condensed Matter Physics, 11(1):369–395, 2020.

    [63] Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349, 2025. doi: 10.48550/arXiv.2509.19349. URL https://arxiv.org/abs/2509.19349

    [64] Patrick W. Langley, Herbert A. Simon, Gary Bradshaw, and Jan M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, Cambridge, MA, 1987. URL https://mitpress.mit.edu/9780262620529/scientific-discovery/

    [65] Ang Li, Samuel Stein, Sriram Krishnamoorthy, and James Ang. QASMBench: A low-level quantum benchmark suite for NISQ evaluation and simulation. ACM Transactions on Quantum Computing, 4(2):1–26, 2023.

    [66] Gushu Li, Yufei Ding, and Yuan Xie. Tackling the qubit mapping problem for NISQ-era quantum devices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 1001–1014, 2019.

    [67] Houyi Li, Wenzhen Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Shuigeng Zhou, Xiangyu Zhang, et al. Predictable scale: Part I–optimal hyperparameter scaling law in large language model pretraining. arXiv e-prints, arXiv–2503, 2025.

    [68] Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, and Yitao Liang. Selecting large language model to fine-tune via rectified scaling law. In International Conference on Machine Learning, 2024.

    [69] Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, Yitao Liang, and James Y. Zou. Can language models discover scaling laws? In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=TPTtWC0pGk

    [70] Wan-Hsuan Lin, Daniel Bochen Tan, and Jason Cong. Reuse-aware compilation for zoned quantum architectures based on neutral atoms. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 127–142. IEEE, 2025.

    [71] George C Linderman, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, and Yuval Kluger. Zero-preserving imputation of single-cell RNA-seq data. Nature Communications, 13(1):192, 2022.

    [72] Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, and Ion Stoica. Evox: Meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413, 2026. doi: 10.48550/arXi...

    [73] Tengxiao Liu, Zifeng Wang, Jin Miao, I-Hung Hsu, Jun Yan, Jiefeng Chen, Rujun Han, Fangyuan Xu, Yanfei Chen, Ke Jiang, Samira Daruki, Yi Liang, William Yang Wang, Tomas Pfister, and Chen-Yu Lee. Budget-aware tool-use enables effective agent scaling, 2025. URL https://arxiv.org/abs/2511.17006

    [74] Seth Lloyd. Universal quantum simulators. Science, 273(5278):1073–1078, 1996.

    [75] Chris Lu, Cong Lu, Robert Tjarko Lange, Yutaro Yamada, Shengran Hu, Jakob Foerster, David Ha, and Jeff Clune. Towards end-to-end automation of AI research. Nature, 651(8107):914–919, 2026. doi: 10.1038/s41586-026-10265-5. URL https://doi.org/10.1038/s41586-026-10265-5

    [76] Malte D Luecken and Fabian J Theis. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6):MSB188746, 2019.

    [77] Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, et al. Defining and benchmarking open problems in single-cell analysis. Nature Biotechnology, 43(7):1035–1040, 2025.

    [78] Evan Z Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R Bialas, Nolan Kamitaki, Emily M Martersteck, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, 2015.

    [79] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems...

    [80] Hosam Mahmoud. Pólya Urn Models. Chapman and Hall/CRC, 2008.

    [81] Greg Martin and Kevin O'Bryant. Many sets have more sums than differences, 2006. URL https://arxiv.org/abs/math/0608131
