pith. machine review for the scientific record. sign in

arxiv: 2601.16175 · v2 · submitted 2026-01-22 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Learning to Discover at Test Time

Authors on Pith no claims yet

Pith reviewed 2026-05-16 05:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords test-time trainingreinforcement learninglarge language modelsproblem discoveryoptimizationcontinuous rewardsalgorithm designGPU kernels
0
0 comments X

The pith

Reinforcement learning at test time on one problem lets an open LLM produce new state-of-the-art solutions for math, coding, and biology tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that continues training an LLM with reinforcement learning while solving a single test problem, using experience gathered only from that problem. The goal is to refine toward one high-quality solution rather than averaging performance across many problems or relying on a frozen model prompted for search. The authors apply the approach to continuous-reward tasks and report new best results on mathematical inequalities, GPU kernel optimization, algorithm design contests, and single-cell data denoising. These outcomes are obtained with an open model and modest compute cost, and the solutions receive expert review. If the method works as described, it offers a practical route to AI-assisted discovery on hard, narrowly defined problems without requiring larger closed models.

Core claim

TTT-Discover performs reinforcement learning at test time so the LLM continues to train with experience specific to the test problem. The learning objective and search subroutine are designed to prioritize the most promising solutions and thereby produce one great solution for that exact problem rather than many good ones on average. Applied across mathematics, GPU kernel engineering, algorithm design, and biology, the method sets new state-of-the-art results on Erdős' minimum overlap problem, an autocorrelation inequality, a GPUMode kernel competition with up to 2 times faster kernels, past AtCoder algorithm competitions, and a denoising problem in single-cell analysis, all achieved with an

What carries the argument

Test-Time Training to Discover (TTT-Discover), which applies reinforcement learning at test time on experience gathered from one specific problem to refine the model toward a single superior solution.

If this is right

  • New state-of-the-art solutions become reachable for continuous-reward problems in mathematics, engineering, algorithms, and biology using open models.
  • Test-time reinforcement learning can outperform prompting a frozen LLM for discovery-oriented search.
  • Expert-reviewed improvements are achievable in GPU kernel speed and algorithm contest performance.
  • Results remain reproducible with publicly available code at a cost of a few hundred dollars per problem.
  • The same training loop can be applied directly to new problems without retraining on a broad distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If test-time training scales reliably, teams could solve narrow but high-value scientific problems by investing modest compute on one instance rather than retraining large models.
  • The approach might transfer to non-language models if the reinforcement signal can be defined for other continuous optimization domains.
  • Repeated application across related problems could accumulate specialized knowledge inside a single model instance without full retraining.
  • Human experts could supply the reward function or final validation step to steer the search toward practically useful rather than merely high-scoring solutions.

Load-bearing premise

That reinforcement learning performed at test time on experience specific to one problem will reliably produce a single superior solution rather than overfitting or failing to improve over frozen-model search.

What would settle it

Reproducing the method on one of the reported problems, such as the GPUMode kernel task, and failing to match or exceed the claimed performance gains with the same open model.

read the original abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erd\H{o}s' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Test-Time Training to Discover (TTT-Discover), which performs reinforcement learning at test time on experience specific to a single test problem. The goal is to produce one superior solution rather than average performance across problems. The authors apply this to continuous-reward tasks and report new state-of-the-art results on Erdős' minimum overlap problem and an autocorrelation inequality, a GPUMode kernel competition (up to 2× faster), past AtCoder algorithm competitions, and a single-cell denoising task, all using the open gpt-oss-120b model with publicly released code.

Significance. If the central claims are substantiated, the work would be significant for demonstrating that problem-specific test-time RL can yield discovery-level improvements across mathematics, systems engineering, algorithms, and biology while using only an open model and modest compute budgets. The emphasis on reproducibility via public code and the contrast with prior closed-model results are concrete strengths that would lower barriers to AI-assisted discovery if the performance gains are shown to arise from the adaptation mechanism rather than extended search alone.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.
  2. [§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.
minor comments (2)
  1. [Abstract] Abstract: The statement that results were 'reviewed by experts or the organizers' would be strengthened by naming the reviewers or providing links to the review process for each domain.
  2. [Throughout] Throughout: The paper would benefit from explicit reporting of wall-clock time, number of RL steps, and variance across random seeds for each claimed improvement to allow direct comparison with prior frozen-LLM baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's potential significance. We address each major comment below and commit to revisions that will strengthen the manuscript's claims regarding the contributions of test-time training.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.

    Authors: We agree that isolating the effect of online weight updates from extended inference-time search with a frozen model is crucial for attributing the performance gains to the TTT-Discover mechanism. In the revised version, we will add ablations for the Erdős minimum overlap and GPUMode tasks. These will compare TTT-Discover against a frozen-model baseline that uses the same total compute budget but without weight updates, employing standard prompting or search methods. This will demonstrate whether the continual learning component provides benefits beyond increased search effort. revision: yes

  2. Referee: [§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.

    Authors: We recognize the need for greater formality in describing the modifications to the learning objective and search subroutine. In the revision, we will include explicit equations for the redesigned objective function that emphasizes promising solutions, along with pseudocode for the adapted search procedure. Additionally, we will provide a quantitative comparison to standard RL methods, including metrics on solution prioritization and overfitting prevention, such as reward concentration on top solutions and generalization within the single-instance setting. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies standard RL at test time with no self-referential derivations

full rationale

The paper presents TTT-Discover as an application of reinforcement learning performed at test time on problem-specific experience, with objectives and search subroutines redesigned to prioritize promising solutions for a single instance rather than average performance. No equations, derivations, or parameter-fitting steps are described that would reduce any claimed result to its own inputs by construction. The approach is framed as a direct extension of existing RL techniques to the test-time setting, with empirical SOTA results reported across tasks; these outcomes are presented as experimental findings rather than outputs of a closed mathematical chain. No self-citation load-bearing premises, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the description. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5629 in / 1120 out tokens · 33516 ms · 2026-05-16T05:11:47.328644+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LawOfExistence existence_economically_inevitable echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  3. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  4. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  5. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  6. New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search

    cs.AI 2026-05 accept novelty 7.0

    LLM-reinforced evolutionary search produces exact values Z(11,21,3,3)=116, Z(11,22,3,3)=121, Z(12,22,3,3)=132 and lower bounds for 41 additional Zarankiewicz numbers.

  7. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  8. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  9. Epistemic Uncertainty for Test-Time Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.

  10. What should post-training optimize? A test-time scaling law perspective

    cs.LG 2026-05 unverdicted novelty 6.0

    Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

  11. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  12. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  13. Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    A jointly learned hierarchical index with cross-attention and residual quantization scales exact retrieval in foundational recommendation models, deployed at Meta with additional performance from test-time training on...

  14. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  15. TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution

    cs.NE 2026-04 unverdicted novelty 6.0

    TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...

  16. GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.

  17. Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

    cs.CL 2026-03 unverdicted novelty 6.0

    Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...

  18. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  19. Grokability in five inequalities

    math.PR 2026-05 unverdicted novelty 5.0

    Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 18 Pith papers · 15 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  2. [2]

    The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

    Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

  3. [3]

    AtCoder.https://atcoder.jp, 2025

    AtCoder Inc. AtCoder.https://atcoder.jp, 2025

  4. [4]

    Test-time Offline Reinforcement Learning on Goal-related Experience

    Marco Bagatella, Mert Albaba, Jonas Hübotter, Georg Martius, and Andreas Krause. Test-time offline reinforcement learning on goal-related experience.arXiv preprint arXiv:2507.18809, 2025

  5. [5]

    Three convolution inequalities on the real line with connections to additive combinatorics.Journal of Number Theory, 207:42–55, 2020

    Richard C Barnard and Stefan Steinerberger. Three convolution inequalities on the real line with connections to additive combinatorics.Journal of Number Theory, 207:42–55, 2020

  6. [6]

    Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019

    Joshua Batson, Loic Royer, and James Webber. Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019

  7. [7]

    Neural Combinatorial Optimization with Reinforcement Learning

    Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

  8. [8]

    Local learning algorithms.Neural computation, 4(6):888–900, 1992

    Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992

  9. [9]

    An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025

    Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025

  10. [10]

    Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

    William S Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

  11. [11]

    A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

    Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

  12. [12]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018

  13. [13]

    Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022

  14. [14]

    Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025

  15. [15]

    Dynamic few-shot visual learning without forgetting

    Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018. 20

  16. [16]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  17. [17]

    Embracing change: Continual learning in deep neural networks.Trends in cognitive sciences, 24(12):1028–1040, 2020

    Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks.Trends in cognitive sciences, 24(12):1028–1040, 2020

  18. [18]

    Test-time training on nearest neighbors for large language models.arXiv preprint arXiv:2305.18466, 2023

    Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models.arXiv preprint arXiv:2305.18466, 2023

  19. [19]

    Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

    Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

  20. [20]

    The minimum overlap problem revisited

    Jan Kristian Haugland. The minimum overlap problem revisited.arXiv preprint arXiv:1609.08000, 2016

  21. [21]

    Masked Autoencoders Are Scalable Vision Learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoen- coders are scalable vision learners.CoRR, abs/2111.06377, 2021

  22. [22]

    The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021

  23. [23]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  24. [24]

    Hubert, R

    T. Hubert, R. Mehta, L. Sartran, and et al. Olympiad-level formal mathematical reasoning with reinforce- ment learning.Nature, 2025

  25. [25]

    Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024

    Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024

  26. [26]

    Learning on the job: Test-time curricula for targeted reinforcement learning.arXiv preprint arXiv:2510.04786, 2025

    Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, and Moritz Hardt. Learning on the job: Test-time curricula for targeted reinforcement learning.arXiv preprint arXiv:2510.04786, 2025

  27. [27]

    Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025

    Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025

  28. [28]

    Online domain adaptation of a pre-trained cascade of classifiers

    Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011

  29. [29]

    Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025

    Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025

  30. [30]

    Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

  31. [31]

    Continual pre-training of language models.arXiv preprint arXiv:2302.03241, 2023

    Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv preprint arXiv:2302.03241, 2023

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  33. [33]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

  34. [34]

    Wilds: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. InInternational conference on machine learning, pages 5637–5664. PMLR, 2021. 21

  35. [35]

    Dynamic evaluation of neural sequence models

    Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. InInternational Conference on Machine Learning, pages 2766–2775. PMLR, 2018

  36. [36]

    Tinker, 2025

    Thinking Machines Lab. Tinker, 2025

  37. [37]

    Shinkaevolve: Towards open-ended and sample- efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

    Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample- efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

  38. [38]

    Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

    Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

  39. [39]

    Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

  40. [40]

    Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022

    George C Linderman, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, and Yuval Kluger. Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022

  41. [41]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  42. [42]

    Llm4ad: A platform for algorithm design with large language model.arXiv preprint arXiv:2412.17287, 2024

    Fei Liu, Rui Zhang, Zhuoliang Xie, Rui Sun, Kai Li, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Llm4ad: A platform for algorithm design with large language model.arXiv preprint arXiv:2412.17287, 2024

  43. [43]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017

  44. [44]

    Defining and benchmarking open problems in single-cell analysis.Nature Biotechnology, pages 1–6, 2025

    Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, et al. Defining and benchmarking open problems in single-cell analysis.Nature Biotechnology, pages 1–6, 2025

  45. [45]

    Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020

    Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020

  46. [46]

    Improved bounds on the supremum of autoconvolutions.Journal of Mathematical Analysis and Applications, 372(2):439–447, 2010

    Máté Matolcsi and Carlos Vinuesa. Improved bounds on the supremum of autoconvolutions.Journal of Mathematical Analysis and Applications, 372(2):439–447, 2010

  47. [47]

    Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013

  48. [48]

    The effect of natural distribution shift on question answering models

    John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. InInternational conference on machine learning, pages 6905–6916. PMLR, 2020

  49. [49]

    Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018

    Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018

  50. [50]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

  51. [51]

    Peters, K

    J. Peters, K. Muelling, and Y. Altun. Relative entropy policy search. InProceedings of 24th AAAI Conference on Artificial Intelligence (AAAI ’10), pages 1607–1612, July 2010

  52. [52]

    Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025

    Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, and Andrew McCallum. Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025

  53. [53]

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

  54. [54]

    Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020. 22

  55. [55]

    Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

    Christopher D Rosin. Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

  56. [56]

    Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039)

    Sakana. Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039). https://atcoder.jp/contests/ahc039/submissions/59660035, November 2024. AtCoder Heuristic Contest 039 submission page

  57. [57]

    Sakana ai agent wins atcoder heuristic contest (first ai to place 1st)

    Sakana AI. Sakana ai agent wins atcoder heuristic contest (first ai to place 1st). https://sakana.ai/ ahc058/, 2026

  58. [58]

    Meta- learning with memory-augmented neural networks

    Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta- learning with memory-augmented neural networks. InInternational conference on machine learning, pages 1842–1850, 2016

  59. [59]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  60. [60]

    Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393,

    Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393, 2022

  61. [61]

    Openevolve: an open-source evolutionary coding agent, 2025

    Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025

  62. [62]

    zero-shot

    Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3118–3126, 2018

  63. [63]

    David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Do- minik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Masteri...

  64. [64]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018

  65. [65]

    Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

  66. [66]

    Consistent nonparametric regression.The annals of statistics, pages 595–620, 1977

    Charles J Stone. Consistent nonparametric regression.The annals of statistics, pages 595–620, 1977

  67. [67]

    Learning to (learn at test time).arXiv preprint arXiv:2310.13807, 2023

    Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time).arXiv preprint arXiv:2310.13807, 2023

  68. [68]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational Conference on Machine Learning, pages 9229–9248. PMLR, 2020

  69. [69]

    Algorithm discovery with llms: Evolutionary search meets reinforcement learning

    Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with llms: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025

  70. [70]

    The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

    Richard Sutton. The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

  71. [71]

    End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

    Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

  72. [72]

    On a few pitfalls in kl divergence gradient estimation for rl.arXiv preprint arXiv:2506.09477, 2025

    Yunhao Tang and Rémi Munos. On a few pitfalls in kl divergence gradient estimation for rl.arXiv preprint arXiv:2506.09477, 2025. 23

  73. [73]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

  74. [74]

    Three scenarios for continual learning

    Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual learning.arXiv preprint arXiv:1904.07734, 2019

  75. [75]

    Recovering gene interactions from single-cell data using data diffusion.Cell, 174(3):716–729, 2018

    David Van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J Carr, Cassandra Burdziak, Kevin R Moon, Christine L Chaffer, Diwakar Pattabiraman, et al. Recovering gene interactions from single-cell data using data diffusion.Cell, 174(3):716–729, 2018

  76. [76]

    Aanet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity.Cancer Discovery, 2025

    Aarthi Venkat, Scott E Youlten, Beatriz P San Juan, Carley A Purcell, Shabarni Gupta, Matthew Amodio, Daniel P Neumann, John G Lock, Anton E Westacott, Cerys S McCool, et al. Aanet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity.Cancer Discovery, 2025

  77. [77]

    A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8):5362– 5383, 2024

    Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8):5362– 5383, 2024

  78. [78]

    Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

    Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

  79. [79]

    Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example.arXiv preprint arXiv:2504.20571, 2025

  80. [80]

    A new bound for Erdős’ minimum overlap problem.Acta Arithmetica, 208:235–255, 2023

    Ethan Patrick White. A new bound for Erdős’ minimum overlap problem.Acta Arithmetica, 208:235–255, 2023

Showing first 80 references.