arxiv: 2601.16175 · v2 · submitted 2026-01-22 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Learning to Discover at Test Time

Mert Yuksekgonul , Daniel Koceja , Xinhao Li , Federico Bianchi , Jed McCaleb , Xiaolong Wang , Jan Kautz , Yejin Choi

show 3 more authors

James Zou Carlos Guestrin Yu Sun

Authors on Pith no claims yet

Pith reviewed 2026-05-16 05:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords test-time trainingreinforcement learninglarge language modelsproblem discoveryoptimizationcontinuous rewardsalgorithm designGPU kernels

0 comments

The pith

Reinforcement learning at test time on one problem lets an open LLM produce new state-of-the-art solutions for math, coding, and biology tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method that continues training an LLM with reinforcement learning while solving a single test problem, using experience gathered only from that problem. The goal is to refine toward one high-quality solution rather than averaging performance across many problems or relying on a frozen model prompted for search. The authors apply the approach to continuous-reward tasks and report new best results on mathematical inequalities, GPU kernel optimization, algorithm design contests, and single-cell data denoising. These outcomes are obtained with an open model and modest compute cost, and the solutions receive expert review. If the method works as described, it offers a practical route to AI-assisted discovery on hard, narrowly defined problems without requiring larger closed models.

Core claim

TTT-Discover performs reinforcement learning at test time so the LLM continues to train with experience specific to the test problem. The learning objective and search subroutine are designed to prioritize the most promising solutions and thereby produce one great solution for that exact problem rather than many good ones on average. Applied across mathematics, GPU kernel engineering, algorithm design, and biology, the method sets new state-of-the-art results on Erdős' minimum overlap problem, an autocorrelation inequality, a GPUMode kernel competition with up to 2 times faster kernels, past AtCoder algorithm competitions, and a denoising problem in single-cell analysis, all achieved with an

What carries the argument

Test-Time Training to Discover (TTT-Discover), which applies reinforcement learning at test time on experience gathered from one specific problem to refine the model toward a single superior solution.

If this is right

New state-of-the-art solutions become reachable for continuous-reward problems in mathematics, engineering, algorithms, and biology using open models.
Test-time reinforcement learning can outperform prompting a frozen LLM for discovery-oriented search.
Expert-reviewed improvements are achievable in GPU kernel speed and algorithm contest performance.
Results remain reproducible with publicly available code at a cost of a few hundred dollars per problem.
The same training loop can be applied directly to new problems without retraining on a broad distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If test-time training scales reliably, teams could solve narrow but high-value scientific problems by investing modest compute on one instance rather than retraining large models.
The approach might transfer to non-language models if the reinforcement signal can be defined for other continuous optimization domains.
Repeated application across related problems could accumulate specialized knowledge inside a single model instance without full retraining.
Human experts could supply the reward function or final validation step to steer the search toward practically useful rather than merely high-scoring solutions.

Load-bearing premise

That reinforcement learning performed at test time on experience specific to one problem will reliably produce a single superior solution rather than overfitting or failing to improve over frozen-model search.

What would settle it

Reproducing the method on one of the reported problems, such as the GPUMode kernel task, and failing to match or exceed the claimed performance gains with the same open model.

read the original abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erd\H{o}s' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Test-time RL on single problems beats prior frozen-model search in several domains, but the writeup still needs ablations to separate training from extra search compute.

read the letter

The core move is straightforward: instead of prompting a frozen LLM to search for solutions, they run RL at test time on experience from one specific problem so the model can update toward a single strong answer. They apply this to Erdős overlap problems, a GPU kernel contest, old AtCoder tasks, and single-cell denoising, and report new best results that experts or organizers have checked. All of it uses an open 120B model and public code, which is a practical step forward from closed-model baselines. The cost is a few hundred dollars per run, which is manageable for high-value problems. That combination of test-time adaptation plus a discovery-oriented objective is the part that feels new relative to AlphaEvolve-style prompting. The results are presented across four different areas, which gives some breadth. The main weakness is exactly the one in the stress-test note. The abstract and summary do not show ablations that hold the search budget fixed while turning the RL updates on or off, so it is still possible the gains come mostly from longer inference rather than the weight changes. Training curves, variance across runs, and direct comparisons to frozen-model search with matched compute would settle this. The math claims also need explicit verification steps in the paper so readers can see why the new solutions are accepted as improvements. This is the kind of work that belongs in a reading group for people doing test-time methods or automated discovery. It is worth sending to referees because the idea is simple enough to test and the reported outcomes, if they survive controls, would be useful to cite. The current draft is not yet tight enough for acceptance, but the gap is fixable with more targeted experiments.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Test-Time Training to Discover (TTT-Discover), which performs reinforcement learning at test time on experience specific to a single test problem. The goal is to produce one superior solution rather than average performance across problems. The authors apply this to continuous-reward tasks and report new state-of-the-art results on Erdős' minimum overlap problem and an autocorrelation inequality, a GPUMode kernel competition (up to 2× faster), past AtCoder algorithm competitions, and a single-cell denoising task, all using the open gpt-oss-120b model with publicly released code.

Significance. If the central claims are substantiated, the work would be significant for demonstrating that problem-specific test-time RL can yield discovery-level improvements across mathematics, systems engineering, algorithms, and biology while using only an open model and modest compute budgets. The emphasis on reproducibility via public code and the contrast with prior closed-model results are concrete strengths that would lower barriers to AI-assisted discovery if the performance gains are shown to arise from the adaptation mechanism rather than extended search alone.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.
[§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.

minor comments (2)

[Abstract] Abstract: The statement that results were 'reviewed by experts or the organizers' would be strengthened by naming the reviewers or providing links to the review process for each domain.
[Throughout] Throughout: The paper would benefit from explicit reporting of wall-clock time, number of RL steps, and variance across random seeds for each claimed improvement to allow direct comparison with prior frozen-LLM baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the work's potential significance. We address each major comment below and commit to revisions that will strengthen the manuscript's claims regarding the contributions of test-time training.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): No ablation is reported that isolates the contribution of online weight updates from simply running longer search with a frozen model. For the Erdős overlap and GPUMode tasks, any reported improvement could be explained by increased inference-time compute rather than the continual-learning component; without this separation the attribution of SOTA results to TTT-Discover is not secured.

Authors: We agree that isolating the effect of online weight updates from extended inference-time search with a frozen model is crucial for attributing the performance gains to the TTT-Discover mechanism. In the revised version, we will add ablations for the Erdős minimum overlap and GPUMode tasks. These will compare TTT-Discover against a frozen-model baseline that uses the same total compute budget but without weight updates, employing standard prompting or search methods. This will demonstrate whether the continual learning component provides benefits beyond increased search effort. revision: yes
Referee: [§3] §3 (Method): The claim that the learning objective and search subroutine have been redesigned to prioritize promising solutions is load-bearing for the single-solution focus, yet the manuscript supplies no equations, pseudocode, or quantitative comparison showing how these changes differ from standard RL or prevent overfitting on a single instance.

Authors: We recognize the need for greater formality in describing the modifications to the learning objective and search subroutine. In the revision, we will include explicit equations for the redesigned objective function that emphasizes promising solutions, along with pseudocode for the adapted search procedure. Additionally, we will provide a quantitative comparison to standard RL methods, including metrics on solution prioritization and overfitting prevention, such as reward concentration on top solutions and generalization within the single-instance setting. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies standard RL at test time with no self-referential derivations

full rationale

The paper presents TTT-Discover as an application of reinforcement learning performed at test time on problem-specific experience, with objectives and search subroutines redesigned to prioritize promising solutions for a single instance rather than average performance. No equations, derivations, or parameter-fitting steps are described that would reduce any claimed result to its own inputs by construction. The approach is framed as a direct extension of existing RL techniques to the test-time setting, with empirical SOTA results reported across tasks; these outcomes are presented as experimental findings rather than outputs of a closed mathematical chain. No self-citation load-bearing premises, uniqueness theorems imported from prior author work, or ansatz smuggling appear in the description. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5629 in / 1120 out tokens · 33516 ms · 2026-05-16T05:11:47.328644+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence existence_economically_inevitable echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 conditional novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
Harnessing Agentic Evolution
cs.AI 2026-05 unverdicted novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 unverdicted novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
New Bounds for Zarankiewicz Numbers via Reinforced LLM Evolutionary Search
cs.AI 2026-05 accept novelty 7.0

LLM-reinforced evolutionary search produces exact values Z(11,21,3,3)=116, Z(11,22,3,3)=121, Z(12,22,3,3)=132 and lower bounds for 41 additional Zarankiewicz numbers.
Meta-Harness: End-to-End Optimization of Model Harnesses
cs.AI 2026-03 unverdicted novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
Epistemic Uncertainty for Test-Time Discovery
cs.LG 2026-05 unverdicted novelty 6.0

UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
What should post-training optimize? A test-time scaling law perspective
cs.LG 2026-05 unverdicted novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
Evaluation-driven Scaling for Scientific Discovery
cs.LG 2026-04 unverdicted novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation
cs.IR 2026-04 unverdicted novelty 6.0

A jointly learned hierarchical index with cross-attention and residual quantization scales exact retrieval in foundational recommendation models, deployed at Meta with additional performance from test-time training on...
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
cs.AI 2026-04 unverdicted novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution
cs.NE 2026-04 unverdicted novelty 6.0

TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
cs.CL 2026-03 unverdicted novelty 6.0

Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
cs.LG 2026-05 unverdicted novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
Grokability in five inequalities
math.PR 2026-05 unverdicted novelty 5.0

Five improved inequalities were found with AI help: better Gaussian perimeter bounds for convex sets, sharper L2-L1 moments on the Hamming cube, a strengthened autoconvolution inequality, improved g-Sidon set bounds, ...

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 18 Pith papers · 15 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, and Jacob Andreas. The surprising effectiveness of test-time training for few-shot learning.arXiv preprint arXiv:2411.07279, 2024

work page arXiv 2024
[3]

AtCoder.https://atcoder.jp, 2025

AtCoder Inc. AtCoder.https://atcoder.jp, 2025

work page 2025
[4]

Test-time Offline Reinforcement Learning on Goal-related Experience

Marco Bagatella, Mert Albaba, Jonas Hübotter, Georg Martius, and Andreas Krause. Test-time offline reinforcement learning on goal-related experience.arXiv preprint arXiv:2507.18809, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Three convolution inequalities on the real line with connections to additive combinatorics.Journal of Number Theory, 207:42–55, 2020

Richard C Barnard and Stefan Steinerberger. Three convolution inequalities on the real line with connections to additive combinatorics.Journal of Number Theory, 207:42–55, 2020

work page 2020
[6]

Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019

Joshua Batson, Loic Royer, and James Webber. Molecular cross-validation for single-cell rna-seq.BioRxiv, page 786269, 2019

work page 2019
[7]

Neural Combinatorial Optimization with Reinforcement Learning

Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[8]

Local learning algorithms.Neural computation, 4(6):888–900, 1992

Léon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 4(6):888–900, 1992

work page 1992
[9]

An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025

Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality.arXiv preprint arXiv:2506.16750, 2025

work page arXiv 2025
[10]

Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

William S Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

work page 1979
[11]

A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks.IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021

work page 2021
[12]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. Advances in Neural Information Processing Systems, 2022

work page 2022
[14]

Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025

Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale.arXiv preprint arXiv:2511.02864, 2025

work page arXiv 2025
[15]

Dynamic few-shot visual learning without forgetting

Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018. 20

work page 2018
[16]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

work page 2025
[17]

Embracing change: Continual learning in deep neural networks.Trends in cognitive sciences, 24(12):1028–1040, 2020

Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks.Trends in cognitive sciences, 24(12):1028–1040, 2020

work page 2020
[18]

Test-time training on nearest neighbors for large language models.arXiv preprint arXiv:2305.18466, 2023

Moritz Hardt and Yu Sun. Test-time training on nearest neighbors for large language models.arXiv preprint arXiv:2305.18466, 2023

work page arXiv 2023
[19]

Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience- inspired artificial intelligence.Neuron, 95(2):245–258, 2017

work page 2017
[20]

The minimum overlap problem revisited

Jan Kristian Haugland. The minimum overlap problem revisited.arXiv preprint arXiv:1609.08000, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoen- coders are scalable vision learners.CoRR, abs/2111.06377, 2021

work page internal anchor Pith review arXiv 2021
[22]

The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021

work page 2021
[23]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

work page 2022
[24]

Hubert, R

T. Hubert, R. Mehta, L. Sartran, and et al. Olympiad-level formal mathematical reasoning with reinforce- ment learning.Nature, 2025

work page 2025
[25]

Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024

Jonas Hübotter, Sascha Bongni, Ido Hakimi, and Andreas Krause. Efficiently learning at test-time: Active fine-tuning of llms.arXiv preprint arXiv:2410.08020, 2024

work page arXiv 2024
[26]

Learning on the job: Test-time curricula for targeted reinforcement learning.arXiv preprint arXiv:2510.04786, 2025

Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, and Moritz Hardt. Learning on the job: Test-time curricula for targeted reinforcement learning.arXiv preprint arXiv:2510.04786, 2025

work page arXiv 2025
[27]

Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025

Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, and Takuya Akiba. Ale-bench: A benchmark for long-horizon objective-driven algorithm engineering.arXiv preprint arXiv:2506.09050, 2025

work page arXiv 2025
[28]

Online domain adaptation of a pre-trained cascade of classifiers

Vidit Jain and Erik Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In CVPR 2011, pages 577–584. IEEE, 2011

work page 2011
[29]

Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models.arXiv preprint arXiv:2509.24261, 2025

work page arXiv 2025
[30]

Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold.nature, 596(7873):583–589, 2021

work page 2021
[31]

Continual pre-training of language models.arXiv preprint arXiv:2302.03241, 2023

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. Continual pre-training of language models.arXiv preprint arXiv:2302.03241, 2023

work page arXiv 2023
[32]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[33]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

work page 2017
[34]

Wilds: A benchmark of in-the-wild distribution shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. InInternational conference on machine learning, pages 5637–5664. PMLR, 2021. 21

work page 2021
[35]

Dynamic evaluation of neural sequence models

Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. InInternational Conference on Machine Learning, pages 2766–2775. PMLR, 2018

work page 2018
[36]

Tinker, 2025

Thinking Machines Lab. Tinker, 2025

work page 2025
[37]

Shinkaevolve: Towards open-ended and sample- efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample- efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

work page arXiv 2025
[38]

Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, Zhaoxin Yu, Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, et al. Seek in the dark: Reasoning via test-time instance-level policy gradient in latent space.arXiv preprint arXiv:2505.13308, 2025

work page arXiv 2025
[39]

Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

work page 2017
[40]

Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022

George C Linderman, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, and Yuval Kluger. Zero-preserving imputation of single-cell rna-seq data.Nature communications, 13(1):192, 2022

work page 2022
[41]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Llm4ad: A platform for algorithm design with large language model.arXiv preprint arXiv:2412.17287, 2024

Fei Liu, Rui Zhang, Zhuoliang Xie, Rui Sun, Kai Li, Xi Lin, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Llm4ad: A platform for algorithm design with large language model.arXiv preprint arXiv:2412.17287, 2024

work page arXiv 2024
[43]

Gradient episodic memory for continual learning

David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017

work page 2017
[44]

Defining and benchmarking open problems in single-cell analysis.Nature Biotechnology, pages 1–6, 2025

Malte D Luecken, Scott Gigante, Daniel B Burkhardt, Robrecht Cannoodt, Daniel C Strobl, Nikolay S Markov, Luke Zappia, Giovanni Palla, Wesley Lewis, Daniel Dimitrov, et al. Defining and benchmarking open problems in single-cell analysis.Nature Biotechnology, pages 1–6, 2025

work page 2025
[45]

Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020

Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation.ACM Transactions on Graphics (ToG), 39(4):71–1, 2020

work page 2020
[46]

Improved bounds on the supremum of autoconvolutions.Journal of Mathematical Analysis and Applications, 372(2):439–447, 2010

Máté Matolcsi and Carlos Vinuesa. Improved bounds on the supremum of autoconvolutions.Journal of Mathematical Analysis and Applications, 372(2):439–447, 2010

work page 2010
[47]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[48]

The effect of natural distribution shift on question answering models

John Miller, Karl Krauth, Benjamin Recht, and Ludwig Schmidt. The effect of natural distribution shift on question answering models. InInternational conference on machine learning, pages 6905–6916. PMLR, 2020

work page 2020
[49]

Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018

Ravi Teja Mullapudi, Steven Chen, Keyi Zhang, Deva Ramanan, and Kayvon Fatahalian. Online model distillation for efficient video inference.arXiv preprint arXiv:1812.02699, 2018

work page arXiv 2018
[50]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. Alphaevolve: A coding agent for scientific and algorithmic discovery.arXiv preprint arXiv:2506.13131, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Peters, K

J. Peters, K. Muelling, and Y. Altun. Relative entropy policy search. InProceedings of 24th AAAI Conference on Artificial Intelligence (AAAI ’10), pages 1607–1612, July 2010

work page 2010
[52]

Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025

Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, and Andrew McCallum. Migrate: Mixed-policy grpo for adaptation at test-time.arXiv preprint arXiv:2508.08641, 2025

work page arXiv 2025
[53]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[54]

Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp models with checklist.arXiv preprint arXiv:2005.04118, 2020. 22

work page arXiv 2005
[55]

Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

Christopher D Rosin. Multi-armed bandits with episode context.Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011

work page 2011
[56]

Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039)

Sakana. Submission #59660035 — third programming contest 2024 (atcoder heuristic contest 039). https://atcoder.jp/contests/ahc039/submissions/59660035, November 2024. AtCoder Heuristic Contest 039 submission page

work page arXiv 2024
[57]

Sakana ai agent wins atcoder heuristic contest (first ai to place 1st)

Sakana AI. Sakana ai agent wins atcoder heuristic contest (first ai to place 1st). https://sakana.ai/ ahc058/, 2026

work page 2026
[58]

Meta- learning with memory-augmented neural networks

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta- learning with memory-augmented neural networks. InInternational conference on machine learning, pages 1842–1850, 2016

work page 2016
[59]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[60]

Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393,

Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models are continual learners.arXiv preprint arXiv:2205.12393, 2022

work page arXiv 2022
[61]

Openevolve: an open-source evolutionary coding agent, 2025

Asankhaya Sharma. Openevolve: an open-source evolutionary coding agent, 2025

work page 2025
[62]

zero-shot

Assaf Shocher, Nadav Cohen, and Michal Irani. “zero-shot” super-resolution using deep internal learning. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3118–3126, 2018

work page 2018
[63]

David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Do- minik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Masteri...

work page 2016
[64]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018

work page 2018
[65]

Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge.Nature, 550(7676):354–359, 2017

work page 2017
[66]

Consistent nonparametric regression.The annals of statistics, pages 595–620, 1977

Charles J Stone. Consistent nonparametric regression.The annals of statistics, pages 595–620, 1977

work page 1977
[67]

Learning to (learn at test time).arXiv preprint arXiv:2310.13807, 2023

Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time).arXiv preprint arXiv:2310.13807, 2023

work page arXiv 2023
[68]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational Conference on Machine Learning, pages 9229–9248. PMLR, 2020

work page 2020
[69]

Algorithm discovery with llms: Evolutionary search meets reinforcement learning

Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with llms: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025

work page arXiv 2025
[70]

The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

Richard Sutton. The bitter lesson.Incomplete Ideas (blog), 13(1):38, 2019

work page 2019
[71]

End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, et al. End-to-end test-time training for long context.arXiv preprint arXiv:2512.23675, 2025

work page arXiv 2025
[72]

On a few pitfalls in kl divergence gradient estimation for rl.arXiv preprint arXiv:2506.09477, 2025

Yunhao Tang and Rémi Munos. On a few pitfalls in kl divergence gradient estimation for rl.arXiv preprint arXiv:2506.09477, 2025. 23

work page arXiv 2025
[73]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. InProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019

work page 2019
[74]

Three scenarios for continual learning

Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual learning.arXiv preprint arXiv:1904.07734, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[75]

Recovering gene interactions from single-cell data using data diffusion.Cell, 174(3):716–729, 2018

David Van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J Carr, Cassandra Burdziak, Kevin R Moon, Christine L Chaffer, Diwakar Pattabiraman, et al. Recovering gene interactions from single-cell data using data diffusion.Cell, 174(3):716–729, 2018

work page 2018
[76]

Aanet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity.Cancer Discovery, 2025

Aarthi Venkat, Scott E Youlten, Beatriz P San Juan, Carley A Purcell, Shabarni Gupta, Matthew Amodio, Daniel P Neumann, John G Lock, Anton E Westacott, Cerys S McCool, et al. Aanet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity.Cancer Discovery, 2025

work page 2025
[77]

A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8):5362– 5383, 2024

Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application.IEEE transactions on pattern analysis and machine intelligence, 46(8):5362– 5383, 2024

work page 2024
[78]

Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, et al. Thetaevolve: Test-time learning on open problems.arXiv preprint arXiv:2511.23473, 2025

work page arXiv 2025
[79]

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, et al. Reinforcement learning for reasoning in large language models with one training example.arXiv preprint arXiv:2504.20571, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[80]

A new bound for Erdős’ minimum overlap problem.Acta Arithmetica, 208:235–255, 2023

Ethan Patrick White. A new bound for Erdős’ minimum overlap problem.Acta Arithmetica, 208:235–255, 2023

work page 2023

Showing first 80 references.