The Power of Test-Time Training for Approximate Sampling

Ankur Moitra; Dhruv Rohatgi; Noah Golowich

arxiv: 2606.11437 · v1 · pith:JAARSWONnew · submitted 2026-06-09 · 💻 cs.DS · cs.AI· cs.LG· stat.ML

The Power of Test-Time Training for Approximate Sampling

Noah Golowich , Ankur Moitra , Dhruv Rohatgi This is my paper

Pith reviewed 2026-06-27 11:05 UTC · model grok-4.3

classification 💻 cs.DS cs.AIcs.LGstat.ML

keywords test-time trainingapproximate samplingquery complexitydensity oraclerandom walkcounting-to-samplinggenerative models

0 comments

The pith

Sampling from large classes of distributions requires quadratically many queries to an approximate density oracle, proving random walks optimal, while bounded classes allow fewer queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes test-time training as sampling from an unknown distribution μ* drawn from a known class F, using only an oracle that returns approximate density values at queried points. It first establishes that any algorithm must make quadratically many oracle queries when F is large, matching the cost of the classic random-walk method and settling an open question on its optimality. The paper then shows that restricting the size of F removes this quadratic barrier and permits more efficient sampling procedures. This supplies a clean theoretical model for why adapting a model to the specific instance at test time can improve sampling performance.

Core claim

We show a quadratic lower bound on the query complexity of sampling from μ∗ given query access to ˆμ (for sufficiently large classes F), thus showing that the random walk approach proposed by Jerrum & Sinclair (1989) and refined by Hayes & Sinclair (2010), is optimal. This answers an open question posed by Hayes & Sinclair. We then show that this lower bound can be circumvented if the size of F is bounded appropriately.

What carries the argument

Query complexity of producing a sample from μ* using an approximate-density oracle ˆμ, parameterized by the cardinality of the known class F of possible target distributions.

If this is right

For every class F above the size threshold, no sampling procedure can improve on quadratic query cost.
The Jerrum-Sinclair random walk achieves the optimal query complexity for all sufficiently large F.
When |F| is below the threshold, there exist sampling algorithms whose query cost is sub-quadratic.
The bounded-F regime supplies a formal abstraction of test-time training that can be analyzed with existing counting-to-sampling techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In practice, the effective class size for an LLM may be small enough under a fixed prompt distribution that test-time adaptation yields measurable query savings.
The results motivate studying how to certify or enforce small effective |F| in concrete generative-model settings.
Similar lower-bound and circumvention arguments may apply to other adaptive-inference tasks such as test-time fine-tuning for reasoning.

Load-bearing premise

The oracle supplies sufficiently accurate approximate density estimates, and the quadratic lower bound applies precisely when the class F exceeds a certain size threshold.

What would settle it

An explicit algorithm that produces a sample from μ* using o(n²) queries to ˆμ when F is larger than the size threshold would falsify the lower bound.

read the original abstract

Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems. The efficacy of such sampling algorithms is limited, however, by the relationship between the LLM and the particular sampling task at hand, which has motivated the framework of test-time training (TTT). TTT works by updating a model's weights in response to partial generations and reward feedback received at inference time, thus adapting to the particular problem. In this work, we propose a formalization for TTT as the problem of producing a sample from a given probability measure $\mu^\star$ belonging to a known class ${F}$ of distributions, given an oracle $\hat \mu$ which yields approximate density estimates for $\mu^\star$. This is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant & Vazirani (1986) and Jerrum & Sinclair (1989): namely, when ${F}$ is the class of all distributions, it coincides exactly with the aforementioned counting-to-sampling reduction. In this paper, we first show a quadratic lower bound on the query complexity of sampling from $\mu^\star$ given query access to $\hat \mu$ (for sufficiently large classes ${F}$), thus showing that the random walk approach proposed by Jerrum & Sinclair (1989) and refined by Hayes & Sinclair (2010), is optimal. This answers an open question posed by Hayes & Sinclair. We then show that this lower bound can be circumvented if the size of ${F}$ is bounded appropriately. As we discuss, this latter result can be viewed as an abstraction of TTT, and thus represents a starting point for the development of a principled theoretical framework for TTT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Settles the Hayes-Sinclair open question with a quadratic lower bound on sampling queries and gives a clean abstraction of TTT for small classes.

read the letter

This paper formalizes test-time training as the task of sampling from an unknown distribution μ* in a known class F, given an oracle that returns approximate density values for μ*. It proves a quadratic query lower bound that holds for large enough F and shows this matches the random-walk upper bound from the Jerrum-Sinclair line, directly answering the open question posed by Hayes and Sinclair. It then gives a positive result that beats the quadratic bound when |F| is bounded.

The lower bound and the bounded-class circumvention are the genuinely new pieces. The formalization itself is a useful abstraction because it recovers the classic counting-to-sampling setting exactly when F contains all distributions. The paper keeps the model simple and states the scope clearly: the lower bound applies only for sufficiently large classes, and the improvement is only claimed for bounded ones.

The oracle model lines up with prior work, and the stress-test found no internal contradictions or unsecured assumptions. The main limitation right now is that we only have the abstract, so the precise constants, error handling in the oracle, and the details of the bounded-class construction cannot be checked yet. That is a standard early-stage issue rather than a flaw in the stated claims.

The work is aimed at researchers who already know the approximate counting and sampling literature and want to see how TTT ideas fit inside it. A reader who has followed the Jerrum-Valiant-Vazirani and Hayes-Sinclair results will get immediate value. It deserves a serious referee because it resolves a concrete open question with matching bounds in different regimes and supplies a starting formal framework. I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The paper formalizes test-time training (TTT) as sampling from an unknown μ∗ ∈ F (known class of distributions) given query access to an oracle ˆμ supplying approximate density estimates. When F is the class of all distributions this reduces exactly to the classical counting-to-sampling problem. The central results are (i) a quadratic query-complexity lower bound that holds for all sufficiently large F, establishing optimality of the Jerrum–Sinclair random-walk approach and answering an open question of Hayes & Sinclair, and (ii) a circumvention of the quadratic bound when |F| is suitably bounded, which the authors present as a theoretical abstraction of TTT.

Significance. If the claimed proofs are correct, the work supplies the first query-complexity lower bound that matches the long-standing random-walk upper bound in the general case and simultaneously exhibits a regime (bounded |F|) in which substantially fewer queries suffice. This supplies a clean complexity-theoretic explanation for why TTT can be effective when the underlying model class is restricted, and it places modern LLM sampling techniques on the same footing as the classical Jerrum–Valiant–Vazirani and Jerrum–Sinclair results. The absence of free parameters or self-referential reductions in the stated argument is a strength.

minor comments (2)

The abstract states that proofs exist for both the quadratic lower bound and the bounded-class circumvention, but the precise definition of the oracle model (additive vs. multiplicative error, dependence on the approximation parameter), the exact threshold on |F| that triggers the lower bound, and the error terms in the reductions are not visible in the provided abstract. These details are load-bearing for verifying tightness.
Notation for the target measure μ∗ and the oracle ˆμ is introduced cleanly, but the manuscript should include an explicit comparison table or paragraph showing how the new oracle model specializes to the standard counting oracle when F is unrestricted.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and for highlighting the significance of the quadratic lower bound matching the Jerrum-Sinclair random walk, the resolution of the Hayes-Sinclair question, and the bounded-|F| circumvention as a theoretical abstraction of TTT. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No circularity; lower bounds are independent of self-referential inputs

full rationale

The paper establishes query-complexity lower bounds for sampling from μ* given ˆμ-oracle access when F is large, showing optimality of the Jerrum-Sinclair random walk, and shows circumvention for bounded |F|. These are standard information-theoretic and reduction arguments in the counting-to-sampling setting. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citations are load-bearing for the central claims (citations are to Jerrum-Sinclair 1989 and Hayes-Sinclair 2010, independent of the present authors), and the oracle model is explicitly aligned with prior work without ansatz smuggling or renaming. The derivation chain is self-contained against external complexity benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The formalization rests on standard background results from probability and complexity theory; no free parameters, ad-hoc axioms, or new invented entities are introduced in the abstract.

axioms (2)

domain assumption Existence of oracles that return approximate density estimates for distributions in a known class F.
Invoked when defining the input model for the sampling problem.
standard math Standard notions of query complexity and total variation distance from theoretical computer science.
Used to state the lower and upper bounds.

pith-pipeline@v0.9.1-grok · 5888 in / 1437 out tokens · 37326 ms · 2026-06-27T11:05:35.031793+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 26 canonical work pages · 12 internal anchors

[1]

Log-concave polynomials iv: Exchange properties, tight mixing times, and faster sampling of spanning trees.arXiv preprint arXiv:2004.07220,

[ALOG+20] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, Cynthia Vinzant, and Thuy Duong Vuong. Log-concave polynomials iv: Exchange properties, tight mixing times, and faster sampling of spanning trees.arXiv preprint arXiv:2004.07220,

work page arXiv 2004
[2]

Language Models are Few-Shot Learners

[BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners.arXiv preprint arXiv:2005.14165,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[3]

Spectral independence via stability and applications to holant-type problems.arXiv preprint arXiv:2106.03366,

[CLV21] Zongchen Chen, Kuikui Liu, and Eric Vigoda. Spectral independence via stability and applications to holant-type problems.arXiv preprint arXiv:2106.03366,

work page arXiv
[4]

Foster, Zakaria Mhammedi, and Dhruv Rohatgi

[FMR25] Dylan J. Foster, Zakaria Mhammedi, and Dhruv Rohatgi. Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration.arXiv preprint arXiv:2503.07453,

work page arXiv
[5]

Foster, and Akshay Krishnamurthy

[GCR+26] Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, and Akshay Krishnamurthy. Reject, resample, repeat: Understand- ing parallel reasoning in language model inference.arXiv preprint arXiv:2603.07887,

work page arXiv
[6]

Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

[GMAM25] Jonathan Geuter, Youssef Mroueh, and David Alvarez-Melis. Guided speculative inference for efficient test-time alignment of LLMs.arXiv preprint arXiv:2506.04118,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Denoising Diffusion Probabilistic Models

[HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[8]

Test-time learning for large language models

[HZC+25] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633,

work page arXiv
[9]

Let's Verify Step by Step

[LKB+23] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

GPT-4 Technical Report

[Ope23] OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods.arXiv preprint arXiv:2502.01618,

53 [PSX+25] Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods.arXiv preprint arXiv:2502.01618,

work page arXiv
[12]

[RSS+25] Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, and Dylan J Foster

Lecture notes. [RSS+25] Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, and Dylan J Foster. Taming imperfect process verifiers: A sampling per- spective on backtracking.arXiv preprint arXiv:2510.03149,

work page arXiv
[13]

Non-asymptotic Error Bounds for Sequential MCMC Methods in Multimodal Settings

[Sch12] Nikolaus Schweizer. Non-asymptotic error bounds for sequential mcmc methods in multimodal settings.arXiv preprint arXiv:1205.6733,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

[SDWMG15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.arXiv preprint arXiv:1503.03585,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Singhal, Z

[SHT+25] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848,

work page arXiv
[16]

Learning to (learn at test time)

[SLD+23] Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiao- long Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807,

work page arXiv
[17]

Score-Based Generative Modeling through Stochastic Differential Equations

[SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Er- mon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456,

work page internal anchor Pith review Pith/arXiv arXiv 2011
[18]

Efros, and Moritz Hardt

[SWL+19] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. arXiv preprint arXiv:1909.13231,

work page arXiv 1909
[19]

Gemini: A Family of Highly Capable Multimodal Models

[Tea23] Gemini Team. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Value-guided search for efficient chain-of-thought reasoning

[WZC+25] Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kiant´ e Brantley, and Wen Sun. Value-guided search for efficient chain-of-thought reasoning. arXiv preprint arXiv:2505.17373,

work page arXiv
[21]

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint.arXiv preprint arXiv:2312.11456,

54 [XDY+24] Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint.arXiv preprint arXiv:2312.11456,

work page arXiv
[22]

Fudge: Controlled text generation with future discrimi- nators

[YK21] Kevin Yang and Dan Klein. Fudge: Controlled text generation with future discrimi- nators. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3511–3535,

2021
[23]

Learning to Discover at Test Time

[YKL+26] Mert Yuksekg¨ on¨ ul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xi- aolong Wang, Jan Kautz, Yejin Choi, James Zou, Yu Sun, and Carlos Guestrin. Learning to discover at test time.arXiv preprint arXiv:2601.16175,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Ttcs: Test-time curriculum synthesis for self- evolving.arXiv preprint arXiv:2601.22628,

[YXT+26] Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, and Jinsong Su. Ttcs: Test-time curriculum synthesis for self- evolving.arXiv preprint arXiv:2601.22628,

work page arXiv
[25]

On the power of (approximate) reward models for inference-time scaling.arXiv preprint arXiv:2602.01381,

[ZL26] Youheng Zhu and Yiping Lu. On the power of (approximate) reward models for inference-time scaling.arXiv preprint arXiv:2602.01381,

work page arXiv
[26]

TTRL: Test-Time Reinforcement Learning

[ZZS+25] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

[ZZW+25] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning.arXiv preprint arXiv:2501.07301,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Log-concave polynomials iv: Exchange properties, tight mixing times, and faster sampling of spanning trees.arXiv preprint arXiv:2004.07220,

[ALOG+20] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, Cynthia Vinzant, and Thuy Duong Vuong. Log-concave polynomials iv: Exchange properties, tight mixing times, and faster sampling of spanning trees.arXiv preprint arXiv:2004.07220,

work page arXiv 2004

[2] [2]

Language Models are Few-Shot Learners

[BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language models are few-shot learners.arXiv preprint arXiv:2005.14165,

work page internal anchor Pith review Pith/arXiv arXiv 2005

[3] [3]

Spectral independence via stability and applications to holant-type problems.arXiv preprint arXiv:2106.03366,

[CLV21] Zongchen Chen, Kuikui Liu, and Eric Vigoda. Spectral independence via stability and applications to holant-type problems.arXiv preprint arXiv:2106.03366,

work page arXiv

[4] [4]

Foster, Zakaria Mhammedi, and Dhruv Rohatgi

[FMR25] Dylan J. Foster, Zakaria Mhammedi, and Dhruv Rohatgi. Is a good foundation necessary for efficient reinforcement learning? the computational role of the base model in exploration.arXiv preprint arXiv:2503.07453,

work page arXiv

[5] [5]

Foster, and Akshay Krishnamurthy

[GCR+26] Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich, Dylan J. Foster, and Akshay Krishnamurthy. Reject, resample, repeat: Understand- ing parallel reasoning in language model inference.arXiv preprint arXiv:2603.07887,

work page arXiv

[6] [6]

Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

[GMAM25] Jonathan Geuter, Youssef Mroueh, and David Alvarez-Melis. Guided speculative inference for efficient test-time alignment of LLMs.arXiv preprint arXiv:2506.04118,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Denoising Diffusion Probabilistic Models

[HJA20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[8] [8]

Test-time learning for large language models

[HZC+25] Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models. arXiv preprint arXiv:2505.20633,

work page arXiv

[9] [9]

Let's Verify Step by Step

[LKB+23] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

GPT-4 Technical Report

[Ope23] OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods.arXiv preprint arXiv:2502.01618,

53 [PSX+25] Isha Puri, Shivchander Sudalairaj, Guangxuan Xu, Kai Xu, and Akash Srivastava. A probabilistic inference approach to inference-time scaling of llms using particle-based monte carlo methods.arXiv preprint arXiv:2502.01618,

work page arXiv

[12] [12]

[RSS+25] Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, and Dylan J Foster

Lecture notes. [RSS+25] Dhruv Rohatgi, Abhishek Shetty, Donya Saless, Yuchen Li, Ankur Moitra, Andrej Risteski, and Dylan J Foster. Taming imperfect process verifiers: A sampling per- spective on backtracking.arXiv preprint arXiv:2510.03149,

work page arXiv

[13] [13]

Non-asymptotic Error Bounds for Sequential MCMC Methods in Multimodal Settings

[Sch12] Nikolaus Schweizer. Non-asymptotic error bounds for sequential mcmc methods in multimodal settings.arXiv preprint arXiv:1205.6733,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

[SDWMG15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics.arXiv preprint arXiv:1503.03585,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Singhal, Z

[SHT+25] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models.arXiv preprint arXiv:2501.06848,

work page arXiv

[16] [16]

Learning to (learn at test time)

[SLD+23] Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiao- long Wang, Tatsunori Hashimoto, and Xinlei Chen. Learning to (learn at test time). arXiv preprint arXiv:2310.13807,

work page arXiv

[17] [17]

Score-Based Generative Modeling through Stochastic Differential Equations

[SSDK+20] Yang Song, Jascha Sohl-Dickstein, Diederik Kingma, Abhishek Kumar, Stefano Er- mon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456,

work page internal anchor Pith review Pith/arXiv arXiv 2011

[18] [18]

Efros, and Moritz Hardt

[SWL+19] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. arXiv preprint arXiv:1909.13231,

work page arXiv 1909

[19] [19]

Gemini: A Family of Highly Capable Multimodal Models

[Tea23] Gemini Team. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Value-guided search for efficient chain-of-thought reasoning

[WZC+25] Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kiant´ e Brantley, and Wen Sun. Value-guided search for efficient chain-of-thought reasoning. arXiv preprint arXiv:2505.17373,

work page arXiv

[21] [21]

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint.arXiv preprint arXiv:2312.11456,

54 [XDY+24] Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint.arXiv preprint arXiv:2312.11456,

work page arXiv

[22] [22]

Fudge: Controlled text generation with future discrimi- nators

[YK21] Kevin Yang and Dan Klein. Fudge: Controlled text generation with future discrimi- nators. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3511–3535,

2021

[23] [23]

Learning to Discover at Test Time

[YKL+26] Mert Yuksekg¨ on¨ ul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xi- aolong Wang, Jan Kautz, Yejin Choi, James Zou, Yu Sun, and Carlos Guestrin. Learning to discover at test time.arXiv preprint arXiv:2601.16175,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Ttcs: Test-time curriculum synthesis for self- evolving.arXiv preprint arXiv:2601.22628,

[YXT+26] Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, and Jinsong Su. Ttcs: Test-time curriculum synthesis for self- evolving.arXiv preprint arXiv:2601.22628,

work page arXiv

[25] [25]

On the power of (approximate) reward models for inference-time scaling.arXiv preprint arXiv:2602.01381,

[ZL26] Youheng Zhu and Yiping Lu. On the power of (approximate) reward models for inference-time scaling.arXiv preprint arXiv:2602.01381,

work page arXiv

[26] [26]

TTRL: Test-Time Reinforcement Learning

[ZZS+25] Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning. arXiv preprint arXiv:2504.16084,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

[ZZW+25] Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning.arXiv preprint arXiv:2501.07301,

work page internal anchor Pith review Pith/arXiv arXiv