pith. machine review for the scientific record.

arxiv: 2605.07775 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 3 Lean theorem links

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords policy ensembles · Thompson sampling · KL regularization · epistemic uncertainty · LLM optimization · black-box optimization · regret bounds · scientific discovery
0 comments

The pith

POETS shows that training a policy ensemble to match KL-regularized rewards on bootstrapped data implicitly performs Thompson sampling, with cumulative regret bounds of O(sqrt(T gamma_T)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POETS as a framework for balancing exploration and exploitation in sequential decision-making and black-box optimization. It starts from the observation that policies trained under Kullback-Leibler regularization already encode an implicit reward function, so an ensemble can be trained directly to capture epistemic uncertainty by fitting to online bootstrapped data. This removes the need for a separate uncertainty-aware reward model followed by policy optimization. The method uses a shared pre-trained backbone with independent LoRA branches to keep compute and memory costs manageable for large language models. The authors prove that the resulting procedure is equivalent to KL-regularized Thompson sampling and therefore inherits the corresponding cumulative regret bound, while experiments show improved sample efficiency on protein search and quantum circuit design tasks.
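To make the loop concrete, here is a minimal numerical sketch of a POETS-style procedure, not the authors' implementation: a frozen random feature map stands in for the shared backbone, K independent linear heads stand in for the LoRA branches, Poisson(1) weights give an online bootstrap, and uniformly sampling one head per round plays the Thompson-sampling draw. All names and constants are illustrative.

```python
# Minimal sketch of a POETS-style loop (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n_actions, d_feat, K, T = 50, 32, 8, 200

X = rng.normal(size=(n_actions, d_feat))              # candidate pool
phi = np.tanh(X @ rng.normal(size=(d_feat, d_feat)))  # frozen "backbone" features
w_true = rng.normal(size=d_feat)

def env_reward(a):                                    # noisy black-box objective
    return phi[a] @ w_true + 0.1 * rng.normal()

heads = rng.normal(scale=0.01, size=(K, d_feat))      # independent "LoRA" heads
data = [[] for _ in range(K)]                         # per-member bootstrapped datasets

for t in range(T):
    k = rng.integers(K)                               # Thompson step: sample one member
    a = int(np.argmax(phi @ heads[k]))                # act greedily under its implicit reward
    r = env_reward(a)
    for i in range(K):                                # online bootstrap: Poisson(1) copies
        data[i] += [(a, r)] * rng.poisson(1.0)
    for i in range(K):                                # reward matching by ridge regression
        if data[i]:
            idx = np.array([a_ for a_, _ in data[i]])
            y = np.array([r_ for _, r_ in data[i]])
            A = phi[idx]
            heads[i] = np.linalg.solve(A.T @ A + 1e-3 * np.eye(d_feat), A.T @ y)

best = max(float(env_reward(int(np.argmax(phi @ h)))) for h in heads)
print(f"best ensemble-member value after {T} rounds: {best:.3f}")
```

The disagreement across heads, induced purely by the bootstrap, is what plays the role of the epistemic posterior here; no explicit reward model or uncertainty head is fit.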

Core claim

POETS bypasses the nested process of training an uncertainty-aware reward model and separately fitting a policy to it. Instead, it directly trains a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online bootstrapped data. Using a shared pre-trained backbone with independent LoRA branches for diversity, the procedure provably amounts to KL-regularized Thompson sampling and therefore inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, the same construction achieves state-of-the-art sample efficiency across protein search and quantum circuit design, and improves optimization trajectories in off-policy settings with experience replay and in small-dataset regimes.

What carries the argument

The compute-efficient policy ensemble that shares a pre-trained LLM backbone while using independent LoRA branches to maintain diversity and directly matches implicitly encoded reward functions to bootstrapped data.
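One plausible instantiation of that matching step, hedged since the paper's exact loss is not reproduced here: each ensemble member $i$ regresses its implicit reward, read off as a scaled log-ratio against the reference policy, onto its own bootstrapped dataset $\mathcal{D}_i$:

$$\mathcal{L}(\theta_i) \;=\; \sum_{(x,\,r)\,\in\,\mathcal{D}_i} \left( \beta \log \frac{\pi_{\theta_i}(x)}{\pi_{\mathrm{ref}}(x)} \;-\; r \right)^{2},$$

where $\pi_{\mathrm{ref}}$ is the frozen pre-trained backbone policy and $\beta$ the KL coefficient. The implicit reward is only identified up to the additive constant $\beta \log Z$ absorbed by the normalizer; see the identity under "Load-bearing premise" below.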

If this is right

  • The procedure inherits cumulative regret bounds of O(sqrt(T gamma_T)) from its equivalence to KL-regularized Thompson sampling.
  • Direct ensemble training on bootstrapped data removes the need for a separate uncertainty-aware reward model and subsequent policy fitting.
  • The shared-backbone plus LoRA architecture enables practical ensembling of large language models under memory and compute limits.
  • Empirical results show state-of-the-art sample efficiency on protein search and quantum circuit design tasks.
  • Optimization trajectories improve in reinforcement-learning settings, especially off-policy with experience replay or small datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LoRA-based ensembling pattern could be tested on other sequential decision problems that currently rely on explicit Bayesian uncertainty estimates.
  • If the implicit-reward encoding holds across different regularization strengths, the framework offers a simpler alternative to full posterior sampling in large-model settings.
  • Extending the bootstrap-matching step to non-stationary environments would require checking whether the regret bound still applies when the implicit reward function drifts.

Load-bearing premise

Policies trained with KL regularization implicitly encode an underlying reward function that an ensemble can match to bootstrapped data to capture epistemic uncertainty without an explicit reward model.
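The standard identity behind this premise, stated here for reference rather than quoted from the paper: the maximizer of the KL-regularized objective has a closed form, so an optimal policy pins down its reward up to a constant.

$$\pi^{*} \;=\; \arg\max_{\pi}\ \mathbb{E}_{x \sim \pi}\!\left[r(x)\right] - \beta\,\mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{ref}}\right) \quad\Longrightarrow\quad \pi^{*}(x) \;=\; \frac{1}{Z}\,\pi_{\mathrm{ref}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right),$$

$$\text{hence}\quad r(x) \;=\; \beta \log \frac{\pi^{*}(x)}{\pi_{\mathrm{ref}}(x)} \;+\; \beta \log Z.$$

This is the same inversion used in direct preference optimization; the premise is that an ensemble of such policies therefore doubles as an ensemble of reward models.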

What would settle it

In a controlled multi-armed bandit setting, measure whether the action-selection distribution produced by the POETS ensemble deviates from the distribution of KL-regularized Thompson sampling or whether realized cumulative regret exceeds the O(sqrt(T gamma_T)) bound.
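A minimal sketch of such a test, assuming a Gaussian K-armed bandit with conjugate posteriors; the arm count, ensemble size, and Poisson(1) bootstrap are illustrative choices, not the paper's protocol.

```python
# Sketch of the proposed falsification test (illustrative, not from the paper).
# Compare explicit Thompson sampling from the conjugate Gaussian posterior with
# a bootstrap-ensemble selector, via the total-variation distance between
# their average action-selection frequencies.
import numpy as np

rng = np.random.default_rng(1)
n_arms, T, K, runs = 5, 300, 8, 50
mu_true = rng.normal(size=n_arms)                    # unknown arm means

def thompson_freq():
    picks = np.zeros(n_arms)
    n, s = np.zeros(n_arms), np.zeros(n_arms)        # counts and reward sums
    for _ in range(T):
        post_mean = s / np.maximum(n, 1)
        post_std = 1.0 / np.sqrt(np.maximum(n, 1))   # unit-noise Gaussian posterior
        a = int(np.argmax(rng.normal(post_mean, post_std)))
        r = mu_true[a] + rng.normal()
        n[a] += 1; s[a] += r; picks[a] += 1
    return picks / T

def ensemble_freq():
    picks = np.zeros(n_arms)
    data = [[[] for _ in range(n_arms)] for _ in range(K)]  # per-member, per-arm rewards
    for _ in range(T):
        k = rng.integers(K)                          # sample one ensemble member
        est = np.array([np.mean(d) if d else rng.normal() for d in data[k]])
        a = int(np.argmax(est))                      # greedy under that member's estimate
        r = mu_true[a] + rng.normal()
        for i in range(K):                           # Poisson(1) online bootstrap
            data[i][a] += [r] * rng.poisson(1.0)
        picks[a] += 1
    return picks / T

ts = np.mean([thompson_freq() for _ in range(runs)], axis=0)
en = np.mean([ensemble_freq() for _ in range(runs)], axis=0)
print("TV distance between selection frequencies:", 0.5 * float(np.abs(ts - en).sum()))
```

Marginal selection frequencies are a coarse proxy: a full test would compare the joint action-selection distributions over rounds and track realized cumulative regret against the O(sqrt(T gamma_T)) envelope.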

Figures

Figures reproduced from arXiv: 2605.07775 by Abbas Rahimi, Andreas Krause, Nicolas Menet.

Figure 1. Previous methods for sample-efficient RL fit an …
Figure 2. POETS balances exploration with exploitation, leading to stable policy diversity. Shown on Protein Search; for other settings, see Section A.2.
Figure 3. POETS reaches SoTA sample efficiency across diverse tasks, even without a replay buffer, with improvements in both best-seen and expected reward …
Figure 4. POETS benefits substantially from a replay buffer (T = 4, 16). In contrast, GRPO overfits to suboptimal early solutions. After zero-initialization of LoRA, the policy heads diversify quickly.
Figure 5. POETS improves the sample efficiency and robustness of RL both on-policy and off-policy. Using a Qwen3-8B model, rollouts are generated in groups of 8 for 4 questions simultaneously. POETS mitigates these failures by implicitly realizing Thompson sampling: because the policy ensemble maintains a calibrated awareness of epistemic uncertainty, it naturally balances exploration and e…
Figure 6. Increasing the replay buffer results in SoTA sample efficiency for quantum circuit design. Recall from Section 5.1.4 that increasing the replay buffer provides consistent improvements in sample efficiency on the protein search task; the experiment is repeated here for quantum circuit design.
Figure 7. POETS does not suffer from the diversity collapse of GRPO. Entropy regularization was added to the quantum circuit design task, since the starting entropy is extremely low. Independent ensembles are known to diversify naturally [Lakshminarayanan et al., 2017]; Section A.3 shows the policy ensemble diversifies effectively despite parameter sharing.
Figure 8. By selecting the correct learning rate for the LoRA heads of …
Figure 9. Increasing the ensemble size results in improved performance. At 8 members, the ensemble is sufficiently diverse.
Figure 10. Increasing the rank of the last-layer LoRA heads increases the diversity of the policy …
Figure 11. Bootstrapping ensures a diversified ensemble that captures epistemic uncertainty.
Figure 12. Entropy/KL-regularization do not achieve the well-calibrated exploration-exploitation …
read the original abstract

Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces POETS, a framework for uncertainty-aware optimization using policy ensembles on LLMs. It claims that KL-regularized policies implicitly encode reward functions, enabling direct training of a LoRA-based ensemble (shared backbone, independent adapters) to capture epistemic uncertainty by matching to online bootstrapped data. Theoretically, it proves that this setup implicitly performs KL-regularized Thompson sampling and inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, it reports state-of-the-art sample efficiency in protein search, quantum circuit design, and RL optimization tasks, including robustness in off-policy and small-data regimes.

Significance. If the central theoretical equivalence holds, POETS would provide a practical advance by integrating uncertainty quantification into LLM policy optimization without nested reward-model training, while the LoRA ensemble architecture addresses compute constraints. The claimed regret bound and cross-domain empirical gains would be notable for sample-efficient black-box optimization if supported by explicit derivations and rigorous controls.

major comments (2)
  1. [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim, that KL-regularized policy training implicitly encodes an underlying reward function and thereby lets the LoRA ensemble perform posterior sampling over rewards, requires an explicit construction of the mapping from ensemble parameters to the reward posterior and a proof that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.
  2. [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.
minor comments (3)
  1. [Theoretical analysis section] Notation for gamma_T in the regret bound should be defined on first use and related to the specific function class or covering number used in the analysis; the standard information-gain definition is sketched after this list.
  2. [Figures] Figure captions for optimization trajectories should state the number of trials and whether shaded regions represent standard error or min/max.
  3. [Discussion] The manuscript should add a limitations paragraph discussing failure modes when the implicit reward encoding assumption is violated (e.g., non-convex policy optimization or insufficient LoRA rank).
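For reference, the standard definition that minor comment 1 presumably targets, drawn from the Gaussian-process bandit literature rather than from this paper's text: gamma_T is the maximum information gain after T observations,

$$\gamma_T \;:=\; \max_{A \subset \mathcal{X},\ |A| = T} I\!\left(\mathbf{y}_A;\, \mathbf{f}_A\right) \;=\; \max_{A}\ \tfrac{1}{2} \log\det\!\left(I + \sigma^{-2} K_A\right),$$

where the log-det form holds for GP models with kernel matrix $K_A$ over the queried set and noise variance $\sigma^2$. The regret bound is only meaningful once the paper ties its function class to such a quantity.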

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We respond to each major comment in turn below.

read point-by-point responses
  1. Referee: [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim, that KL-regularized policy training implicitly encodes an underlying reward function and thereby lets the LoRA ensemble perform posterior sampling over rewards, requires an explicit construction of the mapping from ensemble parameters to the reward posterior and a proof that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.

    Authors: We thank the referee for highlighting the need for explicitness in the theoretical derivation. The manuscript does provide a proof of the implicit KL-regularized Thompson sampling, but to address this concern directly, we will revise the Theoretical analysis section to include an explicit construction of the mapping from the ensemble parameters (including the shared backbone and independent LoRA adapters) to the reward posterior. We will also detail how the online matching to bootstrapped data preserves the KL-regularized objective, ensuring the regret bound follows rigorously rather than definitionally. The abstract will be updated to describe this as a derived result. revision: yes

  2. Referee: [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.

    Authors: We agree that additional experimental controls are necessary to substantiate the state-of-the-art sample-efficiency claims. In the revised manuscript, we will expand the Experimental results section to explicitly list all baselines, report the number of independent runs performed (currently 10 runs for each task), include error bars on all relevant figures, conduct appropriate statistical tests to assess significance, and provide an ablation study isolating the contribution of the bootstrapping procedure. These revisions will ensure the empirical results are robust and the superiority claims are well-supported. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context present the key theoretical claim as an independent proof that POETS implicitly conducts KL-regularized Thompson sampling, inheriting O(sqrt(T gamma_T)) regret bounds from that equivalence. No equations, self-citations, or explicit reductions are available in the text to inspect for definitional equivalence (e.g., the training objective being restated as the sampling procedure by construction). The 'insight' about KL-regularized policies encoding rewards is framed as a grounding premise rather than a fitted or renamed input, and the derivation chain is not shown to collapse to its own inputs. This aligns with the default expectation that most papers maintain independent theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that KL-regularized policies encode reward functions and on the use of bootstrapped data for ensemble training.

axioms (1)
  • domain assumption: Policies trained with KL regularization implicitly encode an underlying reward function.
    This is presented as the central insight that allows bypassing a separate reward model.

pith-pipeline@v0.9.0 · 5554 in / 1228 out tokens · 34171 ms · 2026-05-11T03:36:42.574343+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 6 internal anchors

  1. Garnett, Roman. Bayesian Optimization. Cambridge University Press, 2023.
  2. Kahneman, Daniel. Thinking, Fast and Slow. 2011.
  3. Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. 2006.
  4. Entropy and Information Theory.
  5. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy.
  6. Adaptation in Natural and Artificial Systems.
  7. Ginsbourger, David; Le Riche, Rodolphe; Carraro, Laurent. 2010.
  8. Ferruz, Noelia; Schmidt, Steffen; et al. Nature Communications, 2022.
  9. Mathematical Discoveries from Program Search with Large Language Models. Nature, 2024.
  10. Li, Yujia; Choi, David; Chung, Junyoung; Kushman, Nate; Schrittwieser, Julian; Leblond, R.; et al. Competition-Level Code Generation with AlphaCode. Science, 2022.
  11. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science, 2023.
  12. A Tutorial on Energy-Based Learning. In Predicting Structured Data, 2006.
  13. Desautels, Thomas; Krause, Andreas; Burdick, Joel W. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. 2014.
  14. Greenhill, Stewart; Rana, Santu; Gupta, Sunil; Vellanki, Pratibha; Venkatesh, Svetha. 2020.
  15. Parallel Array and Mixture-Based Synthetic Combinatorial Chemistry: Tools for the Next Millennium. Annual Review of Pharmacology and Toxicology, 2000.
  16. Wang, Ziyu; Hutter, Frank; Zoghi, Masrour; Matheson, David; De Freitas, Nando. 2016.
  17. Cock, Peter J. A.; Antao, Tiago; Chang, Jeffrey T.; Chapman, Brad A.; Cox, Cymon J.; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; et al. 2009.
  18. Correlation Between Stability of a Protein and Its Dipeptide Composition: A Novel Approach for Predicting In Vivo Stability of a Protein from Its Primary Sequence. Protein Engineering, Design and Selection, 1990.
  19. Parallel and Distributed … International Conference on Machine Learning, 2017.
  20. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. International Conference on Machine Learning, 2016.
  21. Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning, 2016.
  22. On the Global Convergence Rates of Softmax Policy Gradient Methods. International Conference on Machine Learning, 2020.
  23. Information-Directed Exploration for Deep Reinforcement Learning. International Conference on Learning Representations.
  24. Yao, Chaorui; Chen, Yanxi; Sun, Yuchang; Chen, Yushuo; Zhang, Wenhao; Pan, Xuchen; Li, Yaliang; Ding, Bolin. Group-Relative … 2026.
  25. A Scalable Laplace Approximation for Neural Networks. International Conference on Learning Representations.
  26. Let's Verify Step by Step. International Conference on Learning Representations.
  27. Reward Model Ensembles Help Mitigate Overoptimization. International Conference on Learning Representations.
  28. Understanding the Impact of Entropy on Policy Optimization. International Conference on Machine Learning, 2019.
  29. Kirschner, Johannes; Mutny, Mojmir; Hiller, Nicole; Ischebeck, Rasmus; Krause, Andreas. Adaptive and Safe … 2019.
  30. Kandasamy, Kirthevasan; Krishnamurthy, Akshay; Schneider, Jeff; et al. Parallelised … International Conference on Artificial Intelligence and Statistics, 2018.
  31. Nava, Elvis; Mutny, Mojmir; Krause, Andreas. Diversified Sampling for Batched … 2022.
  32. Vishwakarma, Sanjay; Harkins, Francis; Golecha, Siddharth; Bajpe, Vishal Sharathchandra; Dupuis, Nicolas; Buratti, Luca; Kremer, David; Faro, Ismael; Puri, Ruchir; Cruz-Benito, Juan. 2024.
  33. Ranković, … Conference on Neural Information Processing Systems 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
  34. Swersky, Kevin; Rubanova, Yulia; Dohan, David; Murphy, Kevin. Amortized … 2020.
  35. Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. Conference on Neural Information Processing Systems, 2022.
  36. Rebel: Reinforcement Learning via Regressing Relative Rewards. Conference on Neural Information Processing Systems, 2024.
  37. The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information. Conference on Neural Information Processing Systems, 2007.
  38. Effective Diversity in Population Based Reinforcement Learning. Conference on Neural Information Processing Systems, 2020.
  39. Deep Exploration via Bootstrapped DQN. Conference on Neural Information Processing Systems, 2016.
  40. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Conference on Neural Information Processing Systems, 2023.
  41. Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions. Conference on Neural Information Processing Systems.
  42. Probabilistic Inference in Reinforcement Learning Done Right. Conference on Neural Information Processing Systems.
  43. O'Donoghue, Brendan; Lattimore, Tor. Variational … 2021.
  44. Srinivas, Niranjan; Krause, Andreas; Kakade, Sham M.; Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. 2010.
  45. Chapelle, Olivier; Li, Lihong. An Empirical Evaluation of Thompson Sampling. 2011.
  46. Language Models Are Unsupervised Multitask Learners. OpenAI Blog.
  47. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017.
  48. Eluder Dimension and the Sample Complexity of Optimistic Exploration. Conference on Neural Information Processing Systems, 2013.
  49. Diversity Is All You Need: Learning Skills Without a Reward Function. International Conference on Learning Representations.
  50. Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution. International Conference on Learning Representations.
  51. Lee, Seunghun; Park, Jinyoung; Chu, Jaewon; Yoon, Minseo; Kim, Hyunwoo J. Latent … International Conference on Learning Representations.
  52. Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. 2022.
  53. Menet, Nicolas; Terzic, Aleksandar; Hersche, Michael; Krause, Andreas; Rahimi, Abbas. Thompson Sampling via Fine-Tuning of … 2026.
  54. Kingma, Diederik P.; Ba, Jimmy. Adam: A Method for Stochastic Optimization. 2015.
  55. The Linear Representation Hypothesis and the Geometry of Large Language Models. International Conference on Machine Learning.
  56. Geist, Matthieu; Scherrer, Bruno; Pietquin, Olivier. A Theory of Regularized Markov Decision Processes. 2019.
  57. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning.
  58. On Kernelized Multi-Armed Bandits. International Conference on Machine Learning.
  59. Notin, Pascal; et al. Improving Black-Box Optimization in … Conference on Neural Information Processing Systems.
  60. Lakshminarayanan, Balaji; Pritzel, Alexander; Blundell, Charles. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. Conference on Neural Information Processing Systems, 2017.
  61. Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Juncai; Liu, LingJun; Liu, Xin; Lin, Haibin; Lin, Zhiqi; Ma, Bole; Sheng, Guangming; Tong, Yuxuan; Zhang, Chi; Zhang, Mofan; Zhang, Ru; Zhang, Wang; Zhu, Hang; Zhu, Jinhua; Chen, Jiaze; et al. 2025.
  62. Liu, Zichen; Chen, Changyu; Li, Wenjun; Qi, Penghui; Pang, Tianyu; Du, Chao; Lee, Wee Sun; Lin, Min. Understanding … 2025.
  63. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. Conference on Language Modeling.
  64. Menet, Nicolas; et al. International Conference on Artificial Intelligence and Statistics, 2025.
  65. More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize. International Conference on Machine Learning.
  66. Ahmadian, Arash; Cremer, Chris; et al. Back to Basics: Revisiting … Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
  67. Wang, Zi; Gehring, Clement; Kohli, Pushmeet; Jegelka, Stefanie. Batched Large-Scale … 2018.
  68. Bal, Melis Ilayda; Sessa, Pier Giuseppe; Mutny, Mojmir; Krause, Andreas. Optimistic Games for Combinatorial … 2025.
  69. Using Fast Weights to Deblur Old Memories. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, 1987.
  70. Online Bagging and Boosting. International Workshop on Artificial Intelligence and Statistics, 2001.
  71. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. 1964.
  72. Griffiths, Ryan-Rhys; et al. Constrained … 2020.
  73. Learning to Optimize via Posterior Sampling. Mathematics of Operations Research.
  74. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, vol. 3.
  75. Russo, Daniel; Van Roy, Benjamin. An Information-Theoretic Analysis of Thompson Sampling. 2016.
  76. Russo, Daniel; Van Roy, Benjamin; Kazerouni, Abbas; Osband, Ian; Wen, Zheng; et al. A Tutorial on Thompson Sampling. 2018.
  77. Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
  78. Thompson, William R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
  79. Williams, Ronald J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.
  80. Kool, Wouter; van Hoof, Herke; Welling, Max. Buy 4 … International Conference on Learning Representations 2019 Deep Reinforcement Learning Meets Structured Prediction Workshop.

Showing first 80 references.