pith. machine review for the scientific record.

arxiv: 2605.07775 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 3 Lean theorem links

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords policy ensembles · Thompson sampling · KL regularization · epistemic uncertainty · LLM optimization · black-box optimization · regret bounds · scientific discovery
0 comments

The pith

POETS shows that training a policy ensemble to match KL-regularized rewards on bootstrapped data implicitly performs Thompson sampling, with cumulative regret bounds of O(sqrt(T gamma_T)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POETS as a framework for balancing exploration and exploitation in sequential decision-making and black-box optimization. It starts from the observation that policies trained under Kullback-Leibler regularization already encode an implicit reward function, so an ensemble can be trained directly to capture epistemic uncertainty by fitting to online bootstrapped data. This removes the need for a separate uncertainty-aware reward model followed by policy optimization. The method uses a shared pre-trained backbone with independent LoRA branches to keep compute and memory costs manageable for large language models. The authors prove that the resulting procedure is equivalent to KL-regularized Thompson sampling and therefore inherits the corresponding cumulative regret bound, while experiments show improved sample efficiency on protein search and quantum circuit design tasks.
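To make the loop concrete, here is a minimal numerical sketch of a POETS-style procedure, not the authors' implementation: a frozen random feature map stands in for the shared backbone, K independent linear heads stand in for the LoRA branches, Poisson(1) weights give an online bootstrap, and uniformly sampling one head per round plays the Thompson-sampling draw. All names and constants are illustrative.

```python
# Minimal sketch of a POETS-style loop (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n_actions, d_feat, K, T = 50, 32, 8, 200

X = rng.normal(size=(n_actions, d_feat))              # candidate pool
phi = np.tanh(X @ rng.normal(size=(d_feat, d_feat)))  # frozen "backbone" features
w_true = rng.normal(size=d_feat)

def env_reward(a):                                    # noisy black-box objective
    return phi[a] @ w_true + 0.1 * rng.normal()

heads = rng.normal(scale=0.01, size=(K, d_feat))      # independent "LoRA" heads
data = [[] for _ in range(K)]                         # per-member bootstrapped datasets

for t in range(T):
    k = rng.integers(K)                               # Thompson step: sample one member
    a = int(np.argmax(phi @ heads[k]))                # act greedily under its implicit reward
    r = env_reward(a)
    for i in range(K):                                # online bootstrap: Poisson(1) copies
        data[i] += [(a, r)] * rng.poisson(1.0)
    for i in range(K):                                # reward matching by ridge regression
        if data[i]:
            idx = np.array([a_ for a_, _ in data[i]])
            y = np.array([r_ for _, r_ in data[i]])
            A = phi[idx]
            heads[i] = np.linalg.solve(A.T @ A + 1e-3 * np.eye(d_feat), A.T @ y)

best = max(float(env_reward(int(np.argmax(phi @ h)))) for h in heads)
print(f"best ensemble-member value after {T} rounds: {best:.3f}")
```

The disagreement across heads, induced purely by the bootstrap, is what plays the role of the epistemic posterior here; no explicit reward model or uncertainty head is fit.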

Core claim

POETS bypasses the nested process of training an uncertainty-aware reward model and separately fitting a policy to it. Instead, it directly trains a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online bootstrapped data. Using a shared pre-trained backbone with independent LoRA branches for diversity, the procedure provably amounts to KL-regularized Thompson sampling and therefore inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, the same construction achieves state-of-the-art sample efficiency across protein search and quantum circuit design, and improves optimization trajectories in off-policy settings with experience replay and in small-dataset regimes.

What carries the argument

The compute-efficient policy ensemble that shares a pre-trained LLM backbone while using independent LoRA branches to maintain diversity and directly matches implicitly encoded reward functions to bootstrapped data.
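One plausible instantiation of that matching step, hedged since the paper's exact loss is not reproduced here: each ensemble member $i$ regresses its implicit reward, read off as a scaled log-ratio against the reference policy, onto its own bootstrapped dataset $\mathcal{D}_i$:

$$\mathcal{L}(\theta_i) \;=\; \sum_{(x,\,r)\,\in\,\mathcal{D}_i} \left( \beta \log \frac{\pi_{\theta_i}(x)}{\pi_{\mathrm{ref}}(x)} \;-\; r \right)^{2},$$

where $\pi_{\mathrm{ref}}$ is the frozen pre-trained backbone policy and $\beta$ the KL coefficient. The implicit reward is only identified up to the additive constant $\beta \log Z$ absorbed by the normalizer; see the identity under "Load-bearing premise" below.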

If this is right

  • The procedure inherits cumulative regret bounds of O(sqrt(T gamma_T)) from its equivalence to KL-regularized Thompson sampling.
  • Direct ensemble training on bootstrapped data removes the need for a separate uncertainty-aware reward model and subsequent policy fitting.
  • The shared-backbone plus LoRA architecture enables practical ensembling of large language models under memory and compute limits.
  • Empirical results show state-of-the-art sample efficiency on protein search and quantum circuit design tasks.
  • Optimization trajectories improve in reinforcement-learning settings, especially off-policy with experience replay or small datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LoRA-based ensembling pattern could be tested on other sequential decision problems that currently rely on explicit Bayesian uncertainty estimates.
  • If the implicit-reward encoding holds across different regularization strengths, the framework offers a simpler alternative to full posterior sampling in large-model settings.
  • Extending the bootstrap-matching step to non-stationary environments would require checking whether the regret bound still applies when the implicit reward function drifts.

Load-bearing premise

Policies trained with KL regularization implicitly encode an underlying reward function that an ensemble can match to bootstrapped data to capture epistemic uncertainty without an explicit reward model.
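The standard identity behind this premise, stated here for reference rather than quoted from the paper: the maximizer of the KL-regularized objective has a closed form, so an optimal policy pins down its reward up to a constant.

$$\pi^{*} \;=\; \arg\max_{\pi}\ \mathbb{E}_{x \sim \pi}\!\left[r(x)\right] - \beta\,\mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{ref}}\right) \quad\Longrightarrow\quad \pi^{*}(x) \;=\; \frac{1}{Z}\,\pi_{\mathrm{ref}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right),$$

$$\text{hence}\quad r(x) \;=\; \beta \log \frac{\pi^{*}(x)}{\pi_{\mathrm{ref}}(x)} \;+\; \beta \log Z.$$

This is the same inversion used in direct preference optimization; the premise is that an ensemble of such policies therefore doubles as an ensemble of reward models.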

What would settle it

In a controlled multi-armed bandit setting, measure whether the action-selection distribution produced by the POETS ensemble deviates from the distribution of KL-regularized Thompson sampling or whether realized cumulative regret exceeds the O(sqrt(T gamma_T)) bound.
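A minimal sketch of such a test, assuming a Gaussian K-armed bandit with conjugate posteriors; the arm count, ensemble size, and Poisson(1) bootstrap are illustrative choices, not the paper's protocol.

```python
# Sketch of the proposed falsification test (illustrative, not from the paper).
# Compare explicit Thompson sampling from the conjugate Gaussian posterior with
# a bootstrap-ensemble selector, via the total-variation distance between
# their average action-selection frequencies.
import numpy as np

rng = np.random.default_rng(1)
n_arms, T, K, runs = 5, 300, 8, 50
mu_true = rng.normal(size=n_arms)                    # unknown arm means

def thompson_freq():
    picks = np.zeros(n_arms)
    n, s = np.zeros(n_arms), np.zeros(n_arms)        # counts and reward sums
    for _ in range(T):
        post_mean = s / np.maximum(n, 1)
        post_std = 1.0 / np.sqrt(np.maximum(n, 1))   # unit-noise Gaussian posterior
        a = int(np.argmax(rng.normal(post_mean, post_std)))
        r = mu_true[a] + rng.normal()
        n[a] += 1; s[a] += r; picks[a] += 1
    return picks / T

def ensemble_freq():
    picks = np.zeros(n_arms)
    data = [[[] for _ in range(n_arms)] for _ in range(K)]  # per-member, per-arm rewards
    for _ in range(T):
        k = rng.integers(K)                          # sample one ensemble member
        est = np.array([np.mean(d) if d else rng.normal() for d in data[k]])
        a = int(np.argmax(est))                      # greedy under that member's estimate
        r = mu_true[a] + rng.normal()
        for i in range(K):                           # Poisson(1) online bootstrap
            data[i][a] += [r] * rng.poisson(1.0)
        picks[a] += 1
    return picks / T

ts = np.mean([thompson_freq() for _ in range(runs)], axis=0)
en = np.mean([ensemble_freq() for _ in range(runs)], axis=0)
print("TV distance between selection frequencies:", 0.5 * float(np.abs(ts - en).sum()))
```

Marginal selection frequencies are a coarse proxy: a full test would compare the joint action-selection distributions over rounds and track realized cumulative regret against the O(sqrt(T gamma_T)) envelope.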

Figures

Figures reproduced from arXiv: 2605.07775 by Abbas Rahimi, Andreas Krause, Nicolas Menet.

Figure 1. Previous methods for sample-efficient RL fit an …
Figure 2. POETS balances exploration with exploitation, leading to stable policy diversity. Shown on Protein Search; for other settings, see Section A.2.
Figure 3. POETS reaches SoTA sample efficiency across diverse tasks, even without a replay buffer, with improvements in both best-seen and expected reward …
Figure 4. POETS benefits substantially from a replay buffer (T = 4, 16). In contrast, GRPO overfits to suboptimal early solutions. After zero-initialization of LoRA, the policy heads diversify quickly.
Figure 5. POETS improves the sample efficiency and robustness of RL both on-policy and off-policy. Using a Qwen3-8B model, rollouts are generated in groups of 8 for 4 questions simultaneously. POETS mitigates these failures by implicitly realizing Thompson sampling: because the policy ensemble maintains a calibrated awareness of epistemic uncertainty, it naturally balances exploration and e…
Figure 6. Increasing the replay buffer results in SoTA sample efficiency for quantum circuit design. Recall from Section 5.1.4 that increasing the replay buffer provides consistent improvements in sample efficiency on the protein search task; the experiment is repeated here for quantum circuit design.
Figure 7. POETS does not suffer from the diversity collapse of GRPO. Entropy regularization was added to the quantum circuit design task, since the starting entropy is extremely low. Independent ensembles are known to diversify naturally [Lakshminarayanan et al., 2017]; Section A.3 shows the policy ensemble diversifies effectively despite parameter sharing.
Figure 8. By selecting the correct learning rate for the LoRA heads of …
Figure 9. Increasing the ensemble size results in improved performance. At 8 members, the ensemble is sufficiently diverse.
Figure 10. Increasing the rank of the last-layer LoRA heads increases the diversity of the policy …
Figure 11. Bootstrapping ensures a diversified ensemble that captures epistemic uncertainty.
Figure 12. Entropy/KL-regularization do not achieve the well-calibrated exploration-exploitation …
read the original abstract

Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces POETS, a framework for uncertainty-aware optimization using policy ensembles on LLMs. It claims that KL-regularized policies implicitly encode reward functions, enabling direct training of a LoRA-based ensemble (shared backbone, independent adapters) to capture epistemic uncertainty by matching to online bootstrapped data. Theoretically, it proves that this setup implicitly performs KL-regularized Thompson sampling and inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, it reports state-of-the-art sample efficiency in protein search, quantum circuit design, and RL optimization tasks, including robustness in off-policy and small-data regimes.

Significance. If the central theoretical equivalence holds, POETS would provide a practical advance by integrating uncertainty quantification into LLM policy optimization without nested reward-model training, while the LoRA ensemble architecture addresses compute constraints. The claimed regret bound and cross-domain empirical gains would be notable for sample-efficient black-box optimization if supported by explicit derivations and rigorous controls.

major comments (2)
  1. [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim, that KL-regularized policy training implicitly encodes an underlying reward function and thereby lets the LoRA ensemble perform posterior sampling over rewards, requires an explicit construction of the mapping from ensemble parameters to the reward posterior and a proof that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.
  2. [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.
minor comments (3)
  1. [Theoretical analysis section] Notation for gamma_T in the regret bound should be defined on first use and related to the specific function class or covering number used in the analysis; the standard information-gain definition is sketched after this list.
  2. [Figures] Figure captions for optimization trajectories should state the number of trials and whether shaded regions represent standard error or min/max.
  3. [Discussion] The manuscript should add a limitations paragraph discussing failure modes when the implicit reward encoding assumption is violated (e.g., non-convex policy optimization or insufficient LoRA rank).
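For reference, the standard definition that minor comment 1 presumably targets, drawn from the Gaussian-process bandit literature rather than from this paper's text: gamma_T is the maximum information gain after T observations,

$$\gamma_T \;:=\; \max_{A \subset \mathcal{X},\ |A| = T} I\!\left(\mathbf{y}_A;\, \mathbf{f}_A\right) \;=\; \max_{A}\ \tfrac{1}{2} \log\det\!\left(I + \sigma^{-2} K_A\right),$$

where the log-det form holds for GP models with kernel matrix $K_A$ over the queried set and noise variance $\sigma^2$. The regret bound is only meaningful once the paper ties its function class to such a quantity.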

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We respond to each major comment in turn below.

read point-by-point responses
  1. Referee: [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim, that KL-regularized policy training implicitly encodes an underlying reward function and thereby lets the LoRA ensemble perform posterior sampling over rewards, requires an explicit construction of the mapping from ensemble parameters to the reward posterior and a proof that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.

    Authors: We thank the referee for highlighting the need for explicitness in the theoretical derivation. The manuscript does provide a proof of the implicit KL-regularized Thompson sampling, but to address this concern directly, we will revise the Theoretical analysis section to include an explicit construction of the mapping from the ensemble parameters (including the shared backbone and independent LoRA adapters) to the reward posterior. We will also detail how the online matching to bootstrapped data preserves the KL-regularized objective, ensuring the regret bound follows rigorously rather than definitionally. The abstract will be updated to describe this as a derived result. revision: yes

  2. Referee: [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.

    Authors: We agree that additional experimental controls are necessary to substantiate the state-of-the-art sample-efficiency claims. In the revised manuscript, we will expand the Experimental results section to explicitly list all baselines, report the number of independent runs performed (currently 10 runs for each task), include error bars on all relevant figures, conduct appropriate statistical tests to assess significance, and provide an ablation study isolating the contribution of the bootstrapping procedure. These revisions will ensure the empirical results are robust and the superiority claims are well-supported. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context present the key theoretical claim as an independent proof that POETS implicitly conducts KL-regularized Thompson sampling, inheriting O(sqrt(T gamma_T)) regret bounds from that equivalence. No equations, self-citations, or explicit reductions are available in the text to inspect for definitional equivalence (e.g., the training objective being restated as the sampling procedure by construction). The 'insight' about KL-regularized policies encoding rewards is framed as a grounding premise rather than a fitted or renamed input, and the derivation chain is not shown to collapse to its own inputs. This aligns with the default expectation that most papers maintain independent theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that KL-regularized policies encode reward functions and on the use of bootstrapped data for ensemble training.

axioms (1)
  • domain assumption: Policies trained with KL regularization implicitly encode an underlying reward function.
    This is presented as the central insight that allows bypassing a separate reward model.

pith-pipeline@v0.9.0 · 5554 in / 1228 out tokens · 34171 ms · 2026-05-11T03:36:42.574343+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages · 6 internal anchors

  1. Garnett, Roman. Bayesian Optimization. Cambridge University Press, 2023.
  2. Kahneman, Daniel. Thinking, Fast and Slow. 2011.
  3. Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. 2006.
  4. Entropy and Information Theory.
  5. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy.
  6. Adaptation in Natural and Artificial Systems.
  7. Ginsbourger, David; Le Riche, Rodolphe; Carraro, Laurent. 2010.
  8. Ferruz, Noelia; Schmidt, Steffen; et al. Nature Communications, 2022.
  9. Mathematical Discoveries from Program Search with Large Language Models. Nature, 2024.
  10. Li, Yujia; Choi, David; Chung, Junyoung; Kushman, Nate; Schrittwieser, Julian; Leblond, R.; et al. Competition-Level Code Generation with AlphaCode. Science, 2022.
  11. Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science, 2023.
  12. A Tutorial on Energy-Based Learning. In Predicting Structured Data, 2006.
  13. Desautels, Thomas; Krause, Andreas; Burdick, Joel W. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. 2014.
  14. Greenhill, Stewart; Rana, Santu; Gupta, Sunil; Vellanki, Pratibha; Venkatesh, Svetha. 2020.
  15. Parallel Array and Mixture-Based Synthetic Combinatorial Chemistry: Tools for the Next Millennium. Annual Review of Pharmacology and Toxicology, 2000.
  16. Wang, Ziyu; Hutter, Frank; Zoghi, Masrour; Matheson, David; De Freitas, Nando. 2016.
  17. Cock, Peter J. A.; Antao, Tiago; Chang, Jeffrey T.; Chapman, Brad A.; Cox, Cymon J.; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; et al. 2009.
  18. Correlation Between Stability of a Protein and Its Dipeptide Composition: A Novel Approach for Predicting In Vivo Stability of a Protein from Its Primary Sequence. Protein Engineering, Design and Selection, 1990.
  19. Parallel and Distributed … International Conference on Machine Learning, 2017.
  20. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. International Conference on Machine Learning, 2016.
  21. Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning, 2016.
  22. On the Global Convergence Rates of Softmax Policy Gradient Methods. International Conference on Machine Learning, 2020.
  23. Information-Directed Exploration for Deep Reinforcement Learning. International Conference on Learning Representations.
  24. Yao, Chaorui; Chen, Yanxi; Sun, Yuchang; Chen, Yushuo; Zhang, Wenhao; Pan, Xuchen; Li, Yaliang; Ding, Bolin. Group-Relative … 2026.
  25. A Scalable Laplace Approximation for Neural Networks. International Conference on Learning Representations.
  26. Let's Verify Step by Step. International Conference on Learning Representations.
  27. Reward Model Ensembles Help Mitigate Overoptimization. International Conference on Learning Representations.
  28. Understanding the Impact of Entropy on Policy Optimization. International Conference on Machine Learning, 2019.
  29. Kirschner, Johannes; Mutny, Mojmir; Hiller, Nicole; Ischebeck, Rasmus; Krause, Andreas. Adaptive and Safe … 2019.
  30. Kandasamy, Kirthevasan; Krishnamurthy, Akshay; Schneider, Jeff; et al. Parallelised … International Conference on Artificial Intelligence and Statistics, 2018.
  31. Nava, Elvis; Mutny, Mojmir; Krause, Andreas. Diversified Sampling for Batched … 2022.
  32. Vishwakarma, Sanjay; Harkins, Francis; Golecha, Siddharth; Bajpe, Vishal Sharathchandra; Dupuis, Nicolas; Buratti, Luca; Kremer, David; Faro, Ismael; Puri, Ruchir; Cruz-Benito, Juan. 2024.
  33. Ranković, … Conference on Neural Information Processing Systems 2023 Workshop on Adaptive Experimental Design and Active Learning in the Real World.
  34. Swersky, Kevin; Rubanova, Yulia; Dohan, David; Murphy, Kevin. Amortized … 2020.
  35. Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. Conference on Neural Information Processing Systems, 2022.
  36. Rebel: Reinforcement Learning via Regressing Relative Rewards. Conference on Neural Information Processing Systems, 2024.
  37. The Epoch-Greedy Algorithm for Multi-Armed Bandits with Side Information. Conference on Neural Information Processing Systems, 2007.
  38. Effective Diversity in Population Based Reinforcement Learning. Conference on Neural Information Processing Systems, 2020.
  39. Deep Exploration via Bootstrapped DQN. Conference on Neural Information Processing Systems, 2016.
  40. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Conference on Neural Information Processing Systems, 2023.
  41. Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions. Conference on Neural Information Processing Systems.
  42. Probabilistic Inference in Reinforcement Learning Done Right. Conference on Neural Information Processing Systems.
  43. O'Donoghue, Brendan; Lattimore, Tor. Variational … 2021.
  44. Srinivas, Niranjan; Krause, Andreas; Kakade, Sham M.; Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. 2010.
  45. Chapelle, Olivier; Li, Lihong. An Empirical Evaluation of Thompson Sampling. 2011.
  46. Language Models Are Unsupervised Multitask Learners. OpenAI Blog.
  47. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017.
  48. Eluder Dimension and the Sample Complexity of Optimistic Exploration. Conference on Neural Information Processing Systems, 2013.
  49. Diversity Is All You Need: Learning Skills Without a Reward Function. International Conference on Learning Representations.
  50. Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution. International Conference on Learning Representations.
  51. Lee, Seunghun; Park, Jinyoung; Chu, Jaewon; Yoon, Minseo; Kim, Hyunwoo J. Latent … International Conference on Learning Representations.
  52. Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. 2022.
  53. Menet, Nicolas; Terzic, Aleksandar; Hersche, Michael; Krause, Andreas; Rahimi, Abbas. Thompson Sampling via Fine-Tuning of … 2026.
  54. Kingma, Diederik P.; Ba, Jimmy. Adam: A Method for Stochastic Optimization. 2015.
  55. The Linear Representation Hypothesis and the Geometry of Large Language Models. International Conference on Machine Learning.
  56. Geist, Matthieu; Scherrer, Bruno; Pietquin, Olivier. A Theory of Regularized Markov Decision Processes. 2019.
  57. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. International Conference on Machine Learning.
  58. On Kernelized Multi-Armed Bandits. International Conference on Machine Learning.
  59. Notin, Pascal; et al. Improving Black-Box Optimization in … Conference on Neural Information Processing Systems.
  60. Lakshminarayanan, Balaji; Pritzel, Alexander; Blundell, Charles. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. Conference on Neural Information Processing Systems, 2017.
  61. Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; Dai, Weinan; Fan, Tiantian; Liu, Gaohong; Liu, Juncai; Liu, LingJun; Liu, Xin; Lin, Haibin; Lin, Zhiqi; Ma, Bole; Sheng, Guangming; Tong, Yuxuan; Zhang, Chi; Zhang, Mofan; Zhang, Ru; Zhang, Wang; Zhu, Hang; Zhu, Jinhua; Chen, Jiaze; et al. 2025.
  62. Liu, Zichen; Chen, Changyu; Li, Wenjun; Qi, Penghui; Pang, Tianyu; Du, Chao; Lee, Wee Sun; Lin, Min. Understanding … 2025.
  63. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. Conference on Language Modeling.
  64. Menet, Nicolas; et al. International Conference on Artificial Intelligence and Statistics, 2025.
  65. More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize. International Conference on Machine Learning.
  66. Ahmadian, Arash; Cremer, Chris; et al. Back to Basics: Revisiting … Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
  67. Wang, Zi; Gehring, Clement; Kohli, Pushmeet; Jegelka, Stefanie. Batched Large-Scale … 2018.
  68. Bal, Melis Ilayda; Sessa, Pier Giuseppe; Mutny, Mojmir; Krause, Andreas. Optimistic Games for Combinatorial … 2025.
  69. Using Fast Weights to Deblur Old Memories. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, 1987.
  70. Online Bagging and Boosting. International Workshop on Artificial Intelligence and Statistics, 2001.
  71. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. 1964.
  72. Griffiths, Ryan-Rhys; et al. Constrained … 2020.
  73. Learning to Optimize via Posterior Sampling. Mathematics of Operations Research.
  74. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, vol. 3.
  75. Russo, Daniel; Van Roy, Benjamin. An Information-Theoretic Analysis of Thompson Sampling. 2016.
  76. Russo, Daniel; Van Roy, Benjamin; Kazerouni, Abbas; Osband, Ian; Wen, Zheng; et al. A Tutorial on Thompson Sampling. 2018.
  77. Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
  78. Thompson, William R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
  79. Williams, Ronald J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.
  80. Kool, Wouter; van Hoof, Herke; Welling, Max. Buy 4 … International Conference on Learning Representations 2019 Deep Reinforcement Learning Meets Structured Prediction Workshop.

Showing first 80 references.