POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles
Recognition: 3 theorem links
Pith reviewed 2026-05-11 03:36 UTC · model grok-4.3
The pith
POETS shows that training a policy ensemble to match KL-regularized rewards from bootstrapped data implicitly performs Thompson sampling with regret bounds O(sqrt(T gamma_T)).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POETS bypasses the nested process of training an uncertainty-aware reward model and separately fitting a policy to it. Instead, it directly trains a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online bootstrapped data. Using a shared pre-trained backbone with independent LoRA branches for diversity, the framework proves that this procedure implicitly conducts KL-regularized Thompson sampling and therefore inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, the same construction achieves state-of-the-art sample efficiency across protein search and quantum circuit design, and improves optimization trajectories in off-policy reinforcement-learning settings with experience replay and in small-dataset regimes.
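The claimed equivalence, one posterior draw per round acted on greedily, can be illustrated with a minimal bootstrapped-ensemble bandit sketch. This is not the paper's algorithm (POETS operates on LLM policies with LoRA branches); all sizes and names here are hypothetical, and the online Poisson bootstrap stands in for the paper's bootstrapped-data matching step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-armed Bernoulli bandit; the true means are unknown to the agent.
true_means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
K, M, T = len(true_means), 8, 2000  # arms, ensemble members, rounds

# Each ensemble member keeps its own statistics over a Poisson-weighted
# online bootstrap of the shared interaction history.
counts = np.zeros((M, K))
sums = np.zeros((M, K))

regret = 0.0
for t in range(T):
    member = rng.integers(M)  # sampling one member plays the role of one posterior draw
    means = np.where(counts[member] > 0,
                     sums[member] / np.maximum(counts[member], 1),
                     1.0)                     # optimistic value for unexplored arms
    arm = int(np.argmax(means))               # act greedily under the sampled member
    reward = float(rng.random() < true_means[arm])
    regret += true_means.max() - true_means[arm]
    w = rng.poisson(1.0, size=M)              # per-member bootstrap weights
    counts[:, arm] += w
    sums[:, arm] += w * reward

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Disagreement across members concentrates on under-sampled arms, so the random member draw explores exactly where the bootstrap posterior is wide, which is the Thompson-sampling mechanism the paper's regret analysis formalizes.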
What carries the argument
The compute-efficient policy ensemble that shares a pre-trained LLM backbone while using independent LoRA branches to maintain diversity and directly matches implicitly encoded reward functions to bootstrapped data.
If this is right
- The procedure inherits cumulative regret bounds of O(sqrt(T gamma_T)) from its equivalence to KL-regularized Thompson sampling.
- Direct ensemble training on bootstrapped data removes the need for a separate uncertainty-aware reward model and subsequent policy fitting.
- The shared-backbone plus LoRA architecture enables practical ensembling of large language models under memory and compute limits.
- Empirical results show state-of-the-art sample efficiency on protein search and quantum circuit design tasks.
- Optimization trajectories improve in reinforcement-learning settings, especially off-policy with experience replay or small datasets.
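The shared-backbone-plus-LoRA point in the third bullet can be made concrete with a toy linear sketch: one frozen weight matrix shared by all members, plus a per-member low-rank update. Shapes, rank, and the disagreement statistic below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: a frozen "backbone" matrix shared by every ensemble
# member, plus per-member low-rank (LoRA-style) updates A_i @ B_i.
d_in, d_out, rank, n_members = 16, 8, 2, 4

W_backbone = rng.standard_normal((d_out, d_in))           # shared, frozen
A = [rng.standard_normal((d_out, rank)) * 0.01 for _ in range(n_members)]
B = [rng.standard_normal((rank, d_in)) * 0.01 for _ in range(n_members)]

def member_forward(i, x):
    """Forward pass of ensemble member i: backbone plus its own LoRA branch."""
    return (W_backbone + A[i] @ B[i]) @ x

x = rng.standard_normal(d_in)
outputs = np.stack([member_forward(i, x) for i in range(n_members)])

# Cross-member disagreement is the ensemble's epistemic-uncertainty signal.
epistemic_std = outputs.std(axis=0)

# Memory argument: each extra member costs rank*(d_in + d_out) parameters
# instead of a full d_out*d_in copy of the backbone.
per_member = rank * (d_in + d_out)
full_copy = d_in * d_out
print(f"extra params per member: {per_member} vs full copy: {full_copy}")
```

The same arithmetic is what makes ensembling tractable at LLM scale: the backbone is stored once, and only the low-rank branches are replicated per member.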
Where Pith is reading between the lines
- The same LoRA-based ensembling pattern could be tested on other sequential decision problems that currently rely on explicit Bayesian uncertainty estimates.
- If the implicit-reward encoding holds across different regularization strengths, the framework offers a simpler alternative to full posterior sampling in large-model settings.
- Extending the bootstrap-matching step to non-stationary environments would require checking whether the regret bound still applies when the implicit reward function drifts.
Load-bearing premise
Policies trained with KL regularization implicitly encode an underlying reward function that an ensemble can match to bootstrapped data to capture epistemic uncertainty without an explicit reward model.
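This premise is checkable numerically in a toy discrete setting. The sketch below uses the standard single-coefficient KL objective, maximize E_pi[r] - beta * KL(pi || pi_ref), whose optimum has the closed form pi*(a) proportional to pi_ref(a) exp(r(a)/beta); the paper's version adds an extra alpha term, omitted here. Inverting the optimal policy recovers the reward up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.5                                   # KL-regularization strength

# Toy discrete action space: arbitrary reward and reference policy.
r = rng.standard_normal(6)
pi_ref = rng.dirichlet(np.ones(6))

# Closed-form optimum of E_pi[r] - beta * KL(pi || pi_ref):
#   pi*(a) ∝ pi_ref(a) * exp(r(a) / beta)
logits = np.log(pi_ref) + r / beta
pi_star = np.exp(logits - logits.max())      # stabilized softmax
pi_star /= pi_star.sum()

# Invert the policy to read the reward back out (up to beta * log Z):
r_implicit = beta * (np.log(pi_star) - np.log(pi_ref))
shift = (r - r_implicit).mean()              # the constant beta * log Z
assert np.allclose(r_implicit + shift, r, atol=1e-8)
print("recovered reward matches the true reward up to an additive constant")
```

The additive constant is exactly the log-partition term that appears in the paper's quoted formula, which is why matching policies to bootstrapped data can stand in for matching rewards.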
What would settle it
In a controlled multi-armed bandit setting, measure whether the action-selection distribution produced by the POETS ensemble deviates from the distribution of KL-regularized Thompson sampling or whether realized cumulative regret exceeds the O(sqrt(T gamma_T)) bound.
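The first half of this test can be sketched directly: in a Beta-Bernoulli bandit with a fixed history, compare the arm-choice distribution of exact posterior (Thompson) sampling against that of a bootstrap ensemble fit to the same history. The history counts and the smoothing below are hypothetical choices, and the bootstrap here is a stand-in for the POETS matching step, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed history for 3 Bernoulli arms, encoded as Beta posterior parameters.
alpha = np.array([3, 9, 5])   # successes + 1 per arm
beta_ = np.array([9, 3, 7])   # failures + 1 per arm

def choice_dist(sampler, n=20000):
    draws = sampler(n)                        # (n, K) sampled mean vectors
    picks = np.argmax(draws, axis=1)
    return np.bincount(picks, minlength=3) / n

# Exact Thompson sampling: draw mean vectors from the Beta posterior.
exact = choice_dist(lambda n: rng.beta(alpha, beta_, size=(n, 3)))

# Bootstrap approximation: resample the history, take smoothed empirical means.
succ, fail = alpha - 1, beta_ - 1
obs = succ + fail
def bootstrap(n):
    s = rng.binomial(obs, succ / obs, size=(n, 3))   # resampled success counts
    return (s + 1) / (obs + 2)                        # smoothed empirical means
approx = choice_dist(bootstrap)

tv = 0.5 * np.abs(exact - approx).sum()
print(f"total-variation gap between exact TS and bootstrap ensemble: {tv:.3f}")
```

A small total-variation gap supports the implicit-Thompson-sampling reading; a large or growing gap under the same protocol would be the deviation the referee asks to measure.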
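(Placeholder: this reference is to the defined term, not the figure viewer residue removed above.)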
Original abstract
Balancing exploration and exploitation is a core challenge in sequential decision-making and black-box optimization. We introduce POETS ($\textbf{Po}$licy $\textbf{E}$nsembles for $\textbf{T}$hompson $\textbf{S}$ampling), a novel framework that bridges uncertainty quantification and policy optimization. Our approach is grounded in the insight that policies trained with Kullback-Leibler (KL) regularization implicitly encode an underlying reward function. Building on this, POETS bypasses the complex, nested process of training an uncertainty-aware reward model and separately fitting a policy to this model. Instead, we directly train a policy ensemble to capture epistemic uncertainty by matching implicitly encoded reward functions to online, bootstrapped data. To overcome the prohibitive compute and memory constraints of ensembling Large Language Models (LLMs), POETS utilizes an efficient architecture: the ensemble shares a pre-trained backbone while maintaining diversity through independent Low-Rank Adaptation (LoRA) branches. Theoretically, we prove that POETS implicitly conducts KL-regularized Thompson sampling and thus inherits strong cumulative regret bounds of ${\mathcal O}(\sqrt{T \gamma_T})$. Empirically, we demonstrate that POETS achieves state-of-the-art sample efficiency across diverse scientific discovery domains, including protein search and quantum circuit design. Furthermore, it improves the optimization trajectories of reinforcement learning, proving particularly robust in off-policy settings with experience replay or in small dataset regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POETS, a framework for uncertainty-aware optimization using policy ensembles on LLMs. It claims that KL-regularized policies implicitly encode reward functions, enabling direct training of a LoRA-based ensemble (shared backbone, independent adapters) to capture epistemic uncertainty by matching to online bootstrapped data. Theoretically, it proves that this setup implicitly performs KL-regularized Thompson sampling and inherits cumulative regret bounds of O(sqrt(T gamma_T)). Empirically, it reports state-of-the-art sample efficiency in protein search, quantum circuit design, and RL optimization tasks, including robustness in off-policy and small-data regimes.
Significance. If the central theoretical equivalence holds, POETS would provide a practical advance by integrating uncertainty quantification into LLM policy optimization without nested reward-model training, while the LoRA ensemble architecture addresses compute constraints. The claimed regret bound and cross-domain empirical gains would be notable for sample-efficient black-box optimization if supported by explicit derivations and rigorous controls.
major comments (2)
- [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim that KL-regularized policy training implicitly encodes an underlying reward function, allowing the LoRA ensemble to perform posterior sampling over rewards, must explicitly construct the mapping from ensemble parameters to the reward posterior and show that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.
- [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.
minor comments (3)
- [Theoretical analysis section] Notation for gamma_T in the regret bound should be defined on first use and related to the specific function class or covering number used in the analysis.
- [Figures] Figure captions for optimization trajectories should state the number of trials and whether shaded regions represent standard error or min/max.
- [Discussion] The manuscript should add a limitations paragraph discussing failure modes when the implicit reward encoding assumption is violated (e.g., non-convex policy optimization or insufficient LoRA rank).
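On the first minor comment: in the kernelized-bandit literature that the O(sqrt(T gamma_T)) bound echoes, gamma_T conventionally denotes the maximal information gain after T rounds, with known growth rates per kernel class. The sketch below states that standard convention; whether POETS uses exactly this definition is for the authors to confirm.

```latex
\gamma_T \;:=\; \max_{A \subset \mathcal{X},\, |A| = T} I(\mathbf{y}_A;\, \mathbf{f}_A),
\qquad \text{with, e.g.,} \qquad
\gamma_T^{\text{linear}} = \mathcal{O}(d \log T),
\quad
\gamma_T^{\text{RBF}} = \mathcal{O}\!\big((\log T)^{\,d+1}\big).
```

Under either rate, O(sqrt(T gamma_T)) is sublinear in T, which is what makes the cumulative-regret claim meaningful.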
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We respond to each major comment in turn below.
Point-by-point responses
-
Referee: [Theoretical analysis section] Theoretical derivation (proof of implicit KL-regularized Thompson sampling): The central claim that KL-regularized policy training implicitly encodes an underlying reward function, allowing the LoRA ensemble to perform posterior sampling over rewards, must explicitly construct the mapping from ensemble parameters to the reward posterior and show that the online matching step preserves the KL-regularized objective. The abstract presents this as an 'insight' rather than a derived result; without this construction the inheritance of the O(sqrt(T gamma_T)) regret bound does not follow independently and risks being definitional.
Authors: We thank the referee for highlighting the need for explicitness in the theoretical derivation. The manuscript does provide a proof of the implicit KL-regularized Thompson sampling, but to address this concern directly, we will revise the Theoretical analysis section to include an explicit construction of the mapping from the ensemble parameters (including the shared backbone and independent LoRA adapters) to the reward posterior. We will also detail how the online matching to bootstrapped data preserves the KL-regularized objective, ensuring the regret bound follows rigorously rather than definitionally. The abstract will be updated to describe this as a derived result. revision: yes
-
Referee: [Experimental results section] Empirical evaluation (SOTA sample-efficiency claims): The reported gains in protein search and quantum circuit design must include explicit baselines, number of independent runs, error bars or statistical tests, and ablation of the bootstrapping procedure. If these controls are absent or the effect sizes are small relative to variance, the cross-domain superiority claim is not load-bearing.
Authors: We agree that additional experimental controls are necessary to substantiate the state-of-the-art sample-efficiency claims. In the revised manuscript, we will expand the Experimental results section to explicitly list all baselines, report the number of independent runs performed (currently 10 runs for each task), include error bars on all relevant figures, conduct appropriate statistical tests to assess significance, and provide an ablation study isolating the contribution of the bootstrapping procedure. These revisions will ensure the empirical results are robust and the superiority claims are well-supported. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The provided abstract and context present the key theoretical claim as an independent proof that POETS implicitly conducts KL-regularized Thompson sampling, inheriting O(sqrt(T gamma_T)) regret bounds from that equivalence. No equations, self-citations, or explicit reductions are available in the text to inspect for definitional equivalence (e.g., the training objective being restated as the sampling procedure by construction). The 'insight' about KL-regularized policies encoding rewards is framed as a grounding premise rather than a fitted or renamed input, and the derivation chain is not shown to collapse to its own inputs. This aligns with the default expectation that most papers maintain independent theoretical content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Policies trained with KL regularization implicitly encode an underlying reward function.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "policies optimized with Kullback-Leibler (KL) regularization inherently encode their underlying reward functions... r_π(a) := (β+α) log π(a) − β log π_ref(a) + (β+α) log Z"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "POETS implicitly conducts KL-regularized Thompson sampling... cumulative soft regret bound of O(√(T γ_T))"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Trunk & Branch architecture... independent Low-Rank Adaptation (LoRA) branches"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.