Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

Hanna Kurniawati; Jingyang You

arXiv: 2512.20974 · v3 · submitted 2025-12-24 · 💻 cs.LG · cs.AI · cs.RO

Jingyang You, Hanna Kurniawati

Pith reviewed 2026-05-16 20:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO

keywords Bayesian Reinforcement Learning · Meta-RL · Generalised Linear Models · Task Representations · Tractable Inference · MuJoCo · MetaWorld

The pith

GLiBRL replaces neural networks in deep Bayesian RL with generalised linear models and learnable basis functions to enable exact tractable inference over task parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GLiBRL to address a weakness of recent deep Bayesian RL methods: applying neural networks directly to joint data and task parameters forces variational approximations, which often yield indistinct task representations. GLiBRL instead equips generalised linear models with learnable basis functions, so that Bayesian inference over task parameters and noise is fully tractable and marginal likelihoods are exact. The resulting permutation-invariant updates integrate directly with standard on-policy and off-policy RL algorithms. The paper also derives a closed-form link between the L2 distance of learned task representations and empirical kernel-based correspondence between task samples, which the authors describe as, to their knowledge, the first such structural result for online deep BRL. On MuJoCo and MetaWorld the reported improvement over the state of the art is up to 1.8×.

Core claim

GLiBRL admits a closed-form relationship between the L2 distance of its task representations and empirical kernel-based correspondence between task samples, which is the first such structural result for online deep BRL, while delivering fully tractable Bayesian inference over task parameters and model noise together with exact marginal likelihood evaluation.
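To make the flavour of such a result concrete, the following is a minimal numerical check of how an L2 distance between linear-model task summaries can be rewritten using only kernel evaluations between samples. It is not the paper's theorem. It assumes, purely for illustration, that a task is summarised by the statistic m = Φᵀy and that the kernel is k(x, x') = φ(x)ᵀφ(x'). The feature map `phi`, its parameters `W` and `b`, and the helper names are hypothetical stand-ins for a learned basis.

```python
import numpy as np

def phi(x, W, b):
    """Hypothetical basis functions: a fixed random tanh feature map."""
    return np.tanh(x @ W + b)

def task_summary(x, y, W, b):
    """Illustrative task representation m = Phi^T y (an assumption, not the paper's definition)."""
    return phi(x, W, b).T @ y

def kernel(xa, xb, W, b):
    """Empirical kernel induced by the basis: k(x, x') = phi(x)^T phi(x')."""
    return phi(xa, W, b) @ phi(xb, W, b).T

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), rng.normal(size=8)
x1, y1 = rng.normal(size=(20, 3)), rng.normal(size=20)   # samples from task 1
x2, y2 = rng.normal(size=(15, 3)), rng.normal(size=15)   # samples from task 2

# Left-hand side: squared L2 distance between the two task summaries.
lhs = np.sum((task_summary(x1, y1, W, b) - task_summary(x2, y2, W, b)) ** 2)

# Right-hand side: the same quantity written with kernel evaluations only.
rhs = (y1 @ kernel(x1, x1, W, b) @ y1
       - 2.0 * y1 @ kernel(x1, x2, W, b) @ y2
       + y2 @ kernel(x2, x2, W, b) @ y2)
```

The two sides agree because expanding ‖Φ₁ᵀy₁ − Φ₂ᵀy₂‖² produces exactly the within-task and cross-task Gram matrices. The paper's actual representation and kernel may differ from this simplified choice.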

What carries the argument

Generalised linear models equipped with learnable basis functions that keep Bayesian updates over task parameters and noise fully tractable and permutation-invariant.
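A minimal sketch of why a model that is linear in its basis functions keeps inference tractable, under the textbook assumption of a Normal-Inverse-Gamma prior over weights and noise variance. The function `nig_posterior`, the tanh feature map and all parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def nig_posterior(Phi, y, m0, L0, a0, b0):
    """Conjugate Normal-Inverse-Gamma update for y = Phi w + noise.

    Prior:     w | s2 ~ N(m0, s2 * inv(L0)),  s2 ~ InvGamma(a0, b0).
    Posterior: w | s2 ~ N(mn, s2 * inv(Ln)),  s2 ~ InvGamma(an, bn).
    """
    Ln = L0 + Phi.T @ Phi
    mn = np.linalg.solve(Ln, L0 @ m0 + Phi.T @ y)
    an = a0 + 0.5 * len(y)
    bn = b0 + 0.5 * (y @ y + m0 @ L0 @ m0 - mn @ Ln @ mn)
    return mn, Ln, an, bn

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 2))
W, b = rng.normal(size=(2, 5)), rng.normal(size=5)   # hypothetical basis parameters
Phi = np.tanh(x @ W + b)
y = Phi @ rng.normal(size=5) + 0.1 * rng.normal(size=40)

m0, L0, a0, b0 = np.zeros(5), np.eye(5), 2.0, 1.0
batch = nig_posterior(Phi, y, m0, L0, a0, b0)

# Same data in shuffled order: the update depends on the data only through sums over samples.
perm = rng.permutation(40)
shuffled = nig_posterior(Phi[perm], y[perm], m0, L0, a0, b0)

# Two-stage online update: the posterior after the first half is the prior for the second half.
half = nig_posterior(Phi[:20], y[:20], m0, L0, a0, b0)
online = nig_posterior(Phi[20:], y[20:], *half)
```

In this sketch the batch, shuffled and two-stage posteriors coincide up to floating-point error, which is the sense in which exact conjugate updates are permutation-invariant and can be fed with data in any order.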

If this is right

Exact marginal likelihoods become available for learning transition and reward models without variational approximations.
Permutation-invariant inference allows direct combination with any on-policy or off-policy RL algorithm.
Task representations gain an interpretable geometric structure that relates L2 distances to kernel similarities.
Empirical gains up to 1.8 times appear on standard MuJoCo and MetaWorld suites.
The same linear-plus-basis construction can be reused across different model-based RL pipelines.
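As an illustration of the first point, the sketch below evaluates an exact log marginal likelihood for a linear-in-features Gaussian model. It makes a simplifying assumption that the paper does not: the noise precision `beta` and the prior precision `alpha` are known constants, whereas GLiBRL is described as also inferring the noise. The function names and the tanh basis are hypothetical.

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, beta):
    """Exact log evidence of y = Phi w + noise, with w ~ N(0, I/alpha) and noise ~ N(0, 1/beta)."""
    n, m = Phi.shape
    A = alpha * np.eye(m) + beta * Phi.T @ Phi      # posterior precision of w
    mean = beta * np.linalg.solve(A, Phi.T @ y)     # posterior mean of w
    resid = y - Phi @ mean
    fit = 0.5 * beta * resid @ resid + 0.5 * alpha * mean @ mean
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * m * np.log(alpha) + 0.5 * n * np.log(beta)
            - fit - 0.5 * logdet_A - 0.5 * n * np.log(2.0 * np.pi))

def log_gaussian_evidence(Phi, y, alpha, beta):
    """Same quantity computed in data space: log N(y | 0, I/beta + Phi Phi^T / alpha)."""
    n = len(y)
    C = np.eye(n) / beta + Phi @ Phi.T / alpha
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet_C + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 2))
W, b = rng.normal(size=(2, 6)), rng.normal(size=6)   # hypothetical basis parameters
Phi = np.tanh(x @ W + b)
y = Phi @ rng.normal(size=6) + 0.1 * rng.normal(size=30)

lml_feature = log_marginal_likelihood(Phi, y, alpha=1.0, beta=100.0)
lml_data = log_gaussian_evidence(Phi, y, alpha=1.0, beta=100.0)
```

Both routes give the same number. Because the feature-space form is a smooth function of `Phi`, the evidence could in principle serve as a training objective for the basis parameters without a variational bound. Whether GLiBRL uses exactly this objective is not stated in the text reproduced here.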

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The structural guarantee may let researchers analyse generalisation bounds in Meta-RL more directly than with black-box neural models.
Replacing full networks with these basis functions could lower sample complexity in other model-based settings where exact inference matters.
The approach suggests a route to hybrid architectures that keep most of a network but replace the final layers with linear models for tractability.

Load-bearing premise

That generalised linear models with learnable basis functions are expressive enough to capture the transition and reward structure of the MuJoCo and MetaWorld tasks without the flexibility of full neural networks.

What would settle it

An experiment showing that the claimed closed-form L2-to-kernel relationship fails to hold on held-out task samples, or that GLiBRL performance falls below strong variational deep BRL baselines on new continuous-control benchmarks.

Figures

Figures reproduced from arXiv: 2512.20974 by Hanna Kurniawati, Jingyang You.

**Figure 1.** IQM and 95% CI of the testing success rate of GLiBRL, VariBAD, MAML and RL$^2$ with respect to the number of training steps. Left: the ML10 benchmark; Right: the ML45 benchmark. We list the comparators as follows. First, we compare GLiBRL to standard deep BRL and Meta-RL baselines, including the deep BRL method VariBAD (Zintgraf et al., 2021), the PPG-based Meta-RL method MAML, and the black-box Meta-RL method RL$^2$. Follo…

**Figure 2.** IQM and 95% CI of errors in transition and reward predictions, comparing GLiBRL and GLiBRL w/o NI. Top: transitions; Bottom: rewards; Left: ML10; Right: ML45. … identical hyperparameters on both ML10 and ML45 benchmarks. The metrics evaluated are the L1 norms of the prediction errors in both transitions (defined as $|S' - C_T T_\mu|_1$) and rewards (defined as $|r - C_R R_\mu|_1$). The results are shown in …

**Figure 3.** IQM and 95% CI of the success rate for each testing scenario in ML10. The Shelf Place scenario is challenging: none of the methods achieves a single success.

**Figure 4.** IQM and 95% CI of the success rate for each testing scenario in ML45. GLiBRL achieves nearly 100% testing success rates in both the Door Lock and Door Unlock scenarios.
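The captions above report the interquartile mean (IQM) with 95% confidence intervals. As a rough sketch of what those quantities are, the code below computes an IQM and a plain percentile-bootstrap interval over runs. The function names and the scores are made up for illustration, and the simple resampling here is an assumption: the paper's evaluation may instead use a stratified bootstrap across tasks and a different treatment of fractional quartiles.

```python
import random
import statistics

def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    s = sorted(scores)
    k = len(s) // 4                      # number of scores dropped from each tail
    return statistics.fmean(s[k:len(s) - k])

def bootstrap_ci(scores, stat=iqm, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat`, resampling runs with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    draws = sorted(stat(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = draws[int((1 - level) / 2 * n_boot)]
    hi = draws[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Hypothetical per-seed success rates, not data from the paper.
runs = [0.10, 0.35, 0.40, 0.42, 0.45, 0.50, 0.55, 0.90]
point = iqm(runs)
low, high = bootstrap_ci(runs)
```

With eight runs the IQM drops the two lowest and two highest scores, so the outliers 0.10 and 0.90 do not move the point estimate. That robustness to outlying seeds is the usual reason for preferring IQM to the plain mean.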

Original abstract

Bayesian Reinforcement Learning (BRL), a subclass of Meta-Reinforcement Learning (Meta-RL), provides a principled framework for generalisation by explicitly incorporating Bayesian task parameters into transition and reward models. However, classical BRL methods assume known forms of transition and reward models. While recent deep BRL methods incorporate model learning to address this, applying neural networks directly to joint data and task parameters necessitates variational inference. This often yields indistinct task representations, compromising the resulting BRL policies. To overcome these limitations, we introduce Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL). Our approach features fully tractable Bayesian inference over task parameters and model noise, alongside exact marginal likelihood evaluation for learning transition and reward models. The permutation-invariant nature of exact Bayesian inference in GLiBRL enables seamless integration with both on-policy and off-policy RL algorithms. We further show that GLiBRL admits a closed-form relationship between the $\mathcal{L}_2$ distance of its task representations and empirical kernel-based correspondence between task samples, which is to our knowledge the first such structural result for online deep BRL. GLiBRL is compared against representative and recent Meta-RL methods, and improves state-of-the-art performance on both MuJoCo and MetaWorld benchmarks by up to 1.8$\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GLiBRL swaps neural nets for GLMs with learnable bases to enable exact Bayesian inference in deep RL and claims a closed-form link between task L2 distances and kernel similarities, but the abstract leaves the derivations and experiment details too thin to verify the gains.

The punchline is that GLiBRL swaps neural nets for generalised linear models with learnable bases to get exact Bayesian updates in deep RL. This avoids variational inference and gives a closed-form relationship between L2 distances of task representations and kernel correspondences, which the authors claim is new. The approach also integrates easily with standard RL algorithms because its inference is permutation-invariant. The authors do a good job of laying out how exact marginal likelihoods work for learning the models and how this leads to better task representations than the usual deep BRL methods. The benchmark improvements on MuJoCo and MetaWorld are presented clearly as up to 1.8 times better than recent Meta-RL baselines.

The main soft spot is expressiveness. Because the model is linear in the learned features, it may not capture the full nonlinear dynamics of those environments unless the bases span everything needed. The abstract includes neither derivations for the structural result nor details on ablations and error bars, so the evidence for the claims cannot yet be verified. This ties in with the concern that the gains might not stem from the Bayesian part if the model class is too restrictive.

This paper is for people in Meta-RL and Bayesian RL who want more principled and tractable inference. A reader focused on structural results or exact methods would get value from working through it. It deserves serious peer review because the core idea is distinct and the claims are concrete enough to check.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces GLiBRL, a deep Bayesian RL method that replaces neural networks with generalised linear models equipped with learnable basis functions for transition and reward modeling. It claims fully tractable Bayesian inference over task parameters and noise, exact marginal likelihood evaluation, permutation invariance enabling on- and off-policy integration, a novel closed-form relationship between the L2 distance of task representations and empirical kernel-based task correspondences, and up to 1.8× performance gains over prior Meta-RL methods on MuJoCo and MetaWorld benchmarks.

Significance. If the closed-form structural result is rigorously derived and the model class proves expressive enough for the benchmark dynamics, the work would supply the first explicit structural property for online deep BRL together with practical performance improvements. The tractable inference and exact marginal likelihood are clear technical strengths.

major comments (2)

[§3] §3 (Model formulation): the assumption that GLMs with learnable bases suffice to represent the nonlinear transition and reward structure of MuJoCo contact/friction dynamics and MetaWorld tasks is load-bearing for both the performance claims and the attribution of gains to the Bayesian component; no capacity analysis, approximation-error bounds, or ablation against full neural-network baselines is supplied to support this.
[§4] §4 (Structural result): the closed-form relationship between L2 distance of task representations and empirical kernel correspondence is asserted as a non-trivial property of the model; without the explicit derivation (including the precise definition of the kernel and the steps showing independence from fitted parameters), it is impossible to confirm whether the relation is structural or reduces by construction to the GLM definition.

minor comments (2)

[Abstract and §5] Abstract and experimental section: error bars, number of seeds, and data-exclusion rules are absent, preventing verification of the reported 1.8× gains.
[§3] Notation: the precise form of the learnable basis functions and how they are optimised jointly with the Bayesian posterior is not stated clearly enough for reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important aspects of the model assumptions and the structural result. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [§3] §3 (Model formulation): the assumption that GLMs with learnable bases suffice to represent the nonlinear transition and reward structure of MuJoCo contact/friction dynamics and MetaWorld tasks is load-bearing for both the performance claims and the attribution of gains to the Bayesian component; no capacity analysis, approximation-error bounds, or ablation against full neural-network baselines is supplied to support this.

Authors: We agree that the expressiveness of the GLM with learnable basis functions is central to attributing performance gains to the Bayesian inference component rather than model capacity alone. The manuscript demonstrates competitive results on MuJoCo and MetaWorld, which involve nonlinear contact and friction dynamics, but we acknowledge the absence of explicit capacity analysis or ablations against full neural-network baselines. In the revised version we will add an ablation study comparing GLiBRL against neural-network-based deep BRL variants on the same benchmarks, along with a discussion of how the learnable bases provide sufficient flexibility for the evaluated tasks. Rigorous approximation-error bounds for arbitrary nonlinear dynamics are technically challenging and may lie outside the paper's primary scope; however, we will include a brief analysis of the basis-function capacity in the context of the reported tasks. revision: partial
Referee: [§4] §4 (Structural result): the closed-form relationship between L2 distance of task representations and empirical kernel correspondence is asserted as a non-trivial property of the model; without the explicit derivation (including the precise definition of the kernel and the steps showing independence from fitted parameters), it is impossible to confirm whether the relation is structural or reduces by construction to the GLM definition.

Authors: We appreciate this observation. The closed-form link between the L2 distance of task representations and the empirical kernel-based task correspondence is derived directly from the GLM structure and the definition of the kernel induced by the learnable bases. In the revised manuscript we will insert the complete derivation in §4, beginning with the precise definition of the kernel, followed by the algebraic steps that establish independence from the fitted parameters. This will make explicit that the relationship is a structural property of the model class rather than an artifact of a particular fitting procedure. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The paper derives a closed-form L2-to-kernel relationship for task representations directly from the GLiBRL model definition using generalised linear models with learnable bases and exact Bayesian inference. This structural result is presented as following from the model's permutation-invariant exact marginal likelihood and tractable posterior, without reducing to a fitted parameter or self-citation chain. No equations in the provided text exhibit a self-definitional loop or renaming of inputs as predictions. Empirical performance gains on MuJoCo/MetaWorld are reported separately from the structural claim and do not rely on the closed-form result for justification. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that GLMs suffice for the benchmark dynamics and on the introduction of learnable basis-function parameters that are fitted to data.

free parameters (1)

learnable basis function parameters
These parameters are optimised to fit transition and reward models and are therefore free parameters of the method.

axioms (1)

domain assumption: Generalised linear models with learnable basis functions can represent the transition and reward functions of the target RL tasks
This assumption is required for the tractable inference and exact marginal likelihood claims to hold.

pith-pipeline@v0.9.0 · 5543 in / 1346 out tokens · 40498 ms · 2026-05-16T20:26:14.258619+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 3 internal anchors

[1] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34:29304-29320, 2021

[2] Jacob Beck, Matthew Thomas Jackson, Risto Vuorio, and Shimon Whiteson. Hypernetworks in meta-reinforcement learning. In Conference on Robot Learning, pp. 1478-1487. PMLR, 2023a

[3] Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson. A survey of meta-reinforcement learning. arXiv preprint arXiv:2301.08028, 2023b

[4] Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp. 10-21, 2016

[5] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

[6] Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. In International conference on machine learning, pp. 1078-1086. PMLR, 2018

[7] Bin Dai, Ziyu Wang, and David Wipf. The usual suspects? Reassessing blame for VAE posterior collapse. In International conference on machine learning, pp. 2313-2322. PMLR, 2020

[8] Finale Doshi-Velez and George Konidaris. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference, volume 2016, pp. 1432, 2016

[9] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016

[10] Michael O'Gordon Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. University of Massachusetts Amherst, 2002

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126-1135. PMLR, 2017

[12] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018

[13] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359-483, 2015

[14] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48:841-883, 2013

[15] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861-1870. PMLR, 2018

[16] James Harrison, Apoorva Sharma, Roberto Calandra, and Marco Pavone. Control adaptation via meta-learning dynamics. In Workshop on Meta-Learning at NeurIPS, volume 2018, 2018a

[17] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online Bayesian regression. In International Workshop on the Algorithmic Foundations of Robotics, pp. 318-337. Springer, 2018b

[18] Taylor W Killian, Samuel Daulton, George Konidaris, and Finale Doshi-Velez. Robust and efficient transfer learning with hidden parameter Markov decision processes. Advances in neural information processing systems, 30, 2017

[19] Suyoung Lee, Myungsik Cho, and Youngchul Sung. Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition. Advances in neural information processing systems, 36:43356-43383, 2023

[20] Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, KR Zentner, Ryan Julian, JK Terry, Isaac Woungang, et al. Meta-world+: An improved, standardized, RL benchmark. arXiv preprint arXiv:2505.11289, 2025

[21] Luckeciano C Melo. Transformers are meta-reinforcement learners. In International conference on machine learning, pp. 15340-15359. PMLR, 2022

[22] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. Advances in neural information processing systems, 26, 2013

[23] Christian Perez, Felipe Petroski Such, and Theofanis Karaletsos. Generalized hidden parameter MDPs: Transferable model-based RL in a handful of trials. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5403-5411, 2020

[24] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 697-704, 2006

[25] Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331-5340. PMLR, 2019

[26] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889-1897. PMLR, 2015

[27] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[28] Gresa Shala, André Biedenkapp, Pierre Krack, Florian Walter, and Josif Grabocka. Efficient cross-episode meta-RL. In The Thirteenth International Conference on Learning Representations, 2025

[29] Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, volume 2000, pp. 943-950, 2000

[30] Nikolaos Tziortziotis, Christos Dimitrakakis, and Konstantinos Blekas. Linear Bayesian reinforcement learning. In IJCAI, pp. 1721-1728, 2013

[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[32] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

[33] Jiachen Yang, Brenden Petersen, Hongyuan Zha, and Daniel Faissol. Single episode policy transfer in reinforcement learning. arXiv preprint arXiv:1910.07719, 2019

[34] Jiayu Yao, Taylor Killian, George Konidaris, and Finale Doshi-Velez. Direct policy transfer via hidden parameter Markov decision processes. In LLARLA Workshop, FAIM, volume 2018, 2018

[35] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018

[36] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. URL https://arxiv.org/abs/1910.10897

[37] Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. VariBAD: Variational Bayes-adaptive deep RL via meta-learning. Journal of Machine Learning Research, 22(289):1-39, 2021

[1] [1]

Deep reinforcement learning at the edge of the statistical precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34: 0 29304--29320, 2021

work page 2021

[2] [2]

Hypernetworks in meta-reinforcement learning

Jacob Beck, Matthew Thomas Jackson, Risto Vuorio, and Shimon Whiteson. Hypernetworks in meta-reinforcement learning. In Conference on Robot Learning, pp.\ 1478--1487. PMLR, 2023 a

work page 2023

[3] [3]

A survey of meta-reinforcement learning.arXiv preprint arXiv:2301.08028, 2023

Jacob Beck, Risto Vuorio, Evan Zheran Liu, Zheng Xiong, Luisa Zintgraf, Chelsea Finn, and Shimon Whiteson. A survey of meta-reinforcement learning. arXiv preprint arXiv:2301.08028, 2023 b

work page arXiv 2023

[4] [4]

Generating sentences from a continuous space

Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp.\ 10--21, 2016

work page 2016

[5] [5]

JAX : composable transformations of P ython+ N um P y programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander P las, Skye Wanderman- M ilne, and Qiao Zhang. JAX : composable transformations of P ython+ N um P y programs, 2018. URL http://github.com/jax-ml/jax

work page 2018

[6] [6]

Inference suboptimality in variational autoencoders

Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. In International conference on machine learning, pp.\ 1078--1086. PMLR, 2018

work page 2018

[7] [7]

The usual suspects? reassessing blame for vae posterior collapse

Bin Dai, Ziyu Wang, and David Wipf. The usual suspects? reassessing blame for vae posterior collapse. In International conference on machine learning, pp.\ 2313--2322. PMLR, 2020

work page 2020

[8] [8]

Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations

Finale Doshi-Velez and George Konidaris. Hidden parameter markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference, volume 2016, pp.\ 1432, 2016

work page 2016

[9] [9]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl ^2 : Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes

Michael O'Gordon Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. University of Massachusetts Amherst, 2002

work page 2002

[11] [11]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017

work page 2017

[12] [12]

Probabilistic model-agnostic meta-learning

Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018

work page 2018

[13] [13]

Bayesian reinforcement learning: A survey

Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning , 8 0 (5-6): 0 359--483, 2015

work page 2015

[14] [14]

Scalable and efficient bayes-adaptive reinforcement learning based on monte-carlo tree search

Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient bayes-adaptive reinforcement learning based on monte-carlo tree search. Journal of Artificial Intelligence Research, 48:841--883, 2013

work page 2013

[15] [15]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861--1870. PMLR, 2018

work page 2018

[16] [16]

Control adaptation via meta-learning dynamics

James Harrison, Apoorva Sharma, Roberto Calandra, and Marco Pavone. Control adaptation via meta-learning dynamics. In Workshop on Meta-Learning at NeurIPS, volume 2018, 2018a

work page 2018

[17] [17]

Meta-learning priors for efficient online bayesian regression

James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online bayesian regression. In International Workshop on the Algorithmic Foundations of Robotics, pp. 318--337. Springer, 2018b

work page 2018

[18] [18]

Robust and efficient transfer learning with hidden parameter markov decision processes

Taylor W Killian, Samuel Daulton, George Konidaris, and Finale Doshi-Velez. Robust and efficient transfer learning with hidden parameter markov decision processes. Advances in neural information processing systems, 30, 2017

work page 2017

[19] [19]

Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition

Suyoung Lee, Myungsik Cho, and Youngchul Sung. Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition. Advances in Neural Information Processing Systems, 36:43356--43383, 2023

work page 2023

[20] [20]

Meta-world+: An improved, standardized, rl benchmark

Reginald McLean, Evangelos Chatzaroulas, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, KR Zentner, Ryan Julian, JK Terry, Isaac Woungang, et al. Meta-world+: An improved, standardized, rl benchmark. arXiv preprint arXiv:2505.11289, 2025

work page arXiv 2025

[21] [21]

Transformers are meta-reinforcement learners

Luckeciano C Melo. Transformers are meta-reinforcement learners. In International conference on machine learning, pp. 15340--15359. PMLR, 2022

work page 2022

[22] [22]

(more) efficient reinforcement learning via posterior sampling

Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems, 26, 2013

work page 2013

[23] [23]

Generalized hidden parameter mdps: Transferable model-based rl in a handful of trials

Christian Perez, Felipe Petroski Such, and Theofanis Karaletsos. Generalized hidden parameter mdps: Transferable model-based rl in a handful of trials. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5403--5411, 2020

work page 2020

[24] [24]

An analytic solution to discrete bayesian reinforcement learning

Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete bayesian reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 697--704, 2006

work page 2006

[25] [25]

Efficient off-policy meta-reinforcement learning via probabilistic context variables

Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331--5340. PMLR, 2019

work page 2019

[26] [26]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889--1897. PMLR, 2015

work page 2015

[27] [27]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Efficient cross-episode meta-rl

Gresa Shala, André Biedenkapp, Pierre Krack, Florian Walter, and Josif Grabocka. Efficient cross-episode meta-rl. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025

[29] [29]

A bayesian framework for reinforcement learning

Malcolm Strens. A bayesian framework for reinforcement learning. In ICML, volume 2000, pp. 943--950, 2000

work page 2000

[30] [30]

Linear bayesian reinforcement learning

Nikolaos Tziortziotis, Christos Dimitrakakis, and Konstantinos Blekas. Linear bayesian reinforcement learning. In IJCAI, pp. 1721--1728, 2013

work page 2013

[31] [31]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[32] [32]

Learning to reinforcement learn

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[33] [33]

Single episode policy transfer in reinforcement learning

Jiachen Yang, Brenden Petersen, Hongyuan Zha, and Daniel Faissol. Single episode policy transfer in reinforcement learning. arXiv preprint arXiv:1910.07719, 2019

work page arXiv 2019

[34] [34]

Direct policy transfer via hidden parameter markov decision processes

Jiayu Yao, Taylor Killian, George Konidaris, and Finale Doshi-Velez. Direct policy transfer via hidden parameter markov decision processes. In LLARLA Workshop, FAIM, volume 2018, 2018

work page 2018

[35] [35]

Bayesian model-agnostic meta-learning

Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. Advances in neural information processing systems, 31, 2018

work page 2018

[36] [36]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. URL https://arxiv.org/abs/1910.10897

work page arXiv 2021

[37] [37]

Varibad: Variational bayes-adaptive deep rl via meta-learning

Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: Variational bayes-adaptive deep rl via meta-learning. Journal of Machine Learning Research, 22(289):1--39, 2021

work page 2021

[41] [41]

Results from SDVT, TrMRL and ECET are taken as reported
