Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

Jeongmo Kim; Minung Kim; Seungyul Han; Yisak Park

arXiv: 2502.02834 · v3 · pith:OX4FO4XHnew · submitted 2025-02-05 · cs.LG · cs.AI

Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han

Pith reviewed 2026-05-23 03:25 UTC · model grok-4.3

Classification: cs.LG, cs.AI

Keywords: meta-reinforcement learning, out-of-distribution generalization, task-aware virtual training, metric-based representation learning, virtual tasks, state regularization, MuJoCo, MetaWorld

The pith

Task-Aware Virtual Training improves meta-RL generalization to out-of-distribution tasks via metric-based representations and virtual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Task-Aware Virtual Training (TAVT) to help meta-reinforcement learning policies generalize better to tasks outside the training distribution. Context-based methods often fail here because their task representations do not hold up for unseen cases. TAVT applies metric-based representation learning to capture task characteristics reliably, generates virtual tasks that keep those characteristics intact, and adds state regularization to limit overestimation errors when states change. Experiments across MuJoCo and MetaWorld benchmarks show stronger performance on OOD tasks than prior approaches. A reader would care because this could make RL agents more adaptable without needing full retraining for every new scenario.

Core claim

TAVT is a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. It successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments, resulting in significantly enhanced generalization to OOD tasks across MuJoCo and MetaWorld environments.

What carries the argument

Task-Aware Virtual Training (TAVT) algorithm, which uses metric-based representation learning to generate virtual tasks that preserve task features and applies state regularization to control value overestimation.

If this is right

Policies trained with TAVT achieve higher returns on unseen tasks in continuous control settings such as MuJoCo.
The approach reduces overestimation errors when environments have varying state distributions.
Task representations remain effective beyond the exact distribution used for meta-training.
Virtual task generation allows the method to maintain performance when real OOD samples are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The virtual-task construction step might transfer to meta-learning problems outside reinforcement learning, such as few-shot supervised tasks.
Pairing TAVT with model-based planning could further reduce the sample cost of adapting to new tasks.
Measuring representation quality directly on OOD tasks before policy training could serve as an early diagnostic for when the method will succeed.
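
The diagnostic suggested in the last point could be sketched as follows. This is a minimal illustration, not something the paper provides: it assumes each task has a known parameter vector (for example a goal position or target velocity) and checks how well pairwise latent distances track pairwise parameter distances. The function names and the toy numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def distance_alignment(latents, task_params):
    """Correlate pairwise latent distances with pairwise task-parameter distances."""
    latent_dists, param_dists = [], []
    for i in range(len(latents)):
        for j in range(i + 1, len(latents)):
            latent_dists.append(math.dist(latents[i], latents[j]))
            param_dists.append(math.dist(task_params[i], task_params[j]))
    return pearson(latent_dists, param_dists)

# Hypothetical 1-D tasks; the last parameter stands in for an OOD task.
task_params = [(0.5,), (1.0,), (2.0,), (3.5,)]
# Toy latents that are an exact scaled copy of the task parameter.
latents = [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (7.0, 0.0)]
alignment = distance_alignment(latents, task_params)
```

A score near 1 would suggest the latent geometry follows the task geometry, including for the held-out task; a low score would flag a representation that is unlikely to support OOD adaptation.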

Load-bearing premise

The assumption that metric-based representation learning can accurately capture and preserve task characteristics for out-of-distribution scenarios.

What would settle it

A set of experiments on additional OOD tasks where TAVT produces no measurable improvement in generalization performance or task representation fidelity compared with standard context-based meta-RL baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.02834 by Jeongmo Kim, Minung Kim, Seungyul Han, Yisak Park.

**Figure 1.** (a) 2D goal positions in the Ant-Goal environment: the blue shaded area indicates the training task distribution, with blue marks representing inner training tasks, green marks representing outer training tasks, and red marks denoting OOD test tasks. (b-e) t-SNE visualization of task latents for various context-based meta-RL methods.

**Figure 2.** Changes in latents $z_{\text{on}}$ of 4 randomly sampled training tasks in the Ant-Goal-OOD environment: (a) using $z_{\text{on}}$ only; (b) using the on-off latent loss for task representation learning. … representation compared to using $z_{\text{on}}$ alone. In summary, our encoder-decoder loss is defined as $\mathcal{L}_{\text{bisim}}(\psi, \phi) = \underbrace{\mathbb{E}_{T_i, T_j \sim p(\mathcal{T}_{\text{train}})}\big[\big(|z^i_{\text{off}} - z^j_{\text{off}}| - d(T_i, T_j; p_{\bar{\phi}})\big)^2\big]}_{\text{Bisimulation loss}} + \mathbb{E}_{(s,a,r,s') \sim D^{T_i}_{\text{off}},\ (\hat{r}, \hat{s}') \sim p_\phi(s, \ldots)}[\ldots]$ …
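
The bisimulation term visible in the loss excerpt above penalizes the squared gap between the distance of two task latents and the task distance d. A minimal sketch of that single term follows, under stated assumptions: the norm is taken to be Euclidean, the expectation is replaced by a mean over all task pairs, and `task_distance` is a hypothetical stand-in for the decoder-based metric, which is not reproduced here.

```python
import math
from itertools import combinations

def bisimulation_loss(latents, task_distance):
    """Mean over task pairs of (|z_i - z_j| - d(T_i, T_j))^2."""
    pairs = list(combinations(range(len(latents)), 2))
    total = 0.0
    for i, j in pairs:
        latent_gap = math.dist(latents[i], latents[j])  # assumed Euclidean norm
        total += (latent_gap - task_distance(i, j)) ** 2
    return total / len(pairs)

# Hypothetical 1-D goal tasks; d is simply the gap between goals.
goals = [0.0, 1.0, 3.0]

def goal_distance(i, j):
    return abs(goals[i] - goals[j])

aligned_latents = [(0.0,), (1.0,), (3.0,)]    # latent distances equal task distances
collapsed_latents = [(0.0,), (0.0,), (0.0,)]  # every task mapped to one point

aligned_loss = bisimulation_loss(aligned_latents, goal_distance)
collapsed_loss = bisimulation_loss(collapsed_latents, goal_distance)
```

In this toy case the aligned latents incur zero loss, while the collapsed encoder is penalized in proportion to the task distances it ignores.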

**Figure 4.** An illustration for the structure of TAVT input $z_{\text{off}}$, as described in Eq. (3). Both the task-preserving loss and VT construction are always based on $z^{\alpha}_{\text{off}}$, derived from $z_{\text{off}}$. Importantly, the off-policy task latent $z_{\text{off}}$ conditions sample context generation, with its gradient disconnected to ensure stable training. By leveraging WGAN, we significantly reduce differences between generated and real sample…
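
The excerpt above states that virtual tasks are built from a latent derived from the training-task latents, but this page does not give the construction itself. The sketch below assumes one simple possibility, a weighted mixture of training latents with weights that sum to 1. The weight-sampling scheme and all names are hypothetical and may differ from Eq. (3) of the paper.

```python
import random

def mix_latents(latents, alphas):
    """Return sum_k alpha_k * z_k for latents of equal dimension."""
    dim = len(latents[0])
    return tuple(
        sum(alpha * z[d] for alpha, z in zip(alphas, latents))
        for d in range(dim)
    )

def sample_mixing_weights(n, rng):
    """Hypothetical scheme: random positive weights normalized to sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# Toy off-policy latents for three training tasks.
training_latents = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0)]

fixed_virtual = mix_latents(training_latents, [0.5, 0.25, 0.25])

rng = random.Random(0)
weights = sample_mixing_weights(len(training_latents), rng)
random_virtual = mix_latents(training_latents, weights)
```

Convex weights keep the virtual latent inside the hull of the training latents; reaching beyond the training distribution would need weights outside [0, 1], which this sketch does not attempt.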

**Figure 5.** (a) Q-function loss for VTs; (b) estimation bias for OOD tasks in the Walker-Mass-OOD environment. The virtual contexts $\hat{c}^{\alpha}$ in TAVT include both rewards and next states, enabling it to handle state-varying environments. However, inaccuracies in the task decoder can introduce errors in the Q-function, leading to overestimation bias, a common issue in offline RL (Fujimoto et al., 2019). While reward errors h…

**Figure 6.** (a-f) MuJoCo environments; (g-h) ML1 environments. … In contrast, using $\epsilon_{\text{reg}} = 1.0$ (full use of $\hat{s}'^{\alpha}$) results in higher Q-function loss and greater bias. Section 5 further demonstrates that this method improves OOD performance. 5. Experiments: In this section, we compare the proposed TAVT algorithm with various on-policy and off-policy meta-RL methods across MuJoCo (Todorov et al., 2012) and MetaWorld ML1 (Yu …
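
The excerpt above says that a regularization coefficient of 1.0 corresponds to full use of the generated next state. One reading consistent with that statement is a linear blend of the real and generated next states, sketched below. The blend form is an assumption made for illustration, and the function name is hypothetical; the paper's exact rule is not reproduced on this page.

```python
def regularized_next_state(real_next, generated_next, eps_reg):
    """Blend a real next state with a decoder-generated one.

    eps_reg = 1.0 uses the generated state only; eps_reg = 0.0 uses the
    real state only. The linear form is an illustrative assumption.
    """
    if not 0.0 <= eps_reg <= 1.0:
        raise ValueError("eps_reg must lie in [0, 1]")
    return [
        eps_reg * generated + (1.0 - eps_reg) * real
        for real, generated in zip(real_next, generated_next)
    ]

real_next = [1.0, 2.0]
generated_next = [3.0, 6.0]  # toy decoder output with some error

full_generated = regularized_next_state(real_next, generated_next, 1.0)
half_blend = regularized_next_state(real_next, generated_next, 0.5)
real_only = regularized_next_state(real_next, generated_next, 0.0)
```

Smaller coefficients shrink the contribution of decoder error to the next state fed into the Q-function target, which matches the reported link between a coefficient of 1.0 and higher bias.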

**Figure 7.** Performance comparison for MuJoCo environments. The graphs for on-policy algorithms represent their final performance. • Hopper/Walker-Mass-OOD: The Hopper/Walker agents are required to run forward with their body mass multiplied by a scale $m_{\text{scale}} \in M$. • ML1/ML1-OOD: The agent is tasked with reaching or pushing an object to a target goal position $g_{\text{tar}}$, sampled from the 3D goal space $M$, which varies depending…

**Figure 8.** t-SNE visualization of task latents: (a) Cheetah-Vel-OOD (b) Walker-Mass-OOD (c) Ant-Dir-4 (d) Reach-OOD-Inter. For ML1 tasks in the 3D goal space, we provide both a 3D representation for all tasks and a 2D representation for tasks in the selected cross-section. For MetaWorld environments, …

**Figure 9.** Component evaluation on (a) Cheetah-Vel-OOD and (b) Walker-Mass-OOD environments.

**Figure 10.** Context differences between real contexts and virtual contexts generated by the task decoder for OOD tasks: (a) Cheetah-Vel-OOD and (b) Walker-Mass-OOD environments.

**Figure 11.** Performance comparison for various $\epsilon_{\text{reg}}$ on (a) Walker-Mass-OOD and (b) Hopper-Mass-OOD environments. … omits the on-off loss, and ‘Recon only’, which uses only reconstruction loss with DropOut as in LDM. The results demonstrate that removing any component significantly degrades performance, underscoring the importance of task representation learning and sample generation for OOD task generalization. In additio…

Original abstract

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Task-Aware Virtual Training (TAVT), a meta-RL algorithm that employs metric-based representation learning to capture and preserve task characteristics for both training and out-of-distribution (OOD) tasks, combined with a state regularization technique to address overestimation errors. It evaluates the approach on MuJoCo and MetaWorld environments and claims that TAVT yields significant improvements in generalization to OOD tasks relative to prior context-based meta-RL methods. Code is released at the provided GitHub link.

Significance. If the reported numerical gains hold under rigorous statistical scrutiny, TAVT would constitute a practical incremental advance in addressing OOD generalization, a recognized limitation of context-based meta-RL. The explicit release of code is a clear strength that facilitates reproducibility and follow-on work.

minor comments (3)

The abstract asserts 'significant' numerical improvements yet supplies no concrete metrics, baselines, or statistical details; the experimental section should include these with error bars and significance tests to support the central claim.
The precise definition of OOD tasks (e.g., how task parameters are shifted relative to the training distribution) should be stated explicitly, ideally with a table or equation, so that the metric-learning objective's claimed preservation of task characteristics can be evaluated.
The state-regularization term is described at a high level; a short derivation or pseudocode showing how it interacts with the metric embedding loss would improve clarity without altering the method.
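
The first comment asks for error bars and significance tests. A minimal sketch of what that reporting could look like is given below, assuming per-seed final returns are available for each method. The numbers are placeholders chosen for easy arithmetic, not results from the paper, and the method names are hypothetical.

```python
import math
import statistics

def summarize_returns(returns):
    """Mean and standard error of the mean across seeds."""
    mean = statistics.mean(returns)
    sem = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, sem

def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error

# Placeholder per-seed returns (illustrative only).
method_a_returns = [10.0, 12.0, 14.0]
method_b_returns = [7.0, 8.0, 9.0]

mean_a, sem_a = summarize_returns(method_a_returns)
t_stat = welch_t_statistic(method_a_returns, method_b_returns)
```

Turning the t statistic into a p-value requires the Welch-Satterthwaite degrees of freedom and a t distribution, which a statistics library would normally supply; with only a few seeds, the interval width matters more than the point estimate.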

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the code release, and recommendation for minor revision. The referee's description of TAVT is accurate. No specific major comments appear in the report, so we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity; empirical algorithm with no derivations

full rationale

The paper presents TAVT as an empirical algorithm for meta-RL, validated solely by numerical results on MuJoCo and MetaWorld benchmarks. No equations, derivations, or mathematical claims appear in the abstract or method sketch. The reader's assessment confirms absence of reductions to fitted parameters or self-referential definitions. Central claims rest on experimental outcomes rather than any chain that collapses to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters, axioms, or new entities; insufficient information available.

pith-pipeline@v0.9.0 · 5668 in / 1048 out tokens · 33969 ms · 2026-05-23T03:25:25.416222+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Definition 4.1 (Bisimulation metric for task representation) ... $d(T_i, T_j) = \mathbb{E}\big[\,|R_{T_i} - R_{T_j}| + \eta\, W_2(P_{T_i}, P_{T_j})\,\big]$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

$\mathcal{L}_{\text{bisim}}(\psi, \phi) = \ldots$ Bisimulation loss + Reconstruction loss + on-off latent loss
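
The task metric in Definition 4.1, quoted above, combines a reward gap with a 2-Wasserstein distance between transition distributions, weighted by η. The sketch below evaluates it on toy tasks under strong assumptions: transitions are 1-D Gaussians, for which the 2-Wasserstein distance has the closed form sqrt((μ1 − μ2)² + (σ1 − σ2)²), and the expectation is a plain average over a fixed list of state-action samples. The task constructor and all parameter names are hypothetical.

```python
import math

def w2_gaussian_1d(mean_a, std_a, mean_b, std_b):
    """Closed-form 2-Wasserstein distance between two 1-D Gaussians."""
    return math.hypot(mean_a - mean_b, std_a - std_b)

def make_task(target, gain, noise_std):
    """Hypothetical task: reward favors actions near `target`;
    the next state is Gaussian with mean s + gain * a."""
    return {
        "reward": lambda s, a: -abs(a - target),
        "next_state": lambda s, a: (s + gain * a, noise_std),
    }

def task_distance(task_i, task_j, samples, eta):
    """Average over (s, a) samples of |R_i - R_j| + eta * W2(P_i, P_j)."""
    total = 0.0
    for s, a in samples:
        reward_gap = abs(task_i["reward"](s, a) - task_j["reward"](s, a))
        mean_i, std_i = task_i["next_state"](s, a)
        mean_j, std_j = task_j["next_state"](s, a)
        total += reward_gap + eta * w2_gaussian_1d(mean_i, std_i, mean_j, std_j)
    return total / len(samples)

samples = [(0.0, 0.0), (0.0, 2.0)]
task_a = make_task(target=1.0, gain=1.0, noise_std=0.1)
task_b = make_task(target=3.0, gain=1.0, noise_std=0.1)  # differs in reward only
task_c = make_task(target=1.0, gain=2.0, noise_std=0.1)  # differs in dynamics only

reward_only_distance = task_distance(task_a, task_b, samples, eta=0.5)
dynamics_only_distance = task_distance(task_a, task_c, samples, eta=0.5)
```

The two toy comparisons isolate the terms: tasks A and B differ only through the reward gap, while A and C differ only through the transition term scaled by η.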

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
cs.LG 2025-10 unverdicted novelty 5.0

The work establishes OOD generalization bounds for meta-supervised learning and meta-RL that exploit MDP structure, then analyzes a gradient-based meta-RL algorithm.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Agarwal, R., Machado, M. C., Castro, P. S., and Bellemare, M. G. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021.
[2] Ajay, A., Gupta, A., Ghosh, D., Levine, S., and Agrawal, P. Distributionally adaptive meta reinforcement learning. Advances in Neural Information Processing Systems, 35:25856-25869, 2022.
[3] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214-223. PMLR, 2017.
[4] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[5] Cheng, Y., Feng, S., Yang, J., Zhang, H., and Liang, Y. Provable benefit of multitask representation learning in reinforcement learning. Advances in Neural Information Processing Systems, 35:31741-31754, 2022.
[6] Choshen, E. and Tamar, A. ContraBAR: Contrastive Bayes-adaptive deep RL. In International Conference on Machine Learning, pp. 6005-6027. PMLR, 2023.
[7] Dorfman, R., Shenfeld, I., and Tamar, A. Offline meta reinforcement learning - identifiability challenges and effective data collection strategies. Advances in Neural Information Processing Systems, 34:4607-4618, 2021.
[8] Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[9] Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
[10] Fakoor, R., Chaudhari, P., Soatto, S., and Smola, A. J. Meta-Q-learning. arXiv preprint arXiv:1910.00125, 2019.
[11] Ferns, N. and Precup, D. Bisimulation metrics are optimal value functions. In UAI, pp. 210-219, 2014.
[12] Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous Markov decision processes. SIAM Journal on Computing, 40(6):1662-1714, 2011.
[13] Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126-1135. PMLR, 2017.
[14] Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
[15] Fu, H., Tang, H., Hao, J., Chen, C., Feng, X., Li, D., and Liu, W. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7457-7465, 2021.
[16] Fu, H., Yu, S., Tiwari, S., Littman, M., and Konidaris, G. Meta-learning parameterized skills. arXiv preprint arXiv:2206.03597, 2022.
[17] Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062. PMLR, 2019.
[18] Gao, Y., Zhang, R., Guo, J., Wu, F., Yi, Q., Peng, S., Lan, S., Chen, R., Du, Z., Hu, X., et al. Context shift reduction for offline meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.
[19] Greenberg, I., Mannor, S., Chechik, G., and Meirom, E. Train hard, fight easy: Robust meta reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.
[20] Grigsby, J., Fan, L., and Zhu, Y. AMAGO: Scalable in-context reinforcement learning for adaptive agents. In The Twelfth International Conference on Learning Representations, 2024.
[21] Guan, C., Xue, R., Zhang, Z., Li, L., Li, Y.-C., Yuan, L., and Yu, Y. Cost-aware offline safe meta reinforcement learning with robust in-distribution online task adaptation. In AAMAS, pp. 743-751, 2024.
[22] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 30, 2017.
[23] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018.
[24] Hansen-Estruch, P., Zhang, A., Nair, A., Yin, P., and Levine, S. Bisimulation makes analogies in goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp. 8407-8426. PMLR, 2022.
[25] Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. Advances in Neural Information Processing Systems, 33:17571-17581, 2020.
[26] He, H., Zhu, A., Liang, S., Chen, F., and Shao, J. Decoupling meta-reinforcement learning with Gaussian task contexts and skills. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12358-12366, 2024.
[27] Hejna III, D. J. and Sadigh, D. Few-shot preference learning for human-in-the-loop RL. In Conference on Robot Learning, pp. 2014-2025. PMLR, 2023.
[28] Ishfaq, H., Nguyen-Tang, T., Feng, S., Arora, R., Wang, M., Yin, M., and Precup, D. Offline multitask representation learning for reinforcement learning. arXiv preprint arXiv:2403.11574, 2024.
[29] Jiang, Y., Kan, N., Li, C., Dai, W., Zou, J., and Xiong, H. Doubly robust augmented transfer for meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.
[30] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[31] Lan, L., Li, Z., Guan, X., and Wang, P. Meta reinforcement learning with task embedding and shared policy. arXiv preprint arXiv:1905.06527, 2019.
[32] Laskin, M., Srinivas, A., and Abbeel, P. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pp. 5639-5650. PMLR, 2020.
[33] Lee, S. and Chung, S.-Y. Improving generalization in meta-RL with imaginary tasks from latent dynamics mixture. Advances in Neural Information Processing Systems, 34:27222-27235, 2021.
[34] Lee, S., Cho, M., and Sung, Y. Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition. Advances in Neural Information Processing Systems, 36:43356-43383, 2023.
[35] Li, J., Vuong, Q., Liu, S., Liu, M., Ciosek, K., Christensen, H., and Su, H. Multi-task batch reinforcement learning with metric learning. Advances in Neural Information Processing Systems, 33:6197-6210, 2020a.
[36] Li, L., Yang, R., and Luo, D. FOCAL: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. arXiv preprint arXiv:2010.01112, 2020b.
[37] Lin, Z., Thomas, G., Yang, G., and Ma, T. Model-based adversarial meta-reinforcement learning. Advances in Neural Information Processing Systems, 33:10161-10173, 2020.
[38] Liu, Q., Zhou, Q., Yang, R., and Wang, J. Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8843-8851, 2023.
[39] Mehta, B., Deleu, T., Raparthy, S. C., Pal, C. J., and Paull, L. Curriculum in gradient-based meta-reinforcement learning. arXiv preprint arXiv:2002.07956, 2020.
[40] Melo, L. C. Transformers are meta-reinforcement learners. In International Conference on Machine Learning, pp. 15340-15359. PMLR, 2022.
[41] Mendonca, R., Geng, X., Finn, C., and Levine, S. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv:2006.07178, 2020.
[42] Mu, Y., Zhuang, Y., Ni, F., Wang, B., Chen, J., Hao, J., and Luo, P. DOMINO: Decomposed mutual information optimization for generalized context in meta-reinforcement learning. Advances in Neural Information Processing Systems, 35:27563-27575, 2022.
[43] Nam, T., Sun, S.-H., Pertsch, K., Hwang, S. J., and Lim, J. J. Skill-based meta-reinforcement learning. arXiv preprint arXiv:2204.11828, 2022.
[44] Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[45] Packer, C., Abbeel, P., and Gonzalez, J. E. Hindsight task relabelling: Experience replay for sparse reward meta-RL. Advances in Neural Information Processing Systems, 34:2466-2477, 2021.
[46] Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pp. 5331-5340. PMLR, 2019.
[47] Ren, Z., Liu, A., Liang, Y., Peng, J., and Ma, J. Efficient meta reinforcement learning for preference-based fast adaptation. Advances in Neural Information Processing Systems, 35:15502-15515, 2022.
[48] Rimon, Z., Jurgenson, T., Krupnik, O., Adler, G., and Tamar, A. MAMBA: an effective world model approach for meta-reinforcement learning. arXiv preprint arXiv:2403.09859, 2024.
[49] Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767-9779. PMLR, 2021.
[50] Sodhani, S., Meier, F., Pineau, J., and Zhang, A. Block contextual MDPs for continual learning. In Learning for Dynamics and Control Conference, pp. 608-623. PMLR, 2022.
[51] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.
[52] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[53] Wan, M., Peng, J., and Gangwani, T. Hindsight foresight relabeling for meta-reinforcement learning. In International Conference on Learning Representations, 2021.
[54] Wang, L., Zhang, Y., Zhu, D., Coleman, S., and Kerr, D. Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks. IEEE Transactions on Cognitive and Developmental Systems, 16(2):681-691, 2023a.
[55] Wang, M., Bing, Z., Yao, X., Wang, S., Kai, H., Su, H., Yang, C., and Knoll, A. Meta-reinforcement learning based on self-supervised task representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10157-10165, 2023b.
[56] Wen, L., Tseng, E. H., Peng, H., and Zhang, S. Dream to adapt: Meta reinforcement learning by latent context imagination and MDP imagination. IEEE Robotics and Automation Letters, 9(11):9701-9708, 2024. doi:10.1109/LRA.2024.3417114.
[57] Xu, T., Li, Z., and Ren, Q. Meta-reinforcement learning robust to distributional shift via performing lifelong in-context learning. In International Conference on Machine Learning, 2024.
[58] Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094-1100. PMLR, 2020.
[59] Yuan, H. and Lu, Z. Robust task representations for offline meta-reinforcement learning via contrastive learning. In International Conference on Machine Learning, pp. 25747-25759. PMLR, 2022.
[60] Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021.
[61] Zhou, R., Gao, C.-X., Zhang, Z., and Yu, Y. Generalizable task representation learning for offline meta-reinforcement learning with data limitations. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 17132-17140, 2024.
[62] Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. arXiv preprint arXiv:1910.08348, 2019.
[63] Zou, Q., Zhao, X., Gao, B., Chen, S., Liu, Z., and Zhang, Z. Relabeling and policy distillation of hierarchical reinforcement learning. International Journal of Machine Learning and Cybernetics, pp. 1-17, 2024.

work page 2014

[12] [12]

Bisimulation metrics for continuous markov decision processes

Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40 0 (6): 0 1662--1714, 2011

work page 2011

[13] [13]

Model-agnostic meta-learning for fast adaptation of deep networks

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp.\ 1126--1135. PMLR, 2017

work page 2017

[14] [14]

Meta Learning Shared Hierarchies

Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

Towards effective context for meta-reinforcement learning: an approach based on contrastive learning

Fu, H., Tang, H., Hao, J., Chen, C., Feng, X., Li, D., and Liu, W. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 7457--7465, 2021

work page 2021

[16] [16]

Meta-learning parameterized skills

Fu, H., Yu, S., Tiwari, S., Littman, M., and Konidaris, G. Meta-learning parameterized skills. arXiv preprint arXiv:2206.03597, 2022

work page arXiv 2022

[17] [17]

Off-policy deep reinforcement learning without exploration

Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp.\ 2052--2062. PMLR, 2019

work page 2052

[18] [18]

Context shift reduction for offline meta-reinforcement learning

Gao, Y., Zhang, R., Guo, J., Wu, F., Yi, Q., Peng, S., Lan, S., Chen, R., Du, Z., Hu, X., et al. Context shift reduction for offline meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[19] [19]

Train hard, fight easy: Robust meta reinforcement learning

Greenberg, I., Mannor, S., Chechik, G., and Meirom, E. Train hard, fight easy: Robust meta reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[20] [20]

Amago: Scalable in-context reinforcement learning for adaptive agents

Grigsby, J., Fan, L., and Zhu, Y. Amago: Scalable in-context reinforcement learning for adaptive agents. In The Twelfth International Conference on Learning Representations, 2024

work page 2024

[21] [21]

Cost-aware offline safe meta reinforcement learning with robust in-distribution online task adaptation

Guan, C., Xue, R., Zhang, Z., Li, L., Li, Y.-C., Yuan, L., and Yu, Y. Cost-aware offline safe meta reinforcement learning with robust in-distribution online task adaptation. In AAMAS, pp.\ 743--751, 2024

work page 2024

[22] [22]

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017

work page 2017

[23] [23]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp.\ 1861--1870. PMLR, 2018

work page 2018

[24] [24]

Bisimulation makes analogies in goal-conditioned reinforcement learning

Hansen-Estruch, P., Zhang, A., Nair, A., Yin, P., and Levine, S. Bisimulation makes analogies in goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp.\ 8407--8426. PMLR, 2022

work page 2022

[25] [25]

Continuous meta-learning without tasks

Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. Advances in neural information processing systems, 33: 0 17571--17581, 2020

work page 2020

[26] [26]

Decoupling meta-reinforcement learning with gaussian task contexts and skills

He, H., Zhu, A., Liang, S., Chen, F., and Shao, J. Decoupling meta-reinforcement learning with gaussian task contexts and skills. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 12358--12366, 2024

work page 2024

[27] [27]

Hejna III, D. J. and Sadigh, D. Few-shot preference learning for human-in-the-loop rl. In Conference on Robot Learning, pp.\ 2014--2025. PMLR, 2023

work page 2014

[28] [28]

Offline multitask representation learning for reinforcement learning

Ishfaq, H., Nguyen-Tang, T., Feng, S., Arora, R., Wang, M., Yin, M., and Precup, D. Offline multitask representation learning for reinforcement learning. arXiv preprint arXiv:2403.11574, 2024

work page arXiv 2024

[29] [29]

Doubly robust augmented transfer for meta-reinforcement learning

Jiang, Y., Kan, N., Li, C., Dai, W., Zou, J., and Xiong, H. Doubly robust augmented transfer for meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

work page 2023

[30] [30]

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[31] [31]

Meta Reinforcement Learning with Task Embedding and Shared Policy

Lan, L., Li, Z., Guan, X., and Wang, P. Meta reinforcement learning with task embedding and shared policy. arXiv preprint arXiv:1905.06527, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[32] [32]

Curl: Contrastive unsupervised representations for reinforcement learning

Laskin, M., Srinivas, A., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In International conference on machine learning, pp.\ 5639--5650. PMLR, 2020

work page 2020

[33] [33]

and Chung, S.-Y

Lee, S. and Chung, S.-Y. Improving generalization in meta-rl with imaginary tasks from latent dynamics mixture. Advances in Neural Information Processing Systems, 34: 0 27222--27235, 2021

work page 2021

[34] [34]

Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition

Lee, S., Cho, M., and Sung, Y. Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition. Advances in Neural Information Processing Systems, 36: 0 43356--43383, 2023

work page 2023

[35] [35]

Multi-task batch reinforcement learning with metric learning

Li, J., Vuong, Q., Liu, S., Liu, M., Ciosek, K., Christensen, H., and Su, H. Multi-task batch reinforcement learning with metric learning. Advances in neural information processing systems, 33: 0 6197--6210, 2020 a

work page 2020

[36] [36]

Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization

Li, L., Yang, R., and Luo, D. Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. arXiv preprint arXiv:2010.01112, 2020 b

work page arXiv 2010

[37] [37]

Model-based adversarial meta-reinforcement learning

Lin, Z., Thomas, G., Yang, G., and Ma, T. Model-based adversarial meta-reinforcement learning. Advances in Neural Information Processing Systems, 33: 0 10161--10173, 2020

work page 2020

[38] [38]

Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions

Liu, Q., Zhou, Q., Yang, R., and Wang, J. Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 8843--8851, 2023

work page 2023

[39] [39]

C., Pal, C

Mehta, B., Deleu, T., Raparthy, S. C., Pal, C. J., and Paull, L. Curriculum in gradient-based meta-reinforcement learning. arXiv preprint arXiv:2002.07956, 2020

work page arXiv 2002

[40] [40]

Melo, L. C. Transformers are meta-reinforcement learners. In international conference on machine learning, pp.\ 15340--15359. PMLR, 2022

work page 2022

[41] [41]

Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling

Mendonca, R., Geng, X., Finn, C., and Levine, S. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv:2006.07178, 2020

work page arXiv 2006

[42] [42]

Domino: Decomposed mutual information optimization for generalized context in meta-reinforcement learning

Mu, Y., Zhuang, Y., Ni, F., Wang, B., Chen, J., Hao, J., and Luo, P. Domino: Decomposed mutual information optimization for generalized context in meta-reinforcement learning. Advances in Neural Information Processing Systems, 35: 0 27563--27575, 2022

work page 2022

[43] [43]

J., and Lim, J

Nam, T., Sun, S.-H., Pertsch, K., Hwang, S. J., and Lim, J. J. Skill-based meta-reinforcement learning. arXiv preprint arXiv:2204.11828, 2022

work page arXiv 2022

[44] [44]

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[45] [45]

Packer, C., Abbeel, P., and Gonzalez, J. E. Hindsight task relabelling: Experience replay for sparse reward meta-rl. Advances in neural information processing systems, 34: 0 2466--2477, 2021

work page 2021

[46] [46]

Efficient off-policy meta-reinforcement learning via probabilistic context variables

Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp.\ 5331--5340. PMLR, 2019

work page 2019

[47] [47]

Efficient meta reinforcement learning for preference-based fast adaptation

Ren, Z., Liu, A., Liang, Y., Peng, J., and Ma, J. Efficient meta reinforcement learning for preference-based fast adaptation. Advances in Neural Information Processing Systems, 35: 0 15502--15515, 2022

work page 2022

[48] [48]

Mamba: an effective world model approach for meta-reinforcement learning

Rimon, Z., Jurgenson, T., Krupnik, O., Adler, G., and Tamar, A. Mamba: an effective world model approach for meta-reinforcement learning. arXiv preprint arXiv:2403.09859, 2024

work page arXiv 2024

[49] [49]

Multi-task reinforcement learning with context-based representations

Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp.\ 9767--9779. PMLR, 2021

work page 2021

[50] [50]

Block contextual mdps for continual learning

Sodhani, S., Meier, F., Pineau, J., and Zhang, A. Block contextual mdps for continual learning. In Learning for Dynamics and Control Conference, pp.\ 608--623. PMLR, 2022

work page 2022

[51] [51]

Mujoco: A physics engine for model-based control

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp.\ 5026--5033. IEEE, 2012

work page 2012

[52] [52]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[53] [53]

Hindsight foresight relabeling for meta-reinforcement learning

Wan, M., Peng, J., and Gangwani, T. Hindsight foresight relabeling for meta-reinforcement learning. In International Conference on Learning Representations, 2021

work page 2021

[54] [54]

Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks

Wang, L., Zhang, Y., Zhu, D., Coleman, S., and Kerr, D. Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks. IEEE Transactions on Cognitive and Developmental Systems, 16 0 (2): 0 681--691, 2023 a

work page 2023

[55] [55]

Meta-reinforcement learning based on self-supervised task representation learning

Wang, M., Bing, Z., Yao, X., Wang, S., Kai, H., Su, H., Yang, C., and Knoll, A. Meta-reinforcement learning based on self-supervised task representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 10157--10165, 2023 b

work page 2023

[56] [56]

H., Peng, H., and Zhang, S

Wen, L., Tseng, E. H., Peng, H., and Zhang, S. Dream to adapt: Meta reinforcement learning by latent context imagination and mdp imagination. IEEE Robotics and Automation Letters, 9 0 (11): 0 9701--9708, 2024. doi:10.1109/LRA.2024.3417114

work page doi:10.1109/lra.2024.3417114 2024

[57] [57]

Meta-reinforcement learning robust to distributional shift via performing lifelong in-context learning

Xu, T., Li, Z., and Ren, Q. Meta-reinforcement learning robust to distributional shift via performing lifelong in-context learning. In International Conference on Machine Learning, 2024

work page 2024

[58] [58]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp.\ 1094--1100. PMLR, 2020

work page 2020

[59] [59]

and Lu, Z

Yuan, H. and Lu, Z. Robust task representations for offline meta-reinforcement learning via contrastive learning. In International Conference on Machine Learning, pp.\ 25747--25759. PMLR, 2022

work page 2022

[60] [60]

T., Calandra, R., Gal, Y., and Levine, S

Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021

work page 2021

[61] [61]

Generalizable task representation learning for offline meta-reinforcement learning with data limitations

Zhou, R., Gao, C.-X., Zhang, Z., and Yu, Y. Generalizable task representation learning for offline meta-reinforcement learning with data limitations. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.\ 17132--17140, 2024

work page 2024

[62] [62]

Varibad: A very good method for bayes-adaptive deep rl via meta-learning

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. arXiv preprint arXiv:1910.08348, 2019

work page arXiv 1910

[63] [63]

Relabeling and policy distillation of hierarchical reinforcement learning

Zou, Q., Zhao, X., Gao, B., Chen, S., Liu, Z., and Zhang, Z. Relabeling and policy distillation of hierarchical reinforcement learning. International Journal of Machine Learning and Cybernetics, pp.\ 1--17, 2024

work page 2024

[64] [64]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page