arxiv: 2502.02834 · v3 · pith:OX4FO4XHnew · submitted 2025-02-05 · 💻 cs.LG · cs.AI

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

Pith reviewed 2026-05-23 03:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords meta-reinforcement learning · out-of-distribution generalization · task-aware virtual training · metric-based representation learning · virtual tasks · state regularization · MuJoCo · MetaWorld

The pith

Task-Aware Virtual Training improves meta-RL generalization to out-of-distribution tasks via metric-based representations and virtual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Task-Aware Virtual Training (TAVT) to help meta-reinforcement learning policies generalize better to tasks outside the training distribution. Context-based methods often fail here because their task representations do not hold up for unseen cases. TAVT applies metric-based representation learning to capture task characteristics reliably, generates virtual tasks that keep those characteristics intact, and adds state regularization to limit overestimation errors when states change. Experiments across MuJoCo and MetaWorld benchmarks show stronger performance on OOD tasks than prior approaches. A reader would care because this could make RL agents more adaptable without needing full retraining for every new scenario.

Core claim

TAVT is a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. It successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments, resulting in significantly enhanced generalization to OOD tasks across MuJoCo and MetaWorld environments.

What carries the argument

Task-Aware Virtual Training (TAVT) algorithm, which uses metric-based representation learning to generate virtual tasks that preserve task features and applies state regularization to control value overestimation.

If this is right

  • Policies trained with TAVT achieve higher returns on unseen tasks in continuous control settings such as MuJoCo.
  • The approach reduces overestimation errors when environments have varying state distributions.
  • Task representations remain effective beyond the exact distribution used for meta-training.
  • Virtual task generation allows the method to maintain performance when real OOD samples are scarce.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The virtual-task construction step might transfer to meta-learning problems outside reinforcement learning, such as few-shot supervised tasks.
  • Pairing TAVT with model-based planning could further reduce the sample cost of adapting to new tasks.
  • Measuring representation quality directly on OOD tasks before policy training could serve as an early diagnostic for when the method will succeed.

Load-bearing premise

The assumption that metric-based representation learning can accurately capture and preserve task characteristics for out-of-distribution scenarios.

What would settle it

A set of experiments on additional OOD tasks where TAVT produces no measurable improvement in generalization performance or task representation fidelity compared with standard context-based meta-RL baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2502.02834 by Jeongmo Kim, Minung Kim, Seungyul Han, Yisak Park.

Figure 1
Figure 1: (a) 2D goal positions in the Ant-Goal environment: the blue shaded area indicates the training task distribution, with blue marks representing inner training tasks, green marks representing outer training tasks, and red marks denoting OOD test tasks. (b-e) t-SNE visualization of task latents for various context-based meta-RL methods. … both rewards and next states, and we propose a state regularization meth…
Figure 2
Figure 2: Changes in latents $z_{\mathrm{on}}$ of 4 randomly sampled training tasks in the Ant-Goal-OOD environment: (a) using $z_{\mathrm{on}}$ only; (b) using the on-off latent loss for task representation learning. … representation compared to using $z_{\mathrm{on}}$ alone. In summary, our encoder-decoder loss is defined as $\mathcal{L}_{\mathrm{bisim}}(\psi, \phi) = \mathbb{E}_{T_i, T_j \sim p(T_{\mathrm{train}})} \underbrace{\big[\, |z^i_{\mathrm{off}} - z^j_{\mathrm{off}}| - d(T_i, T_j; p_{\bar{\phi}}) \,\big]^2}_{\text{Bisimulation loss}} + \mathbb{E}_{(s,a,r,s') \sim D^{T_i}_{\mathrm{off}},\, (\hat{r}, \hat{s}') \sim p_\phi(s, \ldots}$ …
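To make the bisimulation term quoted in the Figure 2 excerpt concrete, here is a minimal NumPy sketch. It assumes the latent distance is an L1 norm and that the task distance d(T_i, T_j) is already available as a matrix. The function name `bisimulation_term` and the toy inputs are hypothetical, and the truncated second term of the loss is not modeled.

```python
import numpy as np

def bisimulation_term(z_off, task_dist):
    """Mean over ordered task pairs (i != j) of (|z_i - z_j| - d_ij)^2.

    z_off:     (num_tasks, latent_dim) array of per-task latents.
    task_dist: (num_tasks, num_tasks) array of target task distances d_ij.
    """
    z_off = np.asarray(z_off, dtype=float)
    task_dist = np.asarray(task_dist, dtype=float)
    # Pairwise latent distances; using the L1 norm is an assumption of this sketch.
    latent_dist = np.abs(z_off[:, None, :] - z_off[None, :, :]).sum(axis=-1)
    sq_err = (latent_dist - task_dist) ** 2
    # Exclude the diagonal, where both distances are trivially zero.
    off_diag = ~np.eye(len(z_off), dtype=bool)
    return float(sq_err[off_diag].mean())

# Toy check with three hypothetical tasks and 1-D latents.
z = np.array([[0.0], [1.0], [3.0]])
d = np.abs(z - z.T)                      # targets that the latents already match
loss_matched = bisimulation_term(z, d)
loss_mismatched = bisimulation_term(z, 2.0 * d)
```

The term vanishes when latent distances already equal the task distances and grows as the two geometries drift apart, which is the property a metric-based representation relies on.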
Figure 4
Figure 4: An illustration for the structure of TAVT. … input $z_{\mathrm{off}}$, as described in Eq. (3). Both the task-preserving loss and VT construction are always based on $z^{\alpha}_{\mathrm{off}}$, derived from $z_{\mathrm{off}}$. Importantly, the off-policy task latent $z_{\mathrm{off}}$ conditions sample context generation, with its gradient disconnected to ensure stable training. By leveraging WGAN, we significantly reduce differences between generated and real sample…
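One way to picture how a virtual latent $z^{\alpha}_{\mathrm{off}}$ could be derived from the training latents is as an α-weighted mixture. The sketch below is an assumption and not the paper's Eq. (3), which this page does not reproduce: the weights sum to one, and a hypothetical `beta` greater than 1 lets them leave the simplex so that the mixture can extrapolate beyond the training tasks. All function names are illustrative.

```python
import numpy as np

def sample_mixing_weights(num_tasks, beta, rng):
    """Sample weights that sum to 1; beta > 1 allows negative weights (extrapolation).

    This weighting scheme is an assumption of the sketch, not taken from the source.
    """
    base = rng.dirichlet(np.ones(num_tasks))        # point on the simplex
    return beta * base - (beta - 1.0) / num_tasks   # still sums to 1

def virtual_latent(z_off, alpha):
    """Alpha-weighted combination of per-task latents, shape (latent_dim,)."""
    return np.asarray(alpha, dtype=float) @ np.asarray(z_off, dtype=float)

rng = np.random.default_rng(0)
z_off = rng.normal(size=(4, 5))    # 4 hypothetical training tasks, 5-D latents
alpha = sample_mixing_weights(num_tasks=4, beta=2.0, rng=rng)
z_alpha = virtual_latent(z_off, alpha)
```

With a one-hot `alpha` the function returns the latent of a single training task, so real tasks are a special case of this construction.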
Figure 5
Figure 5: (a) Q-function loss for VTs (b) Estimation bias for OOD tasks in the Walker-Mass-OOD environment. The virtual contexts $\hat{c}^{\alpha}$ in TAVT include both rewards and next states, enabling it to handle state-varying environments. However, inaccuracies in the task decoder can introduce errors in the Q-function, leading to overestimation bias, a common issue in offline RL (Fujimoto et al., 2019). While reward errors h…
Figure 6
Figure 6: (a-f) MuJoCo environments (g-h) ML1 environments. … In contrast, using $\epsilon_{\mathrm{reg}} = 1.0$ (full use of $\hat{s}'^{\alpha}$) results in higher Q-function loss and greater bias. Section 5 further demonstrates that this method improves OOD performance. 5. Experiments. In this section, we compare the proposed TAVT algorithm with various on-policy and off-policy meta-RL methods across MuJoCo (Todorov et al., 2012) and MetaWorld ML1 (Yu …
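The role of $\epsilon_{\mathrm{reg}}$ can be illustrated with a short sketch. The excerpt states only that $\epsilon_{\mathrm{reg}} = 1.0$ corresponds to full use of the generated next state; the linear interpolation toward the real next state used here is an assumption, and the function and variable names are hypothetical.

```python
import numpy as np

def regularize_next_state(s_next_real, s_next_virtual, eps_reg):
    """Blend a decoder-generated next state with the real one.

    eps_reg = 1.0 uses the virtual next state in full; eps_reg = 0.0 ignores it.
    The linear blend is an assumption of this sketch.
    """
    if not 0.0 <= eps_reg <= 1.0:
        raise ValueError("eps_reg must lie in [0, 1]")
    s_next_real = np.asarray(s_next_real, dtype=float)
    s_next_virtual = np.asarray(s_next_virtual, dtype=float)
    return eps_reg * s_next_virtual + (1.0 - eps_reg) * s_next_real

s_real = np.array([1.0, 2.0])   # hypothetical real next state
s_virt = np.array([3.0, 6.0])   # hypothetical decoder output
mixed = regularize_next_state(s_real, s_virt, eps_reg=0.25)
```

Under this reading, a small `eps_reg` keeps the Q-function target close to states that were actually observed, which limits how far decoder errors can propagate into value estimates.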
Figure 7
Figure 7: Performance comparison for MuJoCo environments. The graphs for on-policy algorithms represent their final performance. • Hopper/Walker-Mass-OOD: The Hopper/Walker agents are required to run forward with a scale $m_{\mathrm{scale}}$ in $M$ multiplied to their body mass. • ML1/ML1-OOD: The agent is tasked with reaching or pushing an object to a target goal position $g_{\mathrm{tar}}$, sampled from the 3D goal space $M$, which varies depending…
Figure 8
Figure 8: t-SNE visualization of task latents: (a) Cheetah-Vel-OOD (b) Walker-Mass-OOD (c) Ant-Dir-4 (d) Reach-OOD-Inter. For ML1 tasks in the 3D goal space, we provide both 3D representation for all tasks and 2D representation for tasks in the selected cross-section. For MetaWorld environments, …
Figure 9
Figure 9: Component evaluation on (a) Cheetah-Vel-OOD and (b) Walker-Mass-OOD environments
Figure 10
Figure 10: Context differences between real contexts and virtual contexts generated by the task decoder for OOD tasks: (a) Cheetah-Vel-OOD and (b) Walker-Mass-OOD environments
Figure 11
Figure 11: Performance comparison for various $\epsilon_{\mathrm{reg}}$ on (a) Walker-Mass-OOD (b) Hopper-Mass-OOD environments. … omits the on-off loss, and ‘Recon only’, which uses only reconstruction loss with DropOut as in LDM. The results demonstrate that removing any component significantly degrades performance, underscoring the importance of task representation learning and sample generation for OOD task generalization. In additio…
Original abstract

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Task-Aware Virtual Training (TAVT), a meta-RL algorithm that employs metric-based representation learning to capture and preserve task characteristics for both training and out-of-distribution (OOD) tasks, combined with a state regularization technique to address overestimation errors. It evaluates the approach on MuJoCo and MetaWorld environments and claims that TAVT yields significant improvements in generalization to OOD tasks relative to prior context-based meta-RL methods. Code is released at the provided GitHub link.

Significance. If the reported numerical gains hold under rigorous statistical scrutiny, TAVT would constitute a practical incremental advance in addressing OOD generalization, a recognized limitation of context-based meta-RL. The explicit release of code is a clear strength that facilitates reproducibility and follow-on work.

minor comments (3)
  1. The abstract asserts 'significant' numerical improvements yet supplies no concrete metrics, baselines, or statistical details; the experimental section should include these with error bars and significance tests to support the central claim.
  2. The precise definition of OOD tasks (e.g., how task parameters are shifted relative to the training distribution) should be stated explicitly, ideally with a table or equation, so that the metric-learning objective's claimed preservation of task characteristics can be evaluated.
  3. The state-regularization term is described at a high level; a short derivation or pseudocode showing how it interacts with the metric embedding loss would improve clarity without altering the method.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the code release, and recommendation for minor revision. The referee's description of TAVT is accurate. No specific major comments appear in the report, so we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity; empirical algorithm with no derivations

The paper presents TAVT as an empirical algorithm for meta-RL, validated solely by numerical results on MuJoCo and MetaWorld benchmarks. No equations, derivations, or mathematical claims appear in the abstract or method sketch. The reader's assessment confirms absence of reductions to fitted parameters or self-referential definitions. Central claims rest on experimental outcomes rather than any chain that collapses to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on parameters, axioms, or new entities; insufficient information available.

pith-pipeline@v0.9.0 · 5668 in / 1048 out tokens · 33969 ms · 2026-05-23T03:25:25.416222+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning

    cs.LG 2025-10 unverdicted novelty 5.0

    The work establishes OOD generalization bounds for meta-supervised learning and meta-RL that exploit MDP structure, then analyzes a gradient-based meta-RL algorithm.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Contrastive behavioral similarity embeddings for generalization in reinforcement learning

    Agarwal, R., Machado, M. C., Castro, P. S., and Bellemare, M. G. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. arXiv preprint arXiv:2101.05265, 2021

  2. [2]

    Distributionally adaptive meta reinforcement learning

    Ajay, A., Gupta, A., Ghosh, D., Levine, S., and Agrawal, P. Distributionally adaptive meta reinforcement learning. Advances in Neural Information Processing Systems, 35:25856--25869, 2022

  3. [3]

    Wasserstein generative adversarial networks

    Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214--223. PMLR, 2017

  4. [4]

    OpenAI Gym

    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  5. [5]

    Provable benefit of multitask representation learning in reinforcement learning

    Cheng, Y., Feng, S., Yang, J., Zhang, H., and Liang, Y. Provable benefit of multitask representation learning in reinforcement learning. Advances in Neural Information Processing Systems, 35:31741--31754, 2022

  6. [6]

    Contrabar: Contrastive bayes-adaptive deep rl

    Choshen, E. and Tamar, A. Contrabar: Contrastive bayes-adaptive deep rl. In International Conference on Machine Learning, pp. 6005--6027. PMLR, 2023

  7. [7]

    Offline meta reinforcement learning--identifiability challenges and effective data collection strategies

    Dorfman, R., Shenfeld, I., and Tamar, A. Offline meta reinforcement learning--identifiability challenges and effective data collection strategies. Advances in Neural Information Processing Systems, 34:4607--4618, 2021

  8. [8]

    RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

    Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016

  9. [9]

    Diversity is All You Need: Learning Skills without a Reward Function

    Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018

  10. [10]

    Fakoor, R., Chaudhari, P., Soatto, S., and Smola, A. J. Meta-q-learning. arXiv preprint arXiv:1910.00125, 2019

  11. [11]

    Bisimulation metrics are optimal value functions

    Ferns, N. and Precup, D. Bisimulation metrics are optimal value functions. In UAI, pp. 210--219, 2014

  12. [12]

    Bisimulation metrics for continuous markov decision processes

    Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662--1714, 2011

  13. [13]

    Model-agnostic meta-learning for fast adaptation of deep networks

    Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pp. 1126--1135. PMLR, 2017

  14. [14]

    Meta Learning Shared Hierarchies

    Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017

  15. [15]

    Towards effective context for meta-reinforcement learning: an approach based on contrastive learning

    Fu, H., Tang, H., Hao, J., Chen, C., Feng, X., Li, D., and Liu, W. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7457--7465, 2021

  16. [16]

    Meta-learning parameterized skills

    Fu, H., Yu, S., Tiwari, S., Littman, M., and Konidaris, G. Meta-learning parameterized skills. arXiv preprint arXiv:2206.03597, 2022

  17. [17]

    Off-policy deep reinforcement learning without exploration

    Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp. 2052--2062. PMLR, 2019

  18. [18]

    Context shift reduction for offline meta-reinforcement learning

    Gao, Y., Zhang, R., Guo, J., Wu, F., Yi, Q., Peng, S., Lan, S., Chen, R., Du, Z., Hu, X., et al. Context shift reduction for offline meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

  19. [19]

    Train hard, fight easy: Robust meta reinforcement learning

    Greenberg, I., Mannor, S., Chechik, G., and Meirom, E. Train hard, fight easy: Robust meta reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

  20. [20]

    Amago: Scalable in-context reinforcement learning for adaptive agents

    Grigsby, J., Fan, L., and Zhu, Y. Amago: Scalable in-context reinforcement learning for adaptive agents. In The Twelfth International Conference on Learning Representations, 2024

  21. [21]

    Cost-aware offline safe meta reinforcement learning with robust in-distribution online task adaptation

    Guan, C., Xue, R., Zhang, Z., Li, L., Li, Y.-C., Yuan, L., and Yu, Y. Cost-aware offline safe meta reinforcement learning with robust in-distribution online task adaptation. In AAMAS, pp. 743--751, 2024

  22. [22]

    Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. Advances in neural information processing systems, 30, 2017

  23. [23]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861--1870. PMLR, 2018

  24. [24]

    Bisimulation makes analogies in goal-conditioned reinforcement learning

    Hansen-Estruch, P., Zhang, A., Nair, A., Yin, P., and Levine, S. Bisimulation makes analogies in goal-conditioned reinforcement learning. In International Conference on Machine Learning, pp. 8407--8426. PMLR, 2022

  25. [25]

    Continuous meta-learning without tasks

    Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. Advances in neural information processing systems, 33:17571--17581, 2020

  26. [26]

    Decoupling meta-reinforcement learning with gaussian task contexts and skills

    He, H., Zhu, A., Liang, S., Chen, F., and Shao, J. Decoupling meta-reinforcement learning with gaussian task contexts and skills. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12358--12366, 2024

  27. [27]

    Hejna III, D. J. and Sadigh, D. Few-shot preference learning for human-in-the-loop rl. In Conference on Robot Learning, pp. 2014--2025. PMLR, 2023

  28. [28]

    Offline multitask representation learning for reinforcement learning

    Ishfaq, H., Nguyen-Tang, T., Feng, S., Arora, R., Wang, M., Yin, M., and Precup, D. Offline multitask representation learning for reinforcement learning. arXiv preprint arXiv:2403.11574, 2024

  29. [29]

    Doubly robust augmented transfer for meta-reinforcement learning

    Jiang, Y., Kan, N., Li, C., Dai, W., Zou, J., and Xiong, H. Doubly robust augmented transfer for meta-reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023

  30. [30]

    Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  31. [31]

    Meta Reinforcement Learning with Task Embedding and Shared Policy

    Lan, L., Li, Z., Guan, X., and Wang, P. Meta reinforcement learning with task embedding and shared policy. arXiv preprint arXiv:1905.06527, 2019

  32. [32]

    Curl: Contrastive unsupervised representations for reinforcement learning

    Laskin, M., Srinivas, A., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. In International conference on machine learning, pp. 5639--5650. PMLR, 2020

  33. [33]

    Improving generalization in meta-rl with imaginary tasks from latent dynamics mixture

    Lee, S. and Chung, S.-Y. Improving generalization in meta-rl with imaginary tasks from latent dynamics mixture. Advances in Neural Information Processing Systems, 34:27222--27235, 2021

  34. [34]

    Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition

    Lee, S., Cho, M., and Sung, Y. Parameterizing non-parametric meta-reinforcement learning tasks via subtask decomposition. Advances in Neural Information Processing Systems, 36:43356--43383, 2023

  35. [35]

    Multi-task batch reinforcement learning with metric learning

    Li, J., Vuong, Q., Liu, S., Liu, M., Ciosek, K., Christensen, H., and Su, H. Multi-task batch reinforcement learning with metric learning. Advances in neural information processing systems, 33:6197--6210, 2020a

  36. [36]

    Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization

    Li, L., Yang, R., and Luo, D. Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. arXiv preprint arXiv:2010.01112, 2020b

  37. [37]

    Model-based adversarial meta-reinforcement learning

    Lin, Z., Thomas, G., Yang, G., and Ma, T. Model-based adversarial meta-reinforcement learning. Advances in Neural Information Processing Systems, 33:10161--10173, 2020

  38. [38]

    Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions

    Liu, Q., Zhou, Q., Yang, R., and Wang, J. Robust representation learning by clustering with bisimulation metrics for visual reinforcement learning with distractions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8843--8851, 2023

  39. [39]

    Curriculum in gradient-based meta-reinforcement learning

    Mehta, B., Deleu, T., Raparthy, S. C., Pal, C. J., and Paull, L. Curriculum in gradient-based meta-reinforcement learning. arXiv preprint arXiv:2002.07956, 2020

  40. [40]

    Melo, L. C. Transformers are meta-reinforcement learners. In international conference on machine learning, pp. 15340--15359. PMLR, 2022

  41. [41]

    Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling

    Mendonca, R., Geng, X., Finn, C., and Levine, S. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv:2006.07178, 2020

  42. [42]

    Domino: Decomposed mutual information optimization for generalized context in meta-reinforcement learning

    Mu, Y., Zhuang, Y., Ni, F., Wang, B., Chen, J., Hao, J., and Luo, P. Domino: Decomposed mutual information optimization for generalized context in meta-reinforcement learning. Advances in Neural Information Processing Systems, 35:27563--27575, 2022

  43. [43]

    Skill-based meta-reinforcement learning

    Nam, T., Sun, S.-H., Pertsch, K., Hwang, S. J., and Lim, J. J. Skill-based meta-reinforcement learning. arXiv preprint arXiv:2204.11828, 2022

  44. [44]

    Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  45. [45]

    Packer, C., Abbeel, P., and Gonzalez, J. E. Hindsight task relabelling: Experience replay for sparse reward meta-rl. Advances in neural information processing systems, 34:2466--2477, 2021

  46. [46]

    Efficient off-policy meta-reinforcement learning via probabilistic context variables

    Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331--5340. PMLR, 2019

  47. [47]

    Efficient meta reinforcement learning for preference-based fast adaptation

    Ren, Z., Liu, A., Liang, Y., Peng, J., and Ma, J. Efficient meta reinforcement learning for preference-based fast adaptation. Advances in Neural Information Processing Systems, 35:15502--15515, 2022

  48. [48]

    Mamba: an effective world model approach for meta-reinforcement learning

    Rimon, Z., Jurgenson, T., Krupnik, O., Adler, G., and Tamar, A. Mamba: an effective world model approach for meta-reinforcement learning. arXiv preprint arXiv:2403.09859, 2024

  49. [49]

    Multi-task reinforcement learning with context-based representations

    Sodhani, S., Zhang, A., and Pineau, J. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767--9779. PMLR, 2021

  50. [50]

    Block contextual mdps for continual learning

    Sodhani, S., Meier, F., Pineau, J., and Zhang, A. Block contextual mdps for continual learning. In Learning for Dynamics and Control Conference, pp. 608--623. PMLR, 2022

  51. [51]

    Mujoco: A physics engine for model-based control

    Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026--5033. IEEE, 2012

  52. [52]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

  53. [53]

    Hindsight foresight relabeling for meta-reinforcement learning

    Wan, M., Peng, J., and Gangwani, T. Hindsight foresight relabeling for meta-reinforcement learning. In International Conference on Learning Representations, 2021

  54. [54]

    Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks

    Wang, L., Zhang, Y., Zhu, D., Coleman, S., and Kerr, D. Supervised meta-reinforcement learning with trajectory optimization for manipulation tasks. IEEE Transactions on Cognitive and Developmental Systems, 16(2):681--691, 2023a

  55. [55]

    Meta-reinforcement learning based on self-supervised task representation learning

    Wang, M., Bing, Z., Yao, X., Wang, S., Kai, H., Su, H., Yang, C., and Knoll, A. Meta-reinforcement learning based on self-supervised task representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10157--10165, 2023b

  56. [56]

    Dream to adapt: Meta reinforcement learning by latent context imagination and mdp imagination

    Wen, L., Tseng, E. H., Peng, H., and Zhang, S. Dream to adapt: Meta reinforcement learning by latent context imagination and mdp imagination. IEEE Robotics and Automation Letters, 9(11):9701--9708, 2024. doi:10.1109/LRA.2024.3417114

  57. [57]

    Meta-reinforcement learning robust to distributional shift via performing lifelong in-context learning

    Xu, T., Li, Z., and Ren, Q. Meta-reinforcement learning robust to distributional shift via performing lifelong in-context learning. In International Conference on Machine Learning, 2024

  58. [58]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094--1100. PMLR, 2020

  59. [59]

    Robust task representations for offline meta-reinforcement learning via contrastive learning

    Yuan, H. and Lu, Z. Robust task representations for offline meta-reinforcement learning via contrastive learning. In International Conference on Machine Learning, pp. 25747--25759. PMLR, 2022

  60. [60]

    Learning invariant representations for reinforcement learning without reconstruction

    Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021

  61. [61]

    Generalizable task representation learning for offline meta-reinforcement learning with data limitations

    Zhou, R., Gao, C.-X., Zhang, Z., and Yu, Y. Generalizable task representation learning for offline meta-reinforcement learning with data limitations. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 17132--17140, 2024

  62. [62]

    Varibad: A very good method for bayes-adaptive deep rl via meta-learning

    Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. arXiv preprint arXiv:1910.08348, 2019

  63. [63]

    Relabeling and policy distillation of hierarchical reinforcement learning

    Zou, Q., Zhao, X., Gao, B., Chen, S., Liu, Z., and Zhang, Z. Relabeling and policy distillation of hierarchical reinforcement learning. International Journal of Machine Learning and Cybernetics, pp. 1--17, 2024
