Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Sanghyeon Lee; Sangjun Bae; Seungyul Han; Yisak Park

SISL refines skills from noisy offline data via decoupled policies and return relabeling for stable meta-RL adaptation.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-23 03:19 UTC pith:SQBGSVP3

Load-bearing objection: SISL pairs decoupled policies with max-return relabeling to clean noisy offline skills in meta-RL, but the abstract gives no experiments to check if it works.

arxiv 2502.03752 v5 pith:SQBGSVP3 submitted 2025-02-06 cs.LG cs.AI


classification cs.LG cs.AI

keywords: meta-reinforcement learning · skill-based RL · noisy demonstrations · hierarchical policies · self-improving learning · long-horizon tasks · offline reinforcement learning

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Self-Improving Skill Learning (SISL) to make skill-based meta-reinforcement learning reliable when offline demonstrations contain noise. It decouples high-level decision making from skill improvement policies and applies maximum return relabeling to prioritize useful trajectories during updates. This combination produces more stable skill learning and stronger performance on long-horizon tasks than prior methods. A sympathetic reader would care because imperfect demonstration data is common in practice, and current skill-based approaches degrade quickly under such conditions.

Core claim

SISL performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data.

What carries the argument

Decoupled high-level and skill-improvement policies combined with maximum-return relabeling to identify and refine task-relevant skills from noisy demonstrations.
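As a rough illustration of how maximum-return relabeling could feed prioritized sampling, the sketch below scores each offline trajectory under several per-task reward estimators, keeps the maximum estimated return, and turns those scores into sampling probabilities with a softmax. The reward functions, the temperature parameter, the trajectory format, and every function name are assumptions made for illustration; the abstract does not specify SISL's actual estimator or sampling rule.

```python
import numpy as np

def relabel_max_return(trajectories, reward_fns):
    """Return, for each trajectory, the maximum estimated return over tasks.

    trajectories: list of arrays of shape (T, state_dim); hypothetical format.
    reward_fns: list of callables mapping a (T, state_dim) array to (T,) rewards,
                one per training task (stand-ins for learned reward models).
    """
    returns = np.array([[fn(traj).sum() for fn in reward_fns]
                        for traj in trajectories])   # shape (N, num_tasks)
    return returns.max(axis=1)

def priority_probs(scores, temperature=1.0):
    """Softmax over scores; a lower temperature concentrates mass on high scores."""
    z = (scores - scores.max()) / temperature   # subtract the max for stability
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
# Three toy trajectories of constant 1-D states.
trajs = [np.full((5, 1), 0.0), np.full((5, 1), 1.0), np.full((5, 1), 2.0)]
# Two toy tasks: reward is the negative distance to a task-specific goal.
reward_fns = [lambda s: -np.abs(s[:, 0] - 0.0),
              lambda s: -np.abs(s[:, 0] - 2.0)]

scores = relabel_max_return(trajs, reward_fns)      # max over tasks per trajectory
probs = priority_probs(scores, temperature=0.5)     # sampling distribution
batch_idx = rng.choice(len(trajs), size=4, p=probs)  # prioritized draw
```

In this toy setup the middle trajectory serves neither task well, so it receives a much lower sampling probability than the two trajectories that each match one goal.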

Load-bearing premise

That the combination of decoupled high-level and skill-improvement policies plus maximum-return relabeling can reliably identify and refine task-relevant skills from noisy offline demonstrations without ground-truth labels or additional clean data.

What would settle it

An experiment that adds increasing levels of action or state noise to the offline dataset and measures whether SISL still outperforms standard skill-based meta-RL baselines at every noise level.
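A minimal sketch of how such a noise sweep could be prepared, assuming zero-mean Gaussian noise is added to the recorded actions and that actions lie in [-1, 1]. The function name, the clipping range, and the listed noise levels are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def inject_action_noise(actions, sigma, rng, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise with std `sigma` to actions, then clip."""
    noisy = actions + rng.normal(0.0, sigma, size=actions.shape)
    return np.clip(noisy, low, high)

rng = np.random.default_rng(0)
clean_actions = np.zeros((10_000, 2))   # toy dataset: 10k two-dimensional actions
noise_levels = [0.0, 0.1, 0.2, 0.3]     # hypothetical sweep

# One corrupted copy of the dataset per noise level.
datasets = {s: inject_action_noise(clean_actions, s, rng) for s in noise_levels}
empirical_std = {s: float(d.std()) for s, d in datasets.items()}
```

Each corrupted dataset would then be used to train SISL and the baselines, and the resulting returns compared level by level.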


If this is right

Reliable skill learning occurs from noisy offline data without requiring ground-truth labels.
Consistent outperformance on diverse long-horizon tasks compared with other skill-based meta-RL methods.
Stable adaptation to unseen tasks holds even when demonstrations are suboptimal.
Noise effects are mitigated by focusing updates on trajectories with highest returns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The return-relabeling step could be tested as a plug-in module for other hierarchical RL algorithms facing noisy data.
Evaluating SISL on physical robot platforms would show whether the learned skills transfer under real sensor noise.
Extending the same self-improvement loop to non-meta settings might improve robustness in standard offline RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SISL pairs decoupled policies with max-return relabeling to clean noisy offline skills in meta-RL, but the abstract gives no experiments to check if it works.


The main takeaway is that this paper targets noise sensitivity in skill-based meta-RL for long-horizon tasks by adding a self-improvement stage. It decouples the high-level policy from a separate skill-improvement policy and uses maximum-return relabeling to focus on better trajectories from the offline data. That combination is the concrete addition the authors put forward over prior hierarchical approaches. It directly tackles a known issue where noisy demonstrations lead to unstable skills and poor adaptation.

The motivation is practical, and the components are motivated independently rather than circularly. The paper does a clean job laying out why existing methods struggle and how these two pieces could help without requiring ground-truth labels or extra clean data. That part reads as a reasonable incremental step for people already working in this corner of RL.

The soft spot is the lack of any visible results. The abstract states that SISL outperforms other methods on diverse tasks, yet there are no baselines listed, no ablation numbers, no statistical details, and no description of how the noise was introduced or measured in the experiments. The key assumption, that return relabeling will reliably surface task-relevant skills even when returns themselves may be corrupted, remains untested in what is shown. If early selections reinforce weak skills, the loop could make things worse instead of better. This is the part that needs the full paper's results to evaluate.

The work is aimed at RL researchers focused on robust skill learning from offline data in robotics or similar domains. A reader already following skill-based meta-RL might pick up the design choices for their own experiments. It does not look like a broad shift, but the targeted fix could be worth testing if the experiments hold up. Based on the abstract alone, I would not send this to peer review yet; the claims are too unsupported to justify referee time without seeing the actual evidence.

Referee Report

2 major / 0 minor

Summary. The paper proposes Self-Improving Skill Learning (SISL) to address noise sensitivity in skill-based meta-RL for long-horizon tasks. It introduces decoupled high-level and skill-improvement policies combined with maximum-return relabeling to enable self-guided refinement and prioritization of task-relevant trajectories from noisy offline demonstrations, claiming this yields reliable skill learning and consistent outperformance over prior skill-based meta-RL methods.

Significance. If the empirical claims hold, the work would offer a practical route to robust hierarchical meta-RL under imperfect data, a common real-world constraint. The public code release is a positive contribution that supports reproducibility.

major comments (2)

[Abstract] Abstract: the central claim that 'SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods' is presented without any experimental results, baselines, ablation studies, or statistical details visible in the manuscript. This absence prevents assessment of whether the decoupled policies and maximum-return relabeling actually mitigate noise as asserted.
[Method (implied from abstract description)] The method description relies on the assumption that maximum-return relabeling (without ground-truth labels or clean data) will reliably identify and refine task-relevant skills. No analysis or safeguards are provided against the possibility that corrupted returns or early high-level selections could amplify rather than filter noise, which is load-bearing for the robustness claim.
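The concern in the second comment can be made concrete with a toy simulation: when return estimates are corrupted, picking the trajectory with the highest estimated return selects the truly best one less often. The setup below (ten candidates with evenly spaced true returns, Gaussian corruption) is entirely hypothetical and says nothing about how SISL itself behaves.

```python
import numpy as np

def selection_accuracy(true_returns, noise_std, trials, rng):
    """Fraction of trials in which the argmax of noisy returns is the true best."""
    best = int(np.argmax(true_returns))
    noise = rng.normal(0.0, noise_std, size=(trials, len(true_returns)))
    picks = np.argmax(true_returns + noise, axis=1)   # selection under corruption
    return float(np.mean(picks == best))

rng = np.random.default_rng(0)
true_returns = np.linspace(0.0, 1.0, 10)   # the last candidate is truly best

acc_clean = selection_accuracy(true_returns, 0.0, 2000, rng)   # no corruption
acc_noisy = selection_accuracy(true_returns, 5.0, 2000, rng)   # heavy corruption
```

With no corruption the best candidate is always selected; once the noise scale dwarfs the spread of true returns, the selection is close to a random pick.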

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the full manuscript and indicate planned revisions.


Referee: [Abstract] Abstract: the central claim that 'SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods' is presented without any experimental results, baselines, ablation studies, or statistical details visible in the manuscript. This absence prevents assessment of whether the decoupled policies and maximum-return relabeling actually mitigate noise as asserted.

Authors: The abstract is a concise summary; the full manuscript (Sections 4–5) contains the requested details: quantitative comparisons against skill-based meta-RL baselines, ablation studies isolating the decoupled policies and maximum-return relabeling, and statistical results (means and standard errors over 5–10 seeds) on long-horizon tasks with injected noise. These experiments directly support the claim that the components mitigate noise. We will revise the abstract to include a short clause referencing the empirical validation. revision: partial

Referee: [Method (implied from abstract description)] The method description relies on the assumption that maximum-return relabeling (without ground-truth labels or clean data) will reliably identify and refine task-relevant skills. No analysis or safeguards are provided against the possibility that corrupted returns or early high-level selections could amplify rather than filter noise, which is load-bearing for the robustness claim.

Authors: The paper presents empirical evidence that maximum-return relabeling improves performance under noisy demonstrations, but we agree that explicit analysis of failure modes (e.g., early mis-selection amplifying noise) is missing. We will add a short discussion subsection and potential safeguards (return thresholding, periodic re-evaluation of high-level selections) in the revised manuscript. revision: yes
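For readers checking the seed-level statistics mentioned in the first response, the mean and standard error across seeds can be computed as follows. The five return values are made up for illustration and are not results from the paper.

```python
import numpy as np

def mean_and_stderr(per_seed_returns):
    """Mean and standard error of the mean across independent seeds."""
    x = np.asarray(per_seed_returns, dtype=float)
    mean = x.mean()
    stderr = x.std(ddof=1) / np.sqrt(len(x))   # sample std, hence ddof=1
    return float(mean), float(stderr)

# Made-up final returns from five seeds, for illustration only.
mean, stderr = mean_and_stderr([3.0, 3.5, 2.5, 3.0, 3.0])
```

With only 5 to 10 seeds the standard error is itself a noisy estimate, so small gaps between methods should be read with care.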

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The provided abstract and context describe SISL as a proposed combination of decoupled high-level and skill-improvement policies with maximum-return relabeling to handle noisy offline data. No equations, self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations are present that would reduce the claimed performance gains to the method's own inputs by construction. The central claims rest on the empirical effectiveness of the introduced components rather than any self-referential reduction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard RL assumptions (MDP formulation, policy gradient updates) that are not detailed here.

pith-pipeline@v0.9.0 · 5682 in / 1090 out tokens · 38119 ms · 2026-05-23T03:19:49.108094+00:00 · methodology


Original abstract

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://epsilog.github.io/SISL.
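To make the hierarchical decision-making described in the abstract concrete, here is a minimal rollout loop in which a high-level policy picks a skill every few steps and a low-level policy turns that skill into primitive actions. The toy 1-D environment, both policies, and the skill length are invented for illustration and are not taken from the paper.

```python
import numpy as np

def hierarchical_rollout(high_policy, low_policy, step_fn, s0, horizon, skill_len):
    """Roll out for `horizon` steps, re-selecting a skill every `skill_len` steps."""
    s, skills, states = s0, [], [s0]
    for t in range(horizon):
        if t % skill_len == 0:      # high-level decision point
            z = high_policy(s)
            skills.append(z)
        a = low_policy(s, z)        # low-level action conditioned on the skill
        s = step_fn(s, a)
        states.append(s)
    return np.array(states), skills

# Toy setup: 1-D position, skill z is a direction in {-1, +1}, goal at x = 3.
goal = 3.0
high_policy = lambda s: 1.0 if s < goal else -1.0
low_policy = lambda s, z: 0.5 * z   # fixed step size in the skill's direction
step_fn = lambda s, a: s + a

states, skills = hierarchical_rollout(high_policy, low_policy, step_fn,
                                      s0=0.0, horizon=10, skill_len=4)
```

The high-level policy acts only three times over ten environment steps here, which is the sense in which skills shorten the effective decision horizon.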

Figures

Figures reproduced from arXiv: 2502.03752 by Sanghyeon Lee, Sangjun Bae, Seungyul Han, Yisak Park.

**Figure 2.** Comparison of prior skill learning methods in microwave-opening task: (a) Learned skills

**Figure 3.** The SISL framework. Panel labels recoverable from the figure: Maximum Return Relabeling, Relabeled Return, Prioritized Sampling.

**Figure 5.** Considered long-horizon environments. We evaluate SISL across four long-horizon, multi-task environments: Kitchen and Maze2D from Nam et al. (2022), and Office and AntMaze, newly introduced in this work.

**Figure 6.** Learning curves of the meta-train and meta-test phases on Kitchen (...

**Figure 7.** Visualization of buffer mixing coefficient

**Figure 8.** Component evaluation. Panels: (a) Kitchen (σ = 0.3), (b) Maze2D (σ = 1.5). x-axis: Iteration; y-axis label: Test Average Return. Curves: T = 0.1, T = 0.5, T = 1.0, T = 2.0.


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

[1]

Variational Option Discovery Algorithms

Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299,
[2]

Variational skill embeddings for meta reinforcement learning

Jen-Tzung Chien and Weiwei Lai. Variational skill embeddings for meta reinforcement learning. In 2023 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE,

[3]

Hierarchical meta-reinforcement learning via automated macro-action discovery

Minjae Cho and Chuangchuang Sun. Hierarchical meta-reinforcement learning via automated macro-action discovery. arXiv preprint arXiv:2412.11930,
[4]

RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,
[5]

Diversity is all you need: Learning skills without a reward function

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations,

(Page header of the extracted source: Under review as a conference paper at ICLR 2026.)
[6]

Mghrl: Meta goal-generation for hierarchical reinforcement learning

Haotian Fu, Hongyao Tang, Jianye Hao, Wulong Liu, and Chen Chen. Mghrl: Meta goal-generation for hierarchical reinforcement learning. In Distributed Artificial Intelligence: Second International Conference, DAI 2020, Nanjing, China, October 24–27, 2020, Proceedings 2, pp. 29–39. Springer, 2020a. Haotian Fu, Shangqun Yu, Saket Tiwari, Michael Littman, and...
[7]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020b. Jonas Gehring, Gabriel Synnaeve, Andreas Krause, and Nicolas Usunier. Hierarchical skills for efficient exploration. Advances in Neural Information Processing Systems, 34:11553–11564,
[8]

Variational Intrinsic Control

Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507,
[9]

Unsupervised meta-learning for reinforcement learning

Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640,
[10]

Efficient and stable offline-to-online reinforcement learning via continual policy revitalization

Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, and Ming Li. Efficient and stable offline-to-online reinforcement learning via continual policy revitalization. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pp. 4317–4325,
[11]

Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling

Russell Mendonca, Xinyang Geng, Chelsea Finn, and Sergey Levine. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv:2006.07178,
[12]

Efficient off-policy meta-reinforcement learning via probabilistic context variables

Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331–5340. PMLR,
[13]

Residual skill policies: Learning an adaptable skill-based action space for reinforcement learning for robotics

Krishan Rana, Ming Xu, Brendan Tidd, Michael Milford, and Niko Sünderhauf. Residual skill policies: Learning an adaptable skill-based action space for reinforcement learning for robotics. In Conference on Robot Learning, pp. 2095–2104. PMLR,
[14]

Hierarchical transformers are efficient meta-reinforcement learners

Gresa Shala, André Biedenkapp, and Josif Grabocka. Hierarchical transformers are efficient meta-reinforcement learners. arXiv preprint arXiv:2402.06402,
[15]

Generalizable task representation learning for offline meta-reinforcement learning with data limitations

Renzhe Zhou, Chen-Xiao Gao, Zongzhang Zhang, and Yang Yu. Generalizable task representation learning for offline meta-reinforcement learning with data limitations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 17132–17140,
[16]

We used large language models only for copy editing to improve spelling and readability, and we verified all suggested revisions before incorporation

A. The Use of Large Language Models (LLMs). We wrote the entire manuscript ourselves, including the main text and the appendix. We used large language models only for copy editing to improve spelling and readability, and we verified all suggested revisions before incorporation. LLMs were not used to generate id...
[17]

The low-level skill policy $\pi_{l,\phi}$, skill encoder $q_\phi$, and skill prior $p_\phi$ are parameterized by $\phi$ and trained using the following loss function (modified from Eq

B.1 Initial Skill Learning Phase. Following SPiRL (Pertsch et al., 2021), introduced in Section 3, we train initial skills using the offline dataset $\mathcal{B}_{\mathrm{off}}$. The low-level skill policy $\pi_{l,\phi}$, skill encoder $q_\phi$, and skill prior $p_\phi$ are parameterized by $\phi$ and trained using the following loss function (modified from Eq. (1)): $\mathcal{L}_{\mathrm{spirl}}(\phi) := \mathbb{E}_{(s_{t:t+H_s},\, a_{t:t+H_s}) \sim \mathcal{B}_{\mathrm{off}},\ z \sim q_\phi}$...
[18]

A trajectory $\tau^i$ is added to $\mathcal{B}^i_{\mathrm{on}}$ if its return $G(\tau^i)$ exceeds the minimum return in the buffer, $\min_{\tau' \in \mathcal{B}^i_{\mathrm{on}}} G(\tau')$

Self-Improvement Skill Learning. To extract better trajectories and learn skills that effectively solve tasks, the online buffer $\mathcal{B}^i_{\mathrm{on}}$ selectively stores high-return trajectories collected during the meta-training phase through the execution of the low-level policy $\pi_{l,\phi}$ and the skill-improvement policy $\pi_{\mathrm{imp},\psi}$. A trajectory $\tau^i$ is added to $\mathcal{B}^i_{\mathrm{on}}$ if its ret...
[19]

Instead of a standard value function, SAC uses a soft value function that combines entropy, with the entropy coefficient adjusted automatically to maintain the target entropy

is a reinforcement learning algorithm that incorporates entropy to improve exploration. Instead of a standard value function, SAC uses a soft value function that combines entropy, with the entropy coefficient adjusted automatically to maintain the target entropy. To enhance value function estimation, SAC employs double Q-learning, using two independent ...
[20]

is a skill-based meta-RL algorithm that uses both offline datasets and meta-train tasks. While it shares SISL’s approach of extracting reusable skills and performing meta-train and meta-test phases, SiMPL fixes the skill model without further updates during meta-training. SiMPL’s loss function is also detailed in Section 3, and SiMPL’s implementation use...
[21]

Additionally, for implementing the SISL, we utilize MLP structures for $\pi_{\mathrm{imp}}$, $Q_{\mathrm{imp}}$, and $\hat{R}$

for the task encoder and MLP structures for the high-level policy and value function. Additionally, for implementing the SISL, we utilize MLP structures for $\pi_{\mathrm{imp}}$, $Q_{\mathrm{imp}}$, and $\hat{R}$. The detailed hidden network sizes are presented in Table C.3 and Table C.4. Table C.3 presents the network architectures (the number of nodes in fully connected layers) and the h...
[22]

SISL consistently demonstrated superior robustness, outperforming all baselines across various environments and noise levels

Rows represent evaluation environments, and columns denote noise levels. SISL consistently demonstrated superior robustness, outperforming all baselines across various environments and noise levels. At higher noise levels such as Noise(σ = 0.2), Noise(σ = 0.3) for Kitchen and Office, and Noise(σ = 1.0), Noise(σ = 1.5) for Maze2D and AntMaze, significant per...
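The admission rule quoted in entry [18] (a trajectory enters the online buffer when its return exceeds the buffer's current minimum) maps naturally onto a min-heap. The sketch below is one way to implement that rule under stated assumptions: the fixed capacity, the fill-until-full behavior, and the eviction of the lowest-return trajectory are guesses, since the excerpt is truncated, and the class and method names are hypothetical.

```python
import heapq
import itertools

class TopReturnBuffer:
    """Fixed-capacity buffer that keeps the highest-return trajectories seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                      # min-heap of (return, tiebreak, trajectory)
        self._counter = itertools.count()    # tiebreak so trajectories are never compared

    def min_return(self):
        """Lowest return currently stored, or -inf for an empty buffer."""
        return self._heap[0][0] if self._heap else float("-inf")

    def add(self, trajectory, ret):
        """Insert if there is room or if `ret` exceeds the current minimum return."""
        item = (ret, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:      # assumed: fill until full
            heapq.heappush(self._heap, item)
            return True
        if ret > self.min_return():
            heapq.heapreplace(self._heap, item)  # assumed: evict the lowest return
            return True
        return False

    def returns(self):
        """Stored returns in ascending order."""
        return sorted(r for r, _, _ in self._heap)

buf = TopReturnBuffer(capacity=3)
accepted = [buf.add(f"traj{i}", r) for i, r in enumerate([1.0, 3.0, 2.0, 0.5, 4.0])]
```

The heap keeps both the minimum lookup and the replacement cheap, which matters when every collected trajectory is checked against the buffer.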
