Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Fuyuan Qian; Menglong Zhang; Quanying Liu; Song Wang

arXiv: 2606.00780 · v1 · pith:X4ZFC2CRnew · submitted 2026-05-30 · cs.LG · cs.AI

Pith reviewed 2026-06-28 19:24 UTC · model grok-4.3

keywords offline meta-reinforcement learning · task representation learning · Transformer world model · context distribution shift · behavior-invariant latents · conservative value penalty · sparse rewards

The pith

Task representations from a Transformer world model stay the same regardless of the policy that collected the offline data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to remove the dependence of learned task features on the behavior policy that produced a static dataset, so that meta-RL agents can adapt to new tasks without being misled by shifts in how the data were gathered. It does this by training a Transformer-based stochastic world model under an information-theoretic objective that isolates task-defining latent variables. A conservative penalty is then added to value estimates obtained from imagined rollouts to keep the policy from exploiting inaccuracies in the learned model. If the invariance holds, agents should generalize reliably even when rewards are sparse and the test environments differ from the training distribution.

Core claim

Integrating an information-theoretic task representation objective with a Transformer-based stochastic world model produces latent variables whose distribution is independent of the behavior policy that generated the offline data, thereby mitigating context distribution shift; a conservative value penalty on imagination-based rollouts simultaneously limits exploitation of model error and supports robust adaptation.

What carries the argument

The information-theoretic objective applied inside the Transformer-based stochastic world model, which isolates behavior-invariant latent variables that define each task.
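This page does not reproduce the paper's objective, so the following is only a minimal sketch of one common information-theoretic surrogate: the InfoNCE contrastive loss, whose minimization maximizes a lower bound on the mutual information between paired views. The function name `info_nce_loss`, the `temperature` parameter, and the idea of pairing context embeddings with task latents are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE loss: row i of `queries` should match row i of `keys`.

    Both arrays have shape (batch, dim). The other rows in the batch
    act as negatives for each query.
    """
    # L2-normalise so the dot product becomes a cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                        # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for query i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy check (hypothetical data): aligned pairs versus deliberately mismatched pairs.
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
aligned_loss = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))
mismatched_loss = info_nce_loss(z, np.roll(z, 1, axis=0))
print(aligned_loss, mismatched_loss)
```

Minimizing this loss pulls each query toward its own key and away from the other keys in the batch. It rewards informativeness; taken alone it does not enforce independence from the behavior policy.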

If this is right

Agents can adapt to unseen tasks from static datasets without suffering from context distribution shift caused by the original data-collection policy.
The conservative penalty keeps the policy from exploiting errors in the world model during imagination-based planning.
Performance remains stable under out-of-distribution tasks and sparse-reward conditions.
The overall method yields higher success rates and better generalization than prior offline meta-RL approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the latent variables truly factor out policy effects, the same information-theoretic objective could be inserted into other model-based offline algorithms to reduce distribution shift.
A direct test would measure whether the learned latents predict task identity better than they predict statistics of the collecting policy.
Replacing the Transformer backbone with a different sequence model would show whether the invariance depends on the specific architecture.
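The probe comparison suggested above can be sketched with two simple classifiers trained on the same latents, one predicting task identity and one predicting which behavior policy collected the context. This is a hedged illustration on synthetic data: `centroid_probe_accuracy`, the nearest-centroid probe, and the toy latents are assumptions, not part of the paper.

```python
import numpy as np

def centroid_probe_accuracy(latents, labels, train_frac=0.5, seed=0):
    """Held-out accuracy of a nearest-centroid probe predicting `labels` from `latents`."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(latents))
    n_train = int(train_frac * len(latents))
    train, test = order[:n_train], order[n_train:]
    classes = np.unique(labels[train])
    centroids = np.stack(
        [latents[train][labels[train] == c].mean(axis=0) for c in classes]
    )
    # Distance from every held-out latent to every class centroid.
    dists = np.linalg.norm(
        latents[test][:, None, :] - centroids[None, :, :], axis=2
    )
    predictions = classes[dists.argmin(axis=1)]
    return float(np.mean(predictions == labels[test]))

# Synthetic latents (hypothetical): they depend on the task, not on the policy.
rng = np.random.default_rng(0)
n = 400
task_id = rng.integers(0, 4, size=n)
policy_id = rng.integers(0, 2, size=n)
task_means = 5.0 * np.eye(4, 8)
latents = task_means[task_id] + 0.5 * rng.normal(size=(n, 8))

task_acc = centroid_probe_accuracy(latents, task_id)
policy_acc = centroid_probe_accuracy(latents, policy_id)
print(f"task probe accuracy:   {task_acc:.2f}")
print(f"policy probe accuracy: {policy_acc:.2f}")
```

On real latents, a task probe near ceiling together with a policy probe near chance would be consistent with behavior invariance; a policy probe well above chance would count against it.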

Load-bearing premise

An information-theoretic penalty on the world model will force the extracted latent variables to have the same distribution no matter which policy collected the offline data.

What would settle it

Train the model on two separate offline datasets for identical tasks that were collected by behavior policies with clearly different state-action distributions; if the resulting latent distributions differ by more than sampling noise, the invariance claim is false.
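One way to make "differ by more than sampling noise" operational is a two-sample test on the latents inferred from the two datasets. The sketch below uses a kernel maximum mean discrepancy (MMD) statistic with a permutation test; the choice of MMD, the RBF bandwidth, the function names, and the synthetic latents are all assumptions, since no particular test is specified on this page.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y (RBF kernel)."""
    def kernel(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def permutation_p_value(x, y, n_perm=200, seed=0, bandwidth=1.0):
    """Share of label-shuffled MMD values at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(x, y, bandwidth)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        shuffled = rbf_mmd2(pooled[perm[:len(x)]], pooled[perm[len(x):]], bandwidth)
        count += int(shuffled >= observed)
    return (count + 1) / (n_perm + 1)

# Hypothetical latents for one task under two behavior policies.
rng = np.random.default_rng(1)
latents_policy_a = rng.normal(size=(100, 4))
latents_policy_b_same = rng.normal(size=(100, 4))
latents_policy_b_shifted = rng.normal(loc=3.0, size=(100, 4))

p_same = permutation_p_value(latents_policy_a, latents_policy_b_same)
p_shifted = permutation_p_value(latents_policy_a, latents_policy_b_shifted)
print(f"p-value, same distribution:    {p_same:.3f}")
print(f"p-value, shifted distribution: {p_shifted:.3f}")
```

A small p-value on real latents would indicate that the latent distribution still depends on the collecting policy, which is the outcome the invariance claim rules out.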

Figures

Figures reproduced from arXiv: 2606.00780 by Fuyuan Qian, Menglong Zhang, Quanying Liu, Song Wang.

**Figure 1.** Modeling latent task dynamics with a world model facilitates the capture of behavior-invariant task information, thereby supporting task inference and enabling rapid adaptation. Compared to online meta-RL, offline meta-RL is required to model a static task distribution and ensure that policies learned from offline data can robustly transfer to un…

**Figure 2.** MetaSTAR framework. (a) The world model is responsible not only for inferring task representation from context but also for augmenting the data distribution via imagination. (b) The Transformer module simultaneously handles temporal encoding and dynamic prediction, ensuring that the extracted task representations are causal.

**Figure 3.** Average online meta-testing performance on 4 dense-reward environments and 4 sparse-reward environments. Evaluation protocols: in the meta-testing phase, we employ two evaluation protocols, offline test and online test. Offline test is an idealized but impractical evaluation method. It directly uses pre-collected offline data as the context, thus ignoring the context shift problem, and usually achieves hi…

**Figure 4.** Average online adaptation performance of the first five episodes on out-of-distribution tasks. To address the second question, we evaluate MetaSTAR under out-of-distribution online adaptation, where policy distribution shift becomes more challenging.

**Figure 5.** t-SNE visualization of the learned task representation space in Hopper-Param. To qualitatively examine the property of behavior invariance, we used t-SNE (Maaten & Hinton, 2008) to visualize the learned task representations.

**Figure 7.** Average offline meta-testing performance on 4 dense-reward environments and 4 sparse-reward environments.

**Figure 8.** Average online adaptation performance of the first five episodes on in-distribution tasks.

**Figure 9.** t-SNE visualization of the task representation with online testing in dense-reward environments.

**Figure 10.** t-SNE visualization of the task representation with online testing in sparse-reward environments.

**Figure 11.** Euclidean distance of task representations on Cheetah-Vel. Cheetah-Vel is a relatively simple dense-reward task, and most methods capture a meaningful distance pattern to some extent: tasks with similar target velocities tend to be closer in the learned representation space, while tasks with larger velocity gaps are generally farther apart. MetaSTAR also preserves this locally smooth structure, indicating…

**Figure 12.** Euclidean distance of task representations on Point-Robot-Sparse.

**Figure 13.** Ablation study on online testing across 8 environments, comparing MetaSTAR with variants that lack conservative policy optimization and/or ℒ_WM. Legend: MetaSTAR; w/o ℒ_WM; w/o Conservative Policy Optimization; w/o ℒ_WM and Conservative Policy Optimization.

**Figure 14.** Ablation study on offline testing across 8 environments, comparing MetaSTAR with variants that lack conservative policy optimization and/or ℒ_WM.

**Figure 15.** Effect of removing the previous reward r_{t−1} from the world-model observation on Cheetah-Vel, Cheetah-Vel-Sparse, and Point-Robot-Sparse. We additionally ablate the use of the previous reward r_{t−1} in the world-model observation, where the observation is defined as o_t = [s_t, r_{t−1}].

**Figure 16.** Different hyperparameter settings of λ on Cheetah-Vel-Sparse. We conduct a sensitivity analysis on the contrastive loss weight λ in Eq. (8).

**Figure 17.** Effect of the conservative penalty weight β on Cheetah-Vel, Cheetah-Vel-Sparse, and Point-Robot-Sparse. Mechanistically, β controls the strength of pessimistic regularization on unsupported state-action regions, forming a trade-off between exploiting world-model imagination and suppressing model-error propagation.

**Figure 18.** Effect of the context length L in contextual imagination on Cheetah-Vel, Cheetah-Vel-Sparse, and Point-Robot-Sparse. We also investigate the role of the real context length L in contextual imagination by comparing L = 1 and L = 8.

**Figure 19.** Effect of the imagination horizon H on Cheetah-Vel, Cheetah-Vel-Sparse, and Point-Robot-Sparse. … Point-Robot-Sparse, the influence of H is also relatively mild, suggesting that MetaSTAR is not highly sensitive to the imagination horizon within the tested range. This result reflects the trade-off in imagination-based policy optimization. A longer horizon provides more imagined transitions for policy learnin…

Original abstract

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper combines info-theoretic task reps with Transformer world models and conservative penalties for offline meta-RL, but the behavior-invariance claim rests on an assumption rather than an explicit constraint.

The main point is a framework that learns task-defining latents claimed to be invariant to the behavior policy, using an information-theoretic objective inside a Transformer stochastic world model, then adds a conservative value penalty on imagined rollouts to limit exploitation of model errors.

It does a reasonable job spelling out the context-shift and policy-shift problems that arise in offline meta-RL, especially with sparse rewards, and shows how a combined representation-plus-planning approach might address them. The choice of Transformers for the world model fits the sequential nature of the data, and the conservative penalty is a standard way to stay safe during planning.

The soft spot is the invariance claim. The abstract states that the latents are invariant to the behavior policy, yet standard mutual-information objectives encourage informativeness without forcing the latent distribution to be independent of the policy that generated the trajectories. The stress-test note is right that an extra term—an adversarial discriminator on policy features or an explicit KL between latents from different behavior policies—would be needed to derive that property, and nothing in the description indicates such a term is present. Without it, invariance is an assumption, not a result. The evaluations are reported to beat prior methods on OOD and sparse-reward cases, but the contribution of the invariance piece versus the other components cannot be judged from the given information.
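To make the missing ingredient concrete, an objective with an explicit invariance term might look like the following. This is a hedged illustration of the kind of regularizer the note describes, not an equation from the paper; the encoder q_θ, the mutual-information estimate Î_θ, the context variable c, and the weight α are all assumed notation.

```latex
\mathcal{L}(\theta)
  = -\,\hat{I}_\theta(c;\, z)
  + \alpha\, D_{\mathrm{KL}}\!\left(
      q_\theta(z \mid \pi_{b_1}) \,\middle\|\, q_\theta(z \mid \pi_{b_2})
    \right)
```

The first term alone rewards latents that are informative about the context. Only the second term penalizes a gap between the latent distributions obtained under two behavior policies π_b1 and π_b2 for the same task.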

This is for people already working in offline meta-RL who need tools for distribution shift. A reader in that niche could get incremental ideas from the combination, though the paper would need the full equations and ablations before it changes anyone's approach.

Send it to peer review so referees can inspect the objective and check whether the invariance actually holds in the experiments.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a framework for offline meta-reinforcement learning that integrates an information-theoretic objective for task representation learning inside a Transformer-based stochastic world model. The central claim is that this produces task latents z that are invariant to the behavior policy generating the offline data, thereby mitigating context distribution shift; a conservative value penalty is then applied to imagination-based rollouts to address policy shift and prevent model exploitation. The method is reported to outperform prior approaches on out-of-distribution and sparse-reward settings.

Significance. If the invariance property is actually realized by the objective and the conservative penalty demonstrably prevents exploitation without sacrificing adaptation, the work would address two load-bearing obstacles in offline meta-RL and could improve generalization from static datasets.

major comments (2)

[Abstract] The claim that the learned latents are 'invariant to behavior policy' is presented without any explicit invariance term (e.g., an adversarial discriminator on policy features or a KL(p(z|π_b1) || p(z|π_b2)) regularizer). Standard mutual-information objectives maximize I(context; z) but do not cancel dependence on π_b; no derivation or equation is supplied showing how invariance follows.
[Abstract] The conservative value penalty is described only at the level of 'preventing the policy from exploiting model inaccuracies.' Without the precise functional form (e.g., whether it is a penalty on the Q-function, a CQL-style term, or a model-uncertainty bonus) it is impossible to verify that the penalty simultaneously blocks exploitation and preserves the adaptation claimed in the meta-RL setting.

minor comments (1)

The abstract states that 'extensive evaluations demonstrate' superiority but supplies no benchmark names, dataset sizes, or quantitative metrics, preventing any assessment of the strength of the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these precise observations on the abstract. Both comments identify areas where the abstract's wording exceeds what is explicitly derived or specified in the manuscript. We will revise the abstract accordingly.

Referee: [Abstract] The claim that the learned latents are 'invariant to behavior policy' is presented without any explicit invariance term (e.g., an adversarial discriminator on policy features or a KL(p(z|π_b1) || p(z|π_b2)) regularizer). Standard mutual-information objectives maximize I(context; z) but do not cancel dependence on π_b; no derivation or equation is supplied showing how invariance follows.

Authors: The referee is correct that a standard mutual-information objective does not automatically cancel dependence on the behavior policy π_b, and the manuscript supplies neither an explicit invariance regularizer nor a derivation establishing invariance. The abstract therefore overstates the property. We will revise the abstract to replace the phrase 'invariant to behavior policy' with 'that mitigate context distribution shift arising from behavior policy variations', consistent with the experimental claims and the introduction. A clarifying sentence on the objective's limitations will be added to the methods section.
revision: yes

Referee: [Abstract] The conservative value penalty is described only at the level of 'preventing the policy from exploiting model inaccuracies.' Without the precise functional form (e.g., whether it is a penalty on the Q-function, a CQL-style term, or a model-uncertainty bonus) it is impossible to verify that the penalty simultaneously blocks exploitation and preserves the adaptation claimed in the meta-RL setting.

Authors: We agree that the abstract gives only a high-level description and does not state the functional form. The precise implementation appears in the methods; to address the comment we will expand the abstract by one sentence to indicate that the penalty is applied to Q-value estimates on imagined rollouts. This revision will make the mechanism verifiable from the abstract alone while preserving conciseness.
revision: yes
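As a rough illustration of what "a penalty applied to Q-value estimates on imagined rollouts" could mean, the sketch below adds a COMBO-style conservative gap to an ordinary Bellman error. The functional form, the function name `conservative_critic_loss`, and the toy numbers are assumptions; the weight is called `beta` only to echo the β of Figure 17, and the paper's actual penalty is not given on this page.

```python
import numpy as np

def conservative_critic_loss(q_real, td_target_real, q_imagined, beta=1.0):
    """Bellman error on dataset transitions plus a conservative gap (assumed form).

    q_real:         Q(s, a) on transitions from the offline dataset
    td_target_real: bootstrapped targets for those transitions
    q_imagined:     Q(s, a) on state-action pairs from world-model rollouts
    beta:           weight of the conservative term
    """
    bellman_error = np.mean((q_real - td_target_real) ** 2)
    # Minimizing this gap pushes Q down on imagined pairs and up on dataset pairs.
    conservative_gap = np.mean(q_imagined) - np.mean(q_real)
    return float(bellman_error + beta * conservative_gap)

# Toy values (hypothetical): the critic is optimistic on imagined pairs.
q_real = np.array([1.0, 2.0, 3.0])
td_target_real = np.array([1.0, 2.0, 3.0])
q_imagined = np.array([4.0, 5.0, 6.0])

loss_with_penalty = conservative_critic_loss(q_real, td_target_real, q_imagined, beta=1.0)
loss_without_penalty = conservative_critic_loss(q_real, td_target_real, q_imagined, beta=0.0)
print(loss_with_penalty, loss_without_penalty)
```

With `beta=0` the loss reduces to the plain Bellman error, so the critic has no incentive to distrust imagined pairs; larger `beta` values trade imagination-driven improvement for pessimism.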

Circularity Check

0 steps flagged

No circularity; invariance presented as modeling outcome without definitional reduction

The provided abstract and context describe an information-theoretic objective inside a Transformer world model whose output is asserted to be behavior-invariant latents. No equations, fitted parameters renamed as predictions, or self-citation chains are visible that would make the invariance equivalent to the input objective by construction. The claim is therefore a modeling assertion rather than a tautological step, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations or details on parameters or assumptions.

pith-pipeline@v0.9.1-grok · 6490 in / 1021 out tokens · 41495 ms · 2026-06-28T19:24:58.655643+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 13 canonical work pages · 3 internal anchors

[1] Burchi, M. and Timofte, R. Mudreamer: Learning predictive world models without reconstruction. arXiv preprint arXiv:2405.15083.

[2] Burchi, M. and Timofte, R. Learning transformer-based world models with contrastive predictive coding. arXiv preprint arXiv:2503.04416, 2025.

[3] Chen, C., Wu, Y.-F., Yoon, J., and Ahn, S. Transdreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481.

[4] Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.

[5] Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122.

[6] Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse control tasks through world models. Nature, pp. 1–7, 2025a. Hafner, D., Yan, W., and Lillicrap, T. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025b. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[7] Li, L., Yang, R., and Luo, D. Focal: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. arXiv preprint arXiv:2010.01112.

[8] Micheli, V., Alonso, E., and Fleuret, F. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588.

[9] Rimon, Z., Jurgenson, T., Krupnik, O., Adler, G., and Tamar, A. Mamba: an effective world model approach for meta-reinforcement learning. arXiv preprint arXiv:2403.09859.

[10] Robine, J., Höftmann, M., Uelwer, T., and Harmeling, S. Transformer-based world models are happy with 100k interactions. arXiv preprint arXiv:2303.07109, 2023.

[11] Wang, J., Zhang, J., Jiang, H., Zhang, J., Wang, L., and Zhang, C. Offline meta reinforcement learning with in-distribution online adaptation. In International Conference on Machine Learning, pp. 36626–36669. PMLR.

[12] Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., and Lu, J. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European Conference on Computer Vision, pp. 55–72. Springer, 2024a. Wang, Y., He, J., Fan, L., Li, H., Chen, Y., and Zhang, Z. Driving into the future: Multiview visual forecasting and planning with world model for au...

[13] Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. arXiv preprint arXiv:1910.08348.

[14] We note that the absolute wall-clock time may vary across machines due to differences in CPU performance, memory bandwidth, data loading efficiency, and software configuration. Therefore, the table should be interpreted mainly as a relative comparison under the same hardware and implementation setting. Table 3. Training time comparison on a single RTX 4090...
