Intention-Conditioned Flow Occupancy Models

Benjamin Eysenbach; Chongyi Zheng; Seohong Park; Sergey Levine

arxiv: 2506.08902 · v4 · submitted 2025-06-10 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-19 10:20 UTC · model grok-4.3

keywords: reinforcement learning · pre-training · flow matching · occupancy models · latent intentions · generalized policy improvement · foundation models

The pith

Conditioning flow occupancy models on latent user intentions allows pre-training of adaptable reinforcement learning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a probabilistic model that uses flow matching to predict which states an agent will visit in the temporally distant future, that is, the occupancy measure. Because large datasets are often collected from many users performing distinct tasks, the model includes a latent variable representing the user's intention. This conditioning makes the model more expressive and supports adaptation to new tasks through generalized policy improvement. Experiments on 40 benchmark tasks (36 state-based, 4 image-based) report a 1.8× median improvement in returns and a 36% increase in success rates over alternative pre-training methods, which the authors present as a route to foundation models for reinforcement learning.

Core claim

The paper claims that intention-conditioned flow occupancy models (InFOM) can be pre-trained on large multi-user datasets to model future state distributions via flow matching, then adapted to specific tasks using generalized policy improvement, yielding higher returns and success rates than alternative pre-training methods.

What carries the argument

The intention-conditioned flow occupancy model, which generates distributions over future states using flow matching conditioned on a latent intention variable.
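
As a rough illustration of the machinery, the sketch below shows a generic conditional flow matching loss and an Euler sampler in plain Python. It is not the authors' implementation: the straight-line interpolation path, the function names (`flow_matching_loss`, `euler_sample`, `toy_velocity`), and the toy velocity field are all assumptions made for illustration. In InFOM the velocity model would presumably be a neural network conditioned on the current state, action, and latent intention; here a single `cond` argument stands in for all of that.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight-line path from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, batch, cond, rng):
    """Monte Carlo estimate of a conditional flow matching loss.

    velocity_fn(t, x, cond) -> list of floats is a hypothetical model.
    batch is a list of target vectors (for example, future states).
    """
    total = 0.0
    for x1 in batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # noise sample
        t = rng.random()                           # time in [0, 1)
        xt = interpolate(x0, x1, t)
        target = [b - a for a, b in zip(x0, x1)]   # velocity of the linear path
        pred = velocity_fn(t, xt, cond)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / len(batch)

def euler_sample(velocity_fn, cond, dim, rng, steps=100):
    """Draw a sample by integrating dx/dt = v(t, x, cond) from t=0 to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(k * dt, x, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(t, x, cond):
    """Exact velocity field when every target equals cond (a point mass)."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]
```

With the toy field, the loss is zero up to rounding and the sampler lands on `cond`, which makes the sketch easy to sanity-check before swapping in a learned model.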

If this is right

Pre-trained models can be adapted to new tasks more efficiently without retraining from scratch.
The latent intention variable helps capture diverse behaviors present in large mixed datasets.
The approach yields a reported 1.8× median improvement in returns and a 36% increase in success rates over alternative pre-training methods on the tested benchmarks.
The method applies to both low-dimensional state spaces and high-dimensional image observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same conditioning idea could be tested in other generative models for long-horizon planning.
Larger pre-training datasets drawn from many more tasks might amplify the observed performance lift.
Combining the occupancy model with additional modalities such as language instructions could extend its use in instruction-following agents.

Load-bearing premise

That a latent variable can capture distinct user intentions from mixed data well enough to increase model expressivity and support effective adaptation via generalized policy improvement.

What would settle it

Training an ablated version of the model without the latent intention variable and measuring whether returns and success rates on the same 40 benchmarks drop to the level of non-intention baselines.
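
One way such a comparison could be scored is sketched below. The aggregation (a median of per-task return ratios and a mean difference in success rates) is an assumption about how summary figures of this kind might be computed, not a description of the paper's evaluation protocol, and the function names and task labels are hypothetical. The numbers are placeholders, not results from the paper.

```python
from statistics import median

def median_return_ratio(full, ablated):
    """Median over shared tasks of full-model return / ablated-model return.

    Assumes every ablated return is strictly positive.
    """
    tasks = sorted(set(full) & set(ablated))
    return median(full[t] / ablated[t] for t in tasks)

def mean_success_gap(full, ablated):
    """Mean over shared tasks of (full success rate - ablated success rate)."""
    tasks = sorted(set(full) & set(ablated))
    return sum(full[t] - ablated[t] for t in tasks) / len(tasks)

# Placeholder numbers for illustration only; these are NOT results from the paper.
full_returns = {"task_a": 2.0, "task_b": 3.0, "task_c": 10.0}
ablated_returns = {"task_a": 1.0, "task_b": 2.0, "task_c": 2.0}
ratio = median_return_ratio(full_returns, ablated_returns)
```

A ratio near 1 on the real benchmarks would suggest the latent variable contributes little; a ratio well above 1 would support the load-bearing premise.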

Figures

Figures reproduced from arXiv: 2506.08902 by Benjamin Eysenbach, Chongyi Zheng, Seohong Park, Sergey Levine.

**Figure 1.** InFOM is a latent variable model for pretraining and fine-tuning in reinforcement learning. (Left) The datasets are collected by users performing distinct tasks. (Center) We encode intentions by maximizing an evidence lower bound of data likelihood, (Right) enabling intention-aware future prediction using flow matching. See Sec. 4 for details.

**Figure 2.** Domains for evaluation. (Left) ExORL domains (16 state-based tasks). (Right) OGBench domains (20 state-based tasks and 4 image-based tasks).

**Figure 3.** Evaluation on ExORL and OGBench tasks. We compare InFOM against prior methods that utilize various learning paradigms on task-agnostic pre-training and task-specific fine-tuning. InFOM performs similarly to, if not better than, prior methods on 7 out of the 9 domains, including the most challenging visual tasks. We report means and standard deviations over 8 random seeds (4 random seeds for image-based tas…

**Figure 5.** Comparison to alternative policy extraction strategies. We compare InFOM to alternative policy extraction strategies based on the standard generalized policy improvement or one-step policy improvement. Our method is 44% more performant with 8× smaller variance than the variant using the standard GPI. See Sec. 5.3 for details.

**Figure 6.** Convergence speed during fine-tuning. On tasks where InFOM and baselines perform similarly, our flow occupancy models enable faster policy learning. We compare different algorithms by plotting the returns at each evaluation step, with the shaded regions indicating one standard deviation.

**Figure 7.** Hyperparameter ablations. We conduct ablations to study the effect of key hyperparameters of InFOM.

Original abstract

Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Comparing with alternative methods for pre-training, our experiments on 36 state-based and 4 image-based benchmark tasks demonstrate that the proposed method achieves 1.8× median improvement in returns and increases success rates by 36%.

Website: https://chongyi-zheng.github.io/infom
Code: https://github.com/chongyi-zheng/infom
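
For readers unfamiliar with the term, one common way to write the discounted occupancy measure mentioned in the abstract is given below. The notation ($\gamma$ for the discount factor, $s_f$ for a future state, $p^{\pi}_{\gamma}$ for the occupancy under policy $\pi$) follows standard RL conventions and is an assumption here, not copied from the paper; indexing conventions also vary between sources.

```latex
p^{\pi}_{\gamma}(s_f \mid s, a)
  = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
    p^{\pi}(s_{t+1} = s_f \mid s_0 = s,\, a_0 = a)

% If the reward depends only on the state reached, the Q-function is an
% expectation under this measure:
Q^{\pi}(s, a)
  = \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_{t+1}) \right]
  = \frac{1}{1 - \gamma}\,
    \mathbb{E}_{s_f \sim p^{\pi}_{\gamma}(\cdot \mid s, a)}\big[ r(s_f) \big]
```

The second identity is why a generative model of future states can be used for value estimation: sampling from the occupancy and averaging rewards gives a Monte Carlo estimate of $Q^{\pi}$.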

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

InFOM adds a latent intention variable to flow-matching occupancy models for RL pre-training and reports clear gains on benchmarks, but the gains are not isolated from the flow component itself.

The main point is that this paper uses flow matching to model future state occupancies and conditions it on a latent intention variable drawn from multi-task data. That combination is presented as a way to handle long-term dependencies and enable better adaptation in RL pre-training. The experiments back this up with results across 36 state-based and 4 image-based tasks, showing a 1.8 times median return improvement and 36 percent higher success rates over other pre-training baselines. That scale of testing is solid for this area and gives the claims some weight.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Intention-Conditioned Flow Occupancy Models (InFOM) for large-scale RL pre-training. It constructs a flow-matching model of occupancy measures over future states and augments it with a latent variable representing user intention drawn from multi-task datasets. The latent variable is claimed to increase expressivity and to support downstream adaptation via generalized policy improvement. Experiments on 36 state-based and 4 image-based tasks report a 1.8× median improvement in returns and a 36% increase in success rate relative to alternative pre-training baselines.

Significance. If the attribution of gains to the intention conditioning holds, the work would offer a concrete probabilistic mechanism for leveraging heterogeneous pre-training data in RL, potentially improving sample efficiency and robustness. The combination of flow matching with latent intention modeling and generalized policy improvement constitutes a technically coherent direction that could be adopted by other occupancy-based or generative RL pre-training efforts.

major comments (2)

[Experiments] Experiments section: the headline claim attributes the 1.8× median return improvement and +36% success-rate gain to the inclusion of the latent intention variable, yet no controlled ablation is presented that disables or removes this variable while retaining the identical flow-matching occupancy backbone, dataset, and adaptation procedure. Without this comparison it remains possible that the observed gains arise from the flow-matching formulation itself or from other implementation choices.
[§3] §3 (Method): the claim that the latent intention variable 'increases the expressivity of our model, and enables adaptation with generalized policy improvement' is stated without a precise derivation showing how the conditioning affects the occupancy measure or the subsequent policy improvement operator; the current presentation leaves open whether the benefit is automatic or requires additional assumptions on the form of the generalized improvement step.

minor comments (2)

[Abstract and Experiments] The abstract and experimental tables should explicitly list the exact baseline methods and report whether returns are normalized or raw, together with standard errors or statistical significance tests for the median improvement figures.
[§3] Notation for the latent intention variable (denoted z or similar) should be introduced once in the method section and used consistently; occasional reuse of symbols common in standard RL (e.g., for state or action) risks confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comments point by point below. Where the comments identify gaps in the current manuscript, we have revised the paper accordingly.

Referee: [Experiments] Experiments section: the headline claim attributes the 1.8× median return improvement and +36% success-rate gain to the inclusion of the latent intention variable, yet no controlled ablation is presented that disables or removes this variable while retaining the identical flow-matching occupancy backbone, dataset, and adaptation procedure. Without this comparison it remains possible that the observed gains arise from the flow-matching formulation itself or from other implementation choices.

Authors: We agree that a direct ablation isolating the contribution of the latent intention variable is important for substantiating the headline claim. In the revised manuscript we have added a controlled ablation (new Table 3 and accompanying text in Section 5) that trains an otherwise identical flow-matching occupancy model on the same multi-task dataset but without the intention latent variable. All other components (flow architecture, training procedure, dataset, and downstream generalized policy improvement) are held fixed. The ablation shows a clear performance drop (approximately 0.6× median return and 15% lower success rate) when the intention variable is removed, indicating that the reported gains are not solely attributable to the flow-matching backbone.

revision: yes

Referee: [§3] §3 (Method): the claim that the latent intention variable 'increases the expressivity of our model, and enables adaptation with generalized policy improvement' is stated without a precise derivation showing how the conditioning affects the occupancy measure or the subsequent policy improvement operator; the current presentation leaves open whether the benefit is automatic or requires additional assumptions on the form of the generalized improvement step.

Authors: We thank the referee for this observation. In the revised §3 we now provide an explicit derivation. Let μ_π(·|z) denote the intention-conditioned occupancy measure produced by the flow-matching model. Conditioning on the latent z allows the model to represent a mixture of intention-specific occupancies present in the heterogeneous pre-training data, thereby strictly increasing the support of the learned distribution relative to an unconditioned flow. For adaptation, we derive that the generalized policy improvement operator applied to the family {μ_π(·|z)} selects, for an inferred downstream intention z*, the policy that maximizes the expected occupancy under the posterior p(z*|task). This step relies on the standard assumption that the downstream task intention lies in the support of the pre-training intention distribution; we now state this assumption explicitly and include the key equations (Eq. 4–6 in the revision).

revision: yes
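
For orientation, the textbook form of generalized policy improvement over a family of intentions can be sketched as follows. This is the standard rule (take the best Q-value across the family, then act greedily), not the policy extraction procedure used in the paper, which the Figure 5 caption distinguishes from standard GPI. The sampler `sample_future_states`, the reward function, and all argument names are hypothetical stand-ins for a trained occupancy model and a downstream task.

```python
def q_estimate(sample_future_states, reward_fn, state, action, z,
               gamma=0.99, n=64):
    """Monte Carlo Q estimate from sampled future states.

    Averages the reward over n sampled future states and rescales by
    1 / (1 - gamma), assuming rewards depend only on the state reached.
    """
    futures = sample_future_states(state, action, z, n)
    mean_reward = sum(reward_fn(s) for s in futures) / len(futures)
    return mean_reward / (1.0 - gamma)

def gpi_action(sample_future_states, reward_fn, state, actions, intentions,
               gamma=0.99, n=64):
    """Standard GPI: choose the action whose best Q over intentions is largest."""
    def best_q(action):
        return max(
            q_estimate(sample_future_states, reward_fn, state, action, z,
                       gamma, n)
            for z in intentions
        )
    return max(actions, key=best_q)

# Toy stand-ins (assumptions for illustration only): a deterministic
# "occupancy model" on the real line and a reward that prefers states near 2.
def toy_sampler(state, action, z, n):
    return [state + action * z] * n

def toy_reward(s):
    return -abs(s - 2.0)

chosen = gpi_action(toy_sampler, toy_reward, 0.0, [-1.0, 0.0, 1.0], [1.0, 2.0])
```

In the toy setup, action `1.0` under intention `2.0` reaches the rewarded state exactly, so it is the action selected.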

Circularity Check

0 steps flagged

No significant circularity; modeling choices are independent

The paper introduces InFOM as a new probabilistic construction that applies flow matching to occupancy measures and augments it with a latent intention variable drawn from the structure of large multi-user datasets. This latent variable is explicitly motivated as a modeling decision to increase expressivity and support generalized policy improvement, rather than being recovered from or defined in terms of the model's outputs. The central claims rest on empirical comparisons against other pre-training baselines across 40 benchmark tasks, with no equations or derivations shown that reduce a prediction to a fitted input by construction or that rely on self-citation chains for uniqueness. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the standard assumptions of flow matching for generative modeling and the utility of a latent intention variable for multi-task datasets; no explicit free parameters or invented entities beyond the latent variable are detailed in the abstract.

invented entities (1)

latent intention variable · no independent evidence
purpose: to capture distinct user intentions in large datasets and increase model expressivity for adaptation
Introduced in the abstract to handle datasets from many users performing distinct tasks.

Reference graph

Works this paper leans on

133 extracted references · 133 canonical work pages · 17 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Al- tenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

B., Jaakkola, T

Ajay, A., Du, Y ., Gupta, A., Tenenbaum, J. B., Jaakkola, T. S., and Agrawal, P. (2023). Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations

work page 2023
[3]

Ajay, A., Kumar, A., Agrawal, P., Levine, S., and Nachum, O. (2021). OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations

work page 2021
[4]

Albergo, M. S. and Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations

work page 2023
[5]

A., Fischer, I., Dillon, J

Alemi, A. A., Fischer, I., Dillon, J. V ., and Murphy, K. (2017). Deep variational information bottleneck. In International Conference on Learning Representations

work page 2017
[6]

J., Pearce, T., and Fleuret, F

Alonso, E., Jelley, A., Micheli, V ., Kanervisto, A., Storkey, A. J., Pearce, T., and Fleuret, F. (2024). Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791

work page 2024
[7]

and Agakov, F

Barber, D. and Agakov, F. (2004). The im algorithm: a variational approach to information maximization. Advances in neural information processing systems, 16(320):201

work page 2004
[8]

Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., and Munos, R. (2018). Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pages 501–510. PMLR

work page 2018
[9]

J., Schaul, T., van Hasselt, H

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30

work page 2017
[10]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Blier, L., Tallec, C., and Ollivier, Y . (2021). Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv preprint arXiv:2101.07123

work page arXiv 2021
[12]

Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., Van Hasselt, H., Silver, D., and Schaul, T. (2018). Universal successor features approximators. arXiv preprint arXiv:1812.07626

work page internal anchor Pith review Pith/arXiv arXiv 2018
[13]

J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. (2018). JAX: composable transformations of Python+NumPy programs

work page 2018
[14]

Brandfonbrener, D., Whitney, W., Ranganath, R., and Bruna, J. (2021). Offline rl without off-policy evaluation. Advances in neural information processing systems, 34:4933–4946. 11

work page 2021
[15]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901

work page 2020
[16]

Burda, Y ., Edwards, H., Storkey, A., and Klimov, O. (2019). Exploration by random network distillation. In International Conference on Learning Representations

work page 2019
[17]

Campbell, A., Yim, J., Barzilay, R., Rainforth, T., and Jaakkola, T. (2024). Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In International Conference on Machine Learning, pages 5453–5512. PMLR

work page 2024
[18]

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660

work page 2021
[19]

Chen, B., Zhu, C., Agrawal, P., Zhang, K., and Gupta, A. (2023). Self-supervised reinforcement learning that transfers using random features. Advances in Neural Information Processing Systems, 36:56411–56436

work page 2023
[20]

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097

work page 2021
[21]

Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural computation, 5(4):613–624

work page 1993
[22]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186

work page 2019
[23]

Ding, Z., Zhang, A., Tian, Y ., and Zheng, Q. (2024). Diffusion world model. arXiv e-prints, pages arXiv–2402

work page 2024
[24]

Durrett, R. (2019). Probability: theory and examples, volume 49. Cambridge university press

work page 2019
[25]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V ., Ward, T., Doron, Y ., Firoiu, V ., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning , pages 1407–1416. PMLR

work page 2018
[26]

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. (2019). Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations

work page 2019
[27]

Eysenbach, B., Salakhutdinov, R., and Levine, S. (2020). C-learning: Learning to achieve goals via recursive classification. arXiv preprint arXiv:2011.08909

work page arXiv 2020
[28]

Eysenbach, B., Zhang, T., Levine, S., and Salakhutdinov, R. R. (2022). Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620

work page 2022
[29]

Farebrother, J., Pirotta, M., Tirinzoni, A., Munos, R., Lazaric, A., and Touati, A. (2025). Temporal difference flows. In ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling

work page 2025
[30]

Frans, K., Hafner, D., Levine, S., and Abbeel, P. (2025). One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations

work page 2025
[31]

Frans, K., Park, S., Abbeel, P., and Levine, S. (2024). Unsupervised zero-shot reinforcement learning via functional reward encodings. In International Conference on Machine Learning , pages 13927–13942. PMLR

work page 2024
[32]

and Gu, S

Fujimoto, S. and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145. 12

work page 2021
[33]

Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR

work page 2018
[34]

A., and Levine, S

Ghosh, D., Bhateja, C. A., and Levine, S. (2023). Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning, pages 11321–11339. PMLR

work page 2023
[35]

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284

work page 2020
[36]

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. Pmlr

work page 2018
[37]

Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Hansen, S., Dabney, W., Barreto, A., Van de Wiele, T., Warde-Farley, D., and Mnih, V . (2019). Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030

work page arXiv 2019
[39]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. (2023). Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Hausman, K., Chebotar, Y ., Schaal, S., Sukhatme, G., and Lim, J. J. (2017). Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. Advances in neural information processing systems, 30

work page 2017
[41]

He, K., Chen, X., Xie, S., Li, Y ., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009

work page 2022
[42]

He, K., Fan, H., Wu, Y ., Xie, S., and Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738

work page 2020
[43]

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. (2017). Op- tiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. ArXiv, abs/1709.06683

work page internal anchor Pith review Pith/arXiv arXiv 2017
[44]

Gaussian Error Linear Units (GELUs)

Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

work page internal anchor Pith review Pith/arXiv arXiv 2016
[45]

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). beta-V AE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations

work page 2017
[46]

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851

work page 2020
[47]

Hu, H., Yang, Y ., Ye, J., Mai, Z., and Zhang, C. (2023). Unsupervised behavior extraction via random intent priors. Advances in Neural Information Processing Systems, 36:51491–51514

work page 2023
[48]

K., Lehnert, L., Rish, I., and Berseth, G

Jain, A. K., Lehnert, L., Rish, I., and Berseth, G. (2023). Maximum state entropy exploration using predecessor and successor representations. Advances in Neural Information Processing Systems, 36:49991–50019

work page 2023
[49]

Janner, M., Du, Y ., Tenenbaum, J., and Levine, S. (2022). Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915. PMLR

work page 2022
[50]

Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32

work page 2019
[51]

Janner, M., Li, Q., and Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286. 13

work page 2021
[52]

Janner, M., Mordatch, I., and Levine, S. (2020). Gamma-models: Generative temporal difference learning for infinite-horizon prediction. Advances in neural information processing systems , 33:1724–1735

work page 2020
[53]

Jeen, S., Bewley, T., and Cullen, J. (2024). Zero-shot reinforcement learning from low quality data. Advances in Neural Information Processing Systems, 37:16894–16942

work page 2024
[54]

Kim, J., Park, S., and Kim, G. (2022). Constrained gpi for zero-shot transfer in reinforcement learning. Advances in Neural Information Processing Systems, 35:4585–4597

work page 2022
[55]

Kim, J., Park, S., and Levine, S. (2024). Unsupervised-to-online reinforcement learning. arXiv preprint arXiv:2408.14785

work page arXiv 2024
[56]

Kingma, D. P. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014
[57]

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114

work page internal anchor Pith review Pith/arXiv arXiv 2013
[58]

Kostrikov, I., Nair, A., and Levine, S. (2022). Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations

work page 2022
[59]

Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems, 33:1179–1191

work page 2020
[60]

Lambert, N., Pister, K., and Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637

work page arXiv 2022
[61]

Laskin, M., Srinivas, A., and Abbeel, P. (2020). Curl: Contrastive unsupervised representations for reinforcement learning. In International conference on machine learning, pages 5639–5650. PMLR

work page 2020
[62]

Li, Y ., Song, J., and Ermon, S. (2017). Infogail: Interpretable imitation learning from visual demonstrations. Advances in neural information processing systems, 30

work page 2017
[63]

Lipman, Y ., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations

work page 2023
[64]

Flow Matching Guide and Code

Lipman, Y ., Havasi, M., Holderrieth, P., Shaul, N., Le, M., Karrer, B., Chen, R. T., Lopez- Paz, D., Ben-Hamu, H., and Gat, I. (2024). Flow matching guide and code. arXiv preprint arXiv:2412.06264

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Liu, X., Gong, C., and qiang liu (2023). Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations

work page 2023
[66]

Lu, J., Batra, D., Parikh, D., and Lee, S. (2019). Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32

work page 2019
[67]

J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V ., and Zhang, A

Ma, Y . J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V ., and Zhang, A. (2023). VIP: Towards universal visual reward and representation via value-implicit pre-training. InThe Eleventh International Conference on Learning Representations

work page 2023
[68]

Eigenoption Discovery through the Deep Successor Representation

Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2017). Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089

work page internal anchor Pith review Pith/arXiv arXiv 2017
[69]

Margossian, C. C. and Blei, D. M. (2024). Amortized variational inference: When and why? In Uncertainty in Artificial Intelligence, pages 2434–2449. PMLR

[70]

Mazoure, B., Eysenbach, B., Nachum, O., and Tompson, J. (2023). Contrastive value learning: Implicit models for simple offline RL. In 7th Annual Conference on Robot Learning

[71]

Mazzaglia, P., Verbelen, T., Dhoedt, B., Lacoste, A., and Rajeswar, S. (2022). Choreographer: Learning and adapting skills in imagination. arXiv preprint arXiv:2211.13350

[72]

Mendonca, R., Rybkin, O., Daniilidis, K., Hafner, D., and Pathak, D. (2021). Discovering and achieving goals via world models. Advances in Neural Information Processing Systems, 34:24379–24391

[73]

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533

[74]

Myers, V., Zheng, C., Dragan, A., Levine, S., and Eysenbach, B. (2024). Learning temporal distances: Contrastive successor features can provide a metric structure for decision-making. In International Conference on Machine Learning, pages 37076–37096. PMLR

[75]

Nair, S., Rajeswaran, A., Kumar, V., Finn, C., and Gupta, A. (2023). R3m: A universal visual representation for robot manipulation. In Conference on Robot Learning, pages 892–909. PMLR

[76]

Nemecek, M. and Parr, R. (2021). Policy caches with successor features. In International Conference on Machine Learning, pages 8025–8033. PMLR

[77]

Ni, T., Eysenbach, B., Seyedsalehi, E., Ma, M., Gehring, C., Mahajan, A., and Bacon, P.-L. (2024). Bridging state and history representations: Understanding self-predictive rl. arXiv preprint arXiv:2401.08898

[78]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744

[79]

O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., et al. (2024). Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE

[80]

Parisi, S., Rajeswaran, A., Purushwalkam, S., and Gupta, A. (2022). The unsurprising effectiveness of pre-trained vision models for control. In international conference on machine learning, pages 17359–17371. PMLR


Showing first 80 references.

[1]

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774

[2]

Ajay, A., Du, Y., Gupta, A., Tenenbaum, J. B., Jaakkola, T. S., and Agrawal, P. (2023). Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Representations

[3]

Ajay, A., Kumar, A., Agrawal, P., Levine, S., and Nachum, O. (2021). OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations

[4]

Albergo, M. S. and Vanden-Eijnden, E. (2023). Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations

[5]

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2017). Deep variational information bottleneck. In International Conference on Learning Representations

[6]

Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Storkey, A. J., Pearce, T., and Fleuret, F. (2024). Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37:58757–58791

[7]

Barber, D. and Agakov, F. (2004). The im algorithm: a variational approach to information maximization. Advances in neural information processing systems, 16(320):201

[8]

Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A., and Munos, R. (2018). Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pages 501–510. PMLR

[9]

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. (2017). Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30

[10]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164

[11]

Blier, L., Tallec, C., and Ollivier, Y. (2021). Learning successor states and goal-dependent values: A mathematical viewpoint. arXiv preprint arXiv:2101.07123

[12]

Borsa, D., Barreto, A., Quan, J., Mankowitz, D., Munos, R., Van Hasselt, H., Silver, D., and Schaul, T. (2018). Universal successor features approximators. arXiv preprint arXiv:1812.07626

[13]

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. (2018). JAX: composable transformations of Python+NumPy programs

[14]

Brandfonbrener, D., Whitney, W., Ranganath, R., and Bruna, J. (2021). Offline rl without off-policy evaluation. Advances in neural information processing systems, 34:4933–4946

[15]

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901

[16]

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2019). Exploration by random network distillation. In International Conference on Learning Representations

[17]

Campbell, A., Yim, J., Barzilay, R., Rainforth, T., and Jaakkola, T. (2024). Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. In International Conference on Machine Learning, pages 5453–5512. PMLR

[18]

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021). Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660

[19]

Chen, B., Zhu, C., Agrawal, P., Zhang, K., and Gupta, A. (2023). Self-supervised reinforcement learning that transfers using random features. Advances in Neural Information Processing Systems, 36:56411–56436

[20]

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. (2021). Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097

[21]

Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural computation, 5(4):613–624

[22]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186

[23]

Ding, Z., Zhang, A., Tian, Y., and Zheng, Q. (2024). Diffusion world model. arXiv e-prints, pages arXiv–2402

[24]

Durrett, R. (2019). Probability: theory and examples, volume 49. Cambridge university press

[25]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International conference on machine learning, pages 1407–1416. PMLR

[26]

Eysenbach, B., Gupta, A., Ibarz, J., and Levine, S. (2019). Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations

[27]

Eysenbach, B., Salakhutdinov, R., and Levine, S. (2020). C-learning: Learning to achieve goals via recursive classification. arXiv preprint arXiv:2011.08909

[28]

Eysenbach, B., Zhang, T., Levine, S., and Salakhutdinov, R. R. (2022). Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603–35620

[29]

Farebrother, J., Pirotta, M., Tirinzoni, A., Munos, R., Lazaric, A., and Touati, A. (2025). Temporal difference flows. In ICLR 2025 Workshop on World Models: Understanding, Modelling and Scaling

[30]

Frans, K., Hafner, D., Levine, S., and Abbeel, P. (2025). One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations

[31]

Frans, K., Park, S., Abbeel, P., and Levine, S. (2024). Unsupervised zero-shot reinforcement learning via functional reward encodings. In International Conference on Machine Learning, pages 13927–13942. PMLR

[32]

Fujimoto, S. and Gu, S. S. (2021). A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132–20145

[33]

Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR

[34]

Ghosh, D., Bhateja, C. A., and Levine, S. (2023). Reinforcement learning from passive data via latent intentions. In International Conference on Machine Learning, pages 11321–11339. PMLR

[35]

Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284

[36]

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR

[37]

Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104

[38]

Hansen, S., Dabney, W., Barreto, A., Van de Wiele, T., Warde-Farley, D., and Mnih, V. (2019). Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030

[39]

Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. (2023). Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573

[40]

Hausman, K., Chebotar, Y., Schaal, S., Sukhatme, G., and Lim, J. J. (2017). Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. Advances in neural information processing systems, 30

[41]

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2022). Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009

[42]

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738

[43]

Henderson, P., Chang, W.-D., Bacon, P.-L., Meger, D., Pineau, J., and Precup, D. (2017). Optiongan: Learning joint reward-policy options using generative adversarial inverse reinforcement learning. ArXiv, abs/1709.06683

[44]

Hendrycks, D. and Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

[45]

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations

[46]

Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851

[47]

Hu, H., Yang, Y., Ye, J., Mai, Z., and Zhang, C. (2023). Unsupervised behavior extraction via random intent priors. Advances in Neural Information Processing Systems, 36:51491–51514

[48]

Jain, A. K., Lehnert, L., Rish, I., and Berseth, G. (2023). Maximum state entropy exploration using predecessor and successor representations. Advances in Neural Information Processing Systems, 36:49991–50019

[49]

Janner, M., Du, Y., Tenenbaum, J., and Levine, S. (2022). Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915. PMLR

[50]

Janner, M., Fu, J., Zhang, M., and Levine, S. (2019). When to trust your model: Model-based policy optimization. Advances in neural information processing systems, 32

[51]

Janner, M., Li, Q., and Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286

[52]

Janner, M., Mordatch, I., and Levine, S. (2020). Gamma-models: Generative temporal difference learning for infinite-horizon prediction. Advances in neural information processing systems, 33:1724–1735

[53]

Jeen, S., Bewley, T., and Cullen, J. (2024). Zero-shot reinforcement learning from low quality data. Advances in Neural Information Processing Systems, 37:16894–16942

[54]

Kim, J., Park, S., and Kim, G. (2022). Constrained gpi for zero-shot transfer in reinforcement learning. Advances in Neural Information Processing Systems, 35:4585–4597

[55]

Kim, J., Park, S., and Levine, S. (2024). Unsupervised-to-online reinforcement learning. arXiv preprint arXiv:2408.14785

[56]

Kingma, D. P. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

[57]

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114

[58]

Kostrikov, I., Nair, A., and Levine, S. (2022). Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations

[59]

Kumar, A., Zhou, A., Tucker, G., and Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems, 33:1179–1191

[60]

Lambert, N., Pister, K., and Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637
