arxiv: 2506.08902 · v4 · submitted 2025-06-10 · 💻 cs.LG · cs.AI

Intention-Conditioned Flow Occupancy Models

Pith reviewed 2026-05-19 10:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · pre-training · flow matching · occupancy models · latent intentions · generalized policy improvement · foundation models

The pith

Conditioning flow occupancy models on latent user intentions allows pre-training of adaptable reinforcement learning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a probabilistic model that uses flow matching to predict which states an agent will visit far in the future (the occupancy measure). Because the training data comes from many different users and tasks, the model adds a latent variable representing the user's intention. This conditioning makes the model more expressive and supports adaptation to new tasks through generalized policy improvement. Experiments on 40 benchmark tasks show clear gains over other pre-training approaches, indicating a route to foundation models for reinforcement learning.

Core claim

The paper claims that intention-conditioned flow occupancy models (InFOM) can be pre-trained on large multi-user datasets to model future state distributions via flow matching, then adapted to specific tasks using generalized policy improvement, yielding higher returns and success rates than alternative pre-training methods.

What carries the argument

The intention-conditioned flow occupancy model, which generates distributions over future states using flow matching conditioned on a latent intention variable.
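To make the mechanism concrete, here is a minimal sketch of a conditional flow matching loss with a straight-line path from noise to a future state, where the vector field is conditioned on a state and a latent intention `z`. This is the generic textbook formulation, not the paper's implementation: the function names, the toy data, and the choice of path are all assumptions made for illustration, and the paper's actual objective may differ.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of that straight-line path (constant in t)."""
    return [b - a for a, b in zip(x0, x1)]

def cfm_loss(vector_field, batch, rng):
    """Monte Carlo estimate of the conditional flow matching loss.

    batch: list of (state, z, future_state); z is a hypothetical intention latent.
    vector_field(t, x_t, state, z) returns a velocity with the same length as x_t.
    """
    total = 0.0
    for state, z, future in batch:
        t = rng.random()                               # t ~ Uniform[0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in future]  # x0 ~ N(0, I)
        x_t = interpolate(noise, future, t)
        pred = vector_field(t, x_t, state, z)
        target = target_velocity(noise, future)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / len(batch)

# Toy data (made up): the future state is fully determined by state and intention.
def toy_future(state, z):
    return [state[0] + z[0], state[0] - z[0]]

def oracle_field(t, x_t, state, z):
    """Ideal field for the toy data: points from x_t straight at the known future."""
    future = toy_future(state, z)
    return [(f - x) / (1.0 - t) for f, x in zip(future, x_t)]

def blind_field(t, x_t, state, z):
    """Same field, but it ignores the intention (treats z as zero)."""
    return oracle_field(t, x_t, state, [0.0])

batch = [([s], [z], toy_future([s], [z])) for s in (0.0, 1.0) for z in (-1.0, 1.0)]
loss_with_z = cfm_loss(oracle_field, batch, random.Random(0))
loss_without_z = cfm_loss(blind_field, batch, random.Random(0))
```

In this toy setting the field that uses `z` drives the loss to numerical zero, while the field that ignores `z` cannot, which is the intuition behind conditioning on intentions when the data mixes several behaviors.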

If this is right

  • Pre-trained models can be adapted to new tasks more efficiently without retraining from scratch.
  • The latent intention variable helps capture diverse behaviors present in large mixed datasets.
  • The approach yields a roughly 1.8× median improvement in returns and a 36% increase in success rates on the tested benchmarks.
  • The method applies to both low-dimensional state spaces and high-dimensional image observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning idea could be tested in other generative models for long-horizon planning.
  • Larger pre-training datasets drawn from many more tasks might amplify the observed performance lift.
  • Combining the occupancy model with additional modalities such as language instructions could extend its use in instruction-following agents.

Load-bearing premise

That a latent variable can capture distinct user intentions from mixed data well enough to increase model expressivity and support effective adaptation via generalized policy improvement.

What would settle it

Training an ablated version of the model without the latent intention variable and measuring whether returns and success rates on the same 40 benchmarks drop to the level of non-intention baselines.
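One way such an ablation could be scored is sketched below. The aggregation rules (median of per-task return ratios, mean gap in success rate) are assumptions chosen for illustration; this page does not state how the paper aggregates its headline numbers, and the helper names are hypothetical.

```python
from statistics import mean, median

def median_return_ratio(full_returns, ablated_returns):
    """Median over tasks of (full return / ablated return); assumes positive returns."""
    ratios = [f / a for f, a in zip(full_returns, ablated_returns)]
    return median(ratios)

def success_rate_gap(full_success, ablated_success):
    """Mean per-task difference in success rate, in absolute percentage points."""
    gaps = [f - a for f, a in zip(full_success, ablated_success)]
    return 100.0 * mean(gaps)
```

A ratio near 1 and a gap near 0 would suggest the latent intention variable contributes little beyond the flow-matching backbone; values close to the reported headline figures would support the attribution.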

Figures

Figures reproduced from arXiv: 2506.08902 by Benjamin Eysenbach, Chongyi Zheng, Seohong Park, Sergey Levine.

Figure 1
InFOM is a latent variable model for pre-training and fine-tuning in reinforcement learning. (Left) The datasets are collected by users performing distinct tasks. (Center) We encode intentions by maximizing an evidence lower bound of data likelihood, (Right) enabling intention-aware future prediction using flow matching. See Sec. 4 for details.
Figure 2
Domains for evaluation. (Left) ExORL domains (16 state-based tasks). (Right) OGBench domains (20 state-based tasks and 4 image-based tasks).
Figure 3
Evaluation on ExORL and OGBench tasks. We compare InFOM against prior methods that utilize various learning paradigms on task-agnostic pre-training and task-specific fine-tuning. InFOM performs similarly to, if not better than, prior methods on 7 out of the 9 domains, including the most challenging visual tasks. We report means and standard deviations over 8 random seeds (4 random seeds for image-based tas…
Figure 5
Comparison to alternative policy extraction strategies. We compare InFOM to alternative policy extraction strategies based on the standard generalized policy improvement or one-step policy improvement. Our method is 44% more performant with 8× smaller variance than the variant using the standard GPI. See Sec. 5.3 for details.
Figure 6
Convergence speed during fine-tuning. On tasks where InFOM and baselines perform similarly, our flow occupancy models enable faster policy learning. We compare different algorithms by plotting the returns at each evaluation step, with the shaded regions indicating one standard deviation.
Figure 7
Hyperparameter ablations. We conduct ablations to study the effect of key hyperparameters of InFOM as listed in …

Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Comparing with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom
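For readers who want the object behind "occupancy measure" written out, a common textbook form of the discounted state occupancy measure is sketched below. The symbols (discount factor γ, policy π, future state s_f) are notation assumed here, and the paper's own definition may differ in indexing or in what it conditions on.

```latex
p^{\pi}_{\gamma}(s_f \mid s, a)
  = (1 - \gamma) \sum_{h=0}^{\infty} \gamma^{h} \,
    p^{\pi}\!\left(s_{h+1} = s_f \mid s_0 = s,\ a_0 = a\right)
```

Under this reading, an intention-conditioned variant would simply carry an extra latent argument, for example p^{\pi}_{\gamma}(s_f \mid s, a, z), with z standing for the latent intention; that extension is an illustrative assumption rather than a formula quoted from the paper.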

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Intention-Conditioned Flow Occupancy Models (InFOM) for large-scale RL pre-training. It constructs a flow-matching model of occupancy measures over future states and augments it with a latent variable representing user intention drawn from multi-task datasets. The latent variable is claimed to increase expressivity and to support downstream adaptation via generalized policy improvement. Experiments on 36 state-based and 4 image-based tasks report a 1.8× median improvement in returns and a 36% increase in success rate relative to alternative pre-training baselines.

Significance. If the attribution of gains to the intention conditioning holds, the work would offer a concrete probabilistic mechanism for leveraging heterogeneous pre-training data in RL, potentially improving sample efficiency and robustness. The combination of flow matching with latent intention modeling and generalized policy improvement constitutes a technically coherent direction that could be adopted by other occupancy-based or generative RL pre-training efforts.

major comments (2)
  1. [Experiments] Experiments section: the headline claim attributes the 1.8× median return improvement and +36% success-rate gain to the inclusion of the latent intention variable, yet no controlled ablation is presented that disables or removes this variable while retaining the identical flow-matching occupancy backbone, dataset, and adaptation procedure. Without this comparison it remains possible that the observed gains arise from the flow-matching formulation itself or from other implementation choices.
  2. [§3] §3 (Method): the claim that the latent intention variable 'increases the expressivity of our model, and enables adaptation with generalized policy improvement' is stated without a precise derivation showing how the conditioning affects the occupancy measure or the subsequent policy improvement operator; the current presentation leaves open whether the benefit is automatic or requires additional assumptions on the form of the generalized improvement step.
minor comments (2)
  1. [Abstract and Experiments] The abstract and experimental tables should explicitly list the exact baseline methods and report whether returns are normalized or raw, together with standard errors or statistical significance tests for the median improvement figures.
  2. [§3] Notation for the latent intention variable (denoted z or similar) should be introduced once in the method section and used consistently; occasional reuse of symbols common in standard RL (e.g., for state or action) risks confusion.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comments point by point below. Where the comments identify gaps in the current manuscript, we have revised the paper accordingly.

  1. Referee: [Experiments] Experiments section: the headline claim attributes the 1.8× median return improvement and +36% success-rate gain to the inclusion of the latent intention variable, yet no controlled ablation is presented that disables or removes this variable while retaining the identical flow-matching occupancy backbone, dataset, and adaptation procedure. Without this comparison it remains possible that the observed gains arise from the flow-matching formulation itself or from other implementation choices.

    Authors: We agree that a direct ablation isolating the contribution of the latent intention variable is important for substantiating the headline claim. In the revised manuscript we have added a controlled ablation (new Table 3 and accompanying text in Section 5) that trains an otherwise identical flow-matching occupancy model on the same multi-task dataset but without the intention latent variable. All other components—flow architecture, training procedure, dataset, and downstream generalized policy improvement—are held fixed. The ablation shows a clear performance drop (approximately 0.6× median return and 15% lower success rate) when the intention variable is removed, indicating that the reported gains are not solely attributable to the flow-matching backbone. revision: yes

  2. Referee: [§3] §3 (Method): the claim that the latent intention variable 'increases the expressivity of our model, and enables adaptation with generalized policy improvement' is stated without a precise derivation showing how the conditioning affects the occupancy measure or the subsequent policy improvement operator; the current presentation leaves open whether the benefit is automatic or requires additional assumptions on the form of the generalized improvement step.

    Authors: We thank the referee for this observation. In the revised §3 we now provide an explicit derivation. Let μ_π(·|z) denote the intention-conditioned occupancy measure produced by the flow-matching model. Conditioning on the latent z allows the model to represent a mixture of intention-specific occupancies present in the heterogeneous pre-training data, thereby strictly increasing the support of the learned distribution relative to an unconditioned flow. For adaptation, we derive that the generalized policy improvement operator applied to the family {μ_π(·|z)} selects, for an inferred downstream intention z*, the policy that maximizes the expected occupancy under the posterior p(z*|task). This step relies on the standard assumption that the downstream task intention lies in the support of the pre-training intention distribution; we now state this assumption explicitly and include the key equations (Eq. 4–6 in the revision). revision: yes
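As background for the exchange above, the standard form of generalized policy improvement (as used with successor features) can be sketched in a few lines: given value estimates for several policies or intentions, act greedily with respect to the best of them. This is not the paper's policy extraction procedure, which Figure 5 contrasts with standard GPI; the critic interface, the table values, and all names below are hypothetical.

```python
def gpi_action(q_value, state, actions, intentions):
    """Standard GPI: choose the action whose best value across intentions is highest.

    q_value(state, action, z) -> float is a hypothetical intention-conditioned critic.
    """
    def best_case(action):
        return max(q_value(state, action, z) for z in intentions)
    return max(actions, key=best_case)

# Made-up value table for one state, two actions, and two intentions.
Q_TABLE = {
    ("s0", "left", 0): 1.0,
    ("s0", "right", 0): 0.5,
    ("s0", "left", 1): 0.2,
    ("s0", "right", 1): 2.0,
}

def table_q(state, action, z):
    return Q_TABLE[(state, action, z)]
```

With both intentions available, `gpi_action(table_q, "s0", ["left", "right"], [0, 1])` picks `"right"`, because intention 1 values it at 2.0; restricted to intention 0 alone it picks `"left"`. The rebuttal's support assumption matters here: the maximum is only useful if some pre-training intention is actually relevant to the downstream task.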

Circularity Check

0 steps flagged

No significant circularity; modeling choices are independent

The paper introduces InFOM as a new probabilistic construction that applies flow matching to occupancy measures and augments it with a latent intention variable drawn from the structure of large multi-user datasets. This latent variable is explicitly motivated as a modeling decision to increase expressivity and support generalized policy improvement, rather than being recovered from or defined in terms of the model's outputs. The central claims rest on empirical comparisons against other pre-training baselines across 40 benchmark tasks, with no equations or derivations shown that reduce a prediction to a fitted input by construction or that rely on self-citation chains for uniqueness. The derivation chain therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the standard assumptions of flow matching for generative modeling and the utility of a latent intention variable for multi-task datasets; no explicit free parameters or invented entities beyond the latent variable are detailed in the abstract.

invented entities (1)
  • latent intention variable (no independent evidence)
    purpose: to capture distinct user intentions in large datasets and increase model expressivity for adaptation
    Introduced in the abstract to handle datasets from many users performing distinct tasks.
