pith. machine review for the scientific record.

arxiv: 2605.06145 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Recognition: unknown

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY
keywords goal-conditioned reinforcement learning · mutual information skill learning · control maximization · trajectory sensitivity · unsupervised pretraining · reinforcement learning theory · behavioral diversity

The pith

GCRL and MISL are unified through control maximization: each goal-reaching formulation matches a skill-learning objective under which greater skill diversity yields greater downstream goal sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a theoretical link between goal-conditioned reinforcement learning and unsupervised mutual information skill learning by framing both as forms of control maximization. It first distinguishes three inequivalent GCRL formulations that measure goal-reaching success differently and can lead to different optimal policies. It then shows that MISL objectives serve as lower bounds on the goal-sensitivity measures associated with these formulations. This correspondence means that for any chosen GCRL task there is an aligned MISL method under which increasing skill diversity directly improves the ability to reach goals. The result explains the empirical success of pretraining and indicates which unsupervised method to use for a particular downstream problem.

Core claim

We unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks.

What carries the argument

Control maximization, which frames goal-reaching and skill discovery as maximizing the sensitivity of future trajectories to a command (goal or skill identifier).
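
To make the sensitivity notion concrete, here is a minimal rendering of the goal-sensitivity measure C(s, π{·,·}) that appears in the figure captions, assuming the matched-versus-mismatched-command form suggested by the paper's appendix; the per-formulation definitions in the paper may differ in detail.

```latex
% Schematic goal sensitivity (reconstructed, not the paper's verbatim definition):
% J(s, g, \pi_{\{g',\cdot\}}) is the goal-reaching performance from state s, evaluated
% for goal g, when the policy is commanded with goal g' drawn from p_goal.
\[
  J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    = \mathbb{E}_{g \sim p_{\mathrm{goal}}}\!\left[ J\big(s, g, \pi_{\{g,\cdot\}}\big) \right],
  \qquad
  J\big(s, \pi_{\{g',\cdot\}}\big)
    = \mathbb{E}_{g \sim p_{\mathrm{goal}}}\!\left[ J\big(s, g, \pi_{\{g',\cdot\}}\big) \right],
\]
\[
  C\big(s, \pi_{\{\cdot,\cdot\}}\big)
    = J\big(s, \pi_{\{\cdot,\cdot\}}\big)
      - \mathbb{E}_{g' \sim p_{\mathrm{goal}}}\!\left[ J\big(s, \pi_{\{g',\cdot\}}\big) \right].
\]
```

Read this way, a policy that ignores its commanded goal has C = 0 (the two terms coincide), while a policy whose future trajectory tracks the command keeps the first term above the second, matching panels B and C of Figure 4.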

If this is right

  • The three canonical GCRL formulations can induce incompatible optimal policies even in the same environment.
  • For every GCRL formulation there exists a matching MISL objective.
  • More diverse skills afford greater downstream goal sensitivity under the matching MISL objective (a toy illustration follows this list).
  • Pretraining objectives should be selected to align with the sensitivity definition of the intended downstream GCRL tasks.
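
A toy numerical illustration of that diversity–sensitivity link, under explicit assumptions: a hypothetical two-state, one-step environment, a uniform goal distribution, and a stand-in success criterion (final state equals the evaluated goal). None of this is the paper's construction; only the ordering matters.

```python
# Toy check: a policy that follows its commanded goal has positive goal sensitivity C,
# while a goal-ignoring policy has C = 0 (cf. Figure 4, panels B vs C).
import itertools

states = goals = [0, 1]
p_goal = {g: 0.5 for g in goals}              # uniform goal distribution (assumption)

def J(s, g, commanded_goal, policy):
    """Stand-in success criterion: 1 if the one-step final state equals the evaluated goal g."""
    return 1.0 if policy(s, commanded_goal) == g else 0.0

def goal_sensitivity(s, policy):
    # Performance when the commanded goal matches the goal being evaluated...
    matched = sum(p_goal[g] * J(s, g, g, policy) for g in goals)
    # ...minus performance when the commanded goal is drawn independently of it.
    mismatched = sum(p_goal[g] * p_goal[gp] * J(s, g, gp, policy)
                     for g, gp in itertools.product(goals, goals))
    return matched - mismatched

goal_following = lambda s, g: g               # one-step dynamics: the policy picks the next state
goal_ignoring = lambda s, g: 0                # always moves to state 0, ignoring the command

print(goal_sensitivity(0, goal_following))    # 0.5 — the trajectory is sensitive to the command
print(goal_sensitivity(0, goal_ignoring))     # 0.0 — no sensitivity to the command
```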

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The correspondence could guide selection of pretraining methods in applied settings such as robotics based on expected goal types.
  • Hybrid objectives might be designed to cover multiple sensitivity measures for robustness across task classes.
  • The bounding technique may extend to other unsupervised RL methods that optimize behavioral diversity.

Load-bearing premise

The sensitivity measures defined for MISL are directly comparable to those in each GCRL formulation in a way that permits the stated bounds to hold across general MDPs, without additional restrictions on the reward functions or policy classes.
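
For context, the appendix fragments visible in the extraction already suggest how formulation-specific this comparability is; the relations below are reconstructed from those fragments (with p_goal^min and p_goal^max the smallest and largest goal probabilities) and should be checked against the full proofs before being relied on.

```latex
% Reconstructed from appendix fragments; the hypotheses (finite S, bounded rewards, etc.)
% must be confirmed against the manuscript.
% For the Pe(gamma) and ET(K) formulations:
\[
  p_{\mathrm{goal}}^{\min} + C\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\le\; J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\le\; p_{\mathrm{goal}}^{\max} + C\big(s, \pi_{\{\cdot,\cdot\}}\big),
  \qquad
  0 \;\le\; J^{*}(s) - J\big(s, \pi^{C^{*}}_{\{\cdot,\cdot\}}\big)
    \;\le\; p_{\mathrm{goal}}^{\max} - p_{\mathrm{goal}}^{\min},
\]
% with equality when p_goal is uniform.  For OW(K, gamma) and any formulation with
% non-negative rewards R_t(s; g) >= 0:
\[
  J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\ge\; \frac{1}{1 - p_{\mathrm{goal}}^{\min}}\, C\big(s, \pi_{\{\cdot,\cdot\}}\big).
\]
```

So even before the MISL bounds enter, the tightness of the sensitivity–performance link varies with the formulation and the goal distribution, which is exactly where unstated restrictions would bite.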

What would settle it

A counterexample MDP where, for a given GCRL formulation, increasing diversity under the matching MISL objective does not increase the corresponding goal-sensitivity or violates the claimed bound.
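
A minimal sketch of how such a search could be run, under stated stand-ins: random one-step goal-conditioned policies in a small tabular environment, the goal–final-state mutual information I(G; S_final) as a proxy for the matching MISL objective, and the matched-minus-mismatched sensitivity from above. The paper's actual objectives (empowerment and specific MI variants) and its formulation-specific sensitivities would replace these; a pairwise mismatch found this way flags where to look rather than refuting the stated bound.

```python
# Counterexample search sketch (toy stand-ins, not the paper's objectives): sample random
# goal-conditioned policies, score each by a MISL-style diversity proxy I(G; S_final) and by
# goal sensitivity C, then look for pairs where diversity rises while sensitivity falls.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # states == goals; single start state
p_goal = np.full(n, 1.0 / n)                 # uniform goal distribution (assumption)

def random_policy():
    """policy[g, s'] = P(one-step final state s' | commanded goal g) from the start state."""
    p = np.exp(rng.normal(size=(n, n)))
    return p / p.sum(axis=1, keepdims=True)

def goal_sensitivity(policy):
    J = policy.T                             # J[g, g'] = P(final state == g | commanded g')
    matched = float(p_goal @ np.diag(J))     # E_g J(g, g): commanded goal matches evaluated goal
    mismatched = float(p_goal @ J @ p_goal)  # E_{g, g'} J(g, g'): commanded goal drawn independently
    return matched - mismatched

def diversity_proxy(policy):
    """Mutual information I(G; S_final) with G ~ p_goal, a stand-in for a MISL objective."""
    joint = p_goal[:, None] * policy
    marginal = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / (p_goal[:, None] * marginal[None, :]))))

scores = sorted((diversity_proxy(p), goal_sensitivity(p))
                for p in (random_policy() for _ in range(2000)))
flags = sum(1 for (m1, c1), (m2, c2) in zip(scores, scores[1:]) if m2 > m1 and c2 < c1)
print(f"{flags} adjacent policy pairs where the diversity proxy rose but sensitivity fell")
```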

Figures

Figures reproduced from arXiv: 2605.06145 by Alireza Modirshanechi, Benjamin Eysenbach, Eric Schulz, Peter Dayan.

Figure 1: We unify GCRL and MISL as control-maximization problems and prove a correspondence …
Figure 2: Different GCRL formulations yield incompatible optimal policies …
Figure 3: Equivalence conditions. Black edges indicate identical policy orderings; blue edges indicate …
Figure 4: Goal sensitivity C(s, π{·,·}) reflects both objective controllability C*(s) and agent competence. In uncontrollable environments, it is zero (A); in fully controllable environments, it depends on whether the policy ignores goals (B) or reliably selects goal-reaching actions (C).
Figure 5: The precise correspondence of the MISL objectives to the downstream GCRL performance …
Figure 6: Theoretical bounds linking goal-sensitivity to empowerment and goal-behavior MIs.
Figure 7: Counterexample environment (A) showing that different GCRL formulations can induce …
Figure 8: Counterexample showing that maximizing goal sensitivity …
original abstract

Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper unifies goal-conditioned RL (GCRL) and mutual information skill learning (MISL) by framing both as instances of control maximization. It identifies three canonical GCRL formulations, proves they are inequivalent (inducing incompatible optimal policies in the same environment), and reinterprets each via a distinct notion of goal-sensitivity of future trajectories. It then shows that MISL objectives are bounded above by formulation-specific downstream goal-sensitivities, establishing a precise matching correspondence: for each GCRL formulation there exists an MISL objective such that greater skill diversity yields greater goal sensitivity.

Significance. If the sensitivity bounds hold with the stated generality, the work supplies a missing theoretical account for why unsupervised MISL pretraining aids downstream GCRL and supplies a principled way to select pretraining objectives for a given class of goal-reaching tasks. The inequivalence result among GCRL formulations is itself a useful clarification. The control-maximization perspective and the explicit bounds constitute a substantive contribution beyond reinterpretation.

major comments (1)
  1. [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.
minor comments (2)
  1. [Introduction / Section 3] The three GCRL formulations are introduced without an explicit table or side-by-side comparison of their objective functions and optimal-policy characterizations; adding such a table would make the inequivalence claim easier to verify at a glance.
  2. [Preliminaries] Notation for the various sensitivity measures (goal-sensitivity vs. skill-sensitivity) is introduced piecemeal; a single consolidated definition table would reduce ambiguity when the bounds are stated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive summary, for recognizing the value of the inequivalence result and the control-maximization framing, and for the constructive request to clarify the sensitivity bounds. We address the single major comment below and will revise the manuscript to improve transparency.

point-by-point responses
  1. Referee: [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.

    Authors: We appreciate the referee drawing attention to the need for explicit assumptions. The derivations appear in Appendix B of the full manuscript (not included in the excerpt). They are stated for finite MDPs with bounded rewards and hold for stochastic policies; no further restrictions on reward functions or dynamics are imposed beyond these. The abstract and main text do not claim the bounds for arbitrary continuous or infinite MDPs. In the revision we will (i) move the complete proof steps into the main body of the sensitivity-bounds section, (ii) open the section with an explicit list of assumptions, and (iii) add a clarifying remark that extensions to continuous settings require additional regularity conditions and are left for future work. This change will make the correspondence fully reproducible from the main text. revision: yes

Circularity Check

0 steps flagged

Reinterpretation of MISL objectives as skill-sensitivity bounds on GCRL goal-sensitivity without definitional reduction or fitted predictions

full rationale

The paper derives a correspondence by defining goal-sensitivity for each of three inequivalent GCRL formulations and showing that MISL objectives are bounded by matching sensitivity measures. This establishes that more diverse skills improve downstream sensitivity for the corresponding formulation. The bounds follow from the paper's stated definitions of sensitivity in general MDPs rather than from any parameter fit, self-referential definition, or load-bearing self-citation. The central unification is therefore a reinterpretation of existing objectives through the sensitivity lens and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on standard MDP definitions and the new lens of control maximization; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Standard Markov Decision Process assumptions (states, actions, transition probabilities, reward functions, and policies).
    These underpin all definitions of trajectories, goal-conditioned policies, and sensitivity measures used in both GCRL and MISL; the standard objects are written out below.
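
For reference, the objects this axiom supplies, written out in the notation the review uses (a reminder of the standard setup, not the paper's exact definitions):

```latex
% Finite MDP tuple and goal-conditioned policies, as assumed by the domain axiom above.
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  P(s' \mid s, a) \in [0, 1], \qquad
  R_t(s; g) \in \mathbb{R}, \qquad
  \pi_{\{g,\cdot\}}(a \mid s) \ \text{for each commanded goal } g \in \mathcal{S}.
\]
```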

pith-pipeline@v0.9.0 · 5620 in / 1321 out tokens · 62728 ms · 2026-05-08T13:49:01.234178+00:00 · methodology

discussion (0)

