Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization
Pith reviewed 2026-05-08 13:49 UTC · model grok-4.3
The pith
GCRL and MISL are unified through control maximization: each goal-reaching formulation is matched by a skill-learning objective under which greater skill diversity yields greater downstream goal sensitivity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks.
What carries the argument
Control maximization, which frames goal-reaching and skill discovery as maximizing the sensitivity of future trajectories to a command (goal or skill identifier).
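For concreteness, one form of the sensitivity measure can be reconstructed from proof fragments visible in the extracted source; the symbols J, C, and p_goal follow those fragments, and this is a sketch rather than the paper's authoritative definition. Writing J(s, g, \pi_{g',\cdot}) for the goal-g performance from state s when the policy is commanded with goal g', the goal sensitivity of a goal-conditioned policy \pi_{\cdot,\cdot} is

\[
C(s,\pi_{\cdot,\cdot}) \;=\; \sum_{g} p_{\mathrm{goal}}(g)\, J(s,g,\pi_{g,\cdot}) \;-\; \sum_{g'} p_{\mathrm{goal}}(g') \sum_{g} p_{\mathrm{goal}}(g)\, J(s,g,\pi_{g',\cdot}),
\]

i.e., performance under matched commands minus expected performance under shuffled commands. C is large exactly when the future trajectory depends on which goal was commanded.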
If this is right
- The three canonical GCRL formulations can induce incompatible optimal policies even in the same environment.
- For every GCRL formulation there exists a matching MISL objective.
- More diverse skills afford greater downstream goal sensitivity under the matching MISL objective; see the inequalities sketched after this list for how sensitivity tracks goal-reaching performance.
- Pretraining objectives should be selected to align with the sensitivity definition of the intended downstream GCRL tasks.
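The diversity claims above rest on performance–sensitivity links whose rough shape can be transcribed from proof fragments visible in the extracted source. Treat the following as a hedged transcription, not the paper's verbatim theorem: P_e(\gamma), E_T(K), and OW(K, \gamma) label the three formulations as in those fragments, p_goal is the commanded-goal distribution with extremes p^min_goal and p^max_goal, and the regularity conditions are exactly what the referee below asks to be made explicit.

\[
p^{\min}_{\mathrm{goal}} + C(s,\pi_{\cdot,\cdot}) \;\le\; J(s,\pi_{\cdot,\cdot}) \;\le\; p^{\max}_{\mathrm{goal}} + C(s,\pi_{\cdot,\cdot}) \qquad \text{for } P_e(\gamma) \text{ and } E_T(K),
\]
\[
J(s,\pi_{\cdot,\cdot}) \;\ge\; \frac{C(s,\pi_{\cdot,\cdot})}{1 - p^{\min}_{\mathrm{goal}}} \qquad \text{for } OW(K,\gamma) \text{ and any formulation with non-negative rewards.}
\]

A consequence recorded in the fragments is 0 \le J^*(s) - J(s, \pi^{C^*}_{\cdot,\cdot}) \le p^{\max}_{\mathrm{goal}} - p^{\min}_{\mathrm{goal}} for P_e(\gamma) and E_T(K), so a maximally in-control policy is optimal when p_goal is uniform. Either way, raising the sensitivity C raises or sandwiches the goal-reaching performance J, which is the precise sense in which diversity "affords" downstream performance.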
Where Pith is reading between the lines
- The correspondence could guide selection of pretraining methods in applied settings such as robotics based on expected goal types.
- Hybrid objectives might be designed to cover multiple sensitivity measures for robustness across task classes.
- The bounding technique may extend to other unsupervised RL methods that optimize behavioral diversity.
Load-bearing premise
The sensitivity measures defined for MISL are directly comparable to those in each GCRL formulation in a way that permits the stated bounds to hold across general MDPs, without additional restrictions on the reward functions or policy classes.
What would settle it
A counterexample MDP where, for a given GCRL formulation, increasing diversity under the matching MISL objective does not increase the corresponding goal-sensitivity or violates the claimed bound.
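A check of this kind is cheap to run in tabular settings. Below is a minimal falsification harness in Python; everything specific in it is an assumption of this sketch, not the paper's construction: the random MDP, the horizon T, and the instantiation of J(s, g, \pi_{g',\cdot}) as the probability of occupying goal g at horizon T. It tests the P_e-style sandwich bound in the form transcribed above.

```python
# Hypothetical harness: sample goal-conditioned policies in a random
# finite MDP and check the sandwich bound p_min + C <= J <= p_max + C.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, T = 4, 2, 6                            # tiny MDP; goals are states

def random_mdp():
    P = rng.random((nS, nA, nS))               # unnormalized kernel
    return P / P.sum(axis=2, keepdims=True)    # P[s, a, :] is a distribution

def random_policy():
    pi = rng.random((nS, nS, nA))              # pi[g_cmd, s, :] over actions
    return pi / pi.sum(axis=2, keepdims=True)

def J(P, pi, s0, g, g_cmd):
    """Probability of occupying goal state g at horizon T from s0 under
    the policy commanded with g_cmd (one plausible reading of Pe-style J)."""
    K = np.einsum('sa,sap->sp', pi[g_cmd], P)  # induced state-to-state kernel
    d = np.zeros(nS)
    d[s0] = 1.0
    for _ in range(T):
        d = d @ K
    return d[g]

P = random_mdp()
p_goal = rng.dirichlet(np.ones(nS))            # commanded-goal distribution
s0 = 0
for trial in range(200):
    pi = random_policy()
    J_matched = sum(p_goal[g] * J(P, pi, s0, g, g) for g in range(nS))
    J_shuffled = sum(p_goal[gp] * p_goal[g] * J(P, pi, s0, g, gp)
                     for g in range(nS) for gp in range(nS))
    C = J_matched - J_shuffled                 # goal sensitivity ("control")
    lo, hi = p_goal.min() + C, p_goal.max() + C
    if not (lo - 1e-9 <= J_matched <= hi + 1e-9):
        print(f"candidate counterexample at trial {trial}:"
              f" J={J_matched:.4f} outside [{lo:.4f}, {hi:.4f}]")
        break
else:
    print("sandwich bound held on all sampled policies")
```

With this particular instantiation the sandwich holds by construction (the occupancy over goal states at any horizon sums to one), so the harness mainly illustrates the checking logic; a genuine counterexample hunt would have to target a different formulation or a different reading of J.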
Original abstract
Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper unifies goal-conditioned RL (GCRL) and mutual information skill learning (MISL) by framing both as instances of control maximization. It identifies three canonical GCRL formulations, proves they are inequivalent (inducing incompatible optimal policies in the same environment), and reinterprets each via a distinct notion of goal-sensitivity of future trajectories. It then shows that MISL objectives are bounded above by formulation-specific downstream goal-sensitivities, establishing a precise matching correspondence: for each GCRL formulation there exists an MISL objective such that greater skill diversity yields greater goal sensitivity.
Significance. If the sensitivity bounds hold with the stated generality, the work supplies a missing theoretical account for why unsupervised MISL pretraining aids downstream GCRL and offers a principled way to select pretraining objectives for a given class of goal-reaching tasks. The inequivalence result among GCRL formulations is itself a useful clarification. The control-maximization perspective and the explicit bounds constitute a substantive contribution beyond reinterpretation.
major comments (1)
- [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.
minor comments (2)
- [Introduction / Section 3] The three GCRL formulations are introduced without an explicit table or side-by-side comparison of their objective functions and optimal-policy characterizations; adding such a table would make the inequivalence claim easier to verify at a glance.
- [Preliminaries] Notation for the various sensitivity measures (goal-sensitivity vs. skill-sensitivity) is introduced piecemeal; a single consolidated definition table would reduce ambiguity when the bounds are stated.
Simulated Author's Rebuttal
We thank the referee for their positive summary, for recognizing the value of the inequivalence result and the control-maximization framing, and for the constructive request to clarify the sensitivity bounds. We address the single major comment below and will revise the manuscript to improve transparency.
Point-by-point responses
Referee: [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.
Authors: We appreciate the referee drawing attention to the need for explicit assumptions. The derivations appear in Appendix B of the full manuscript (not included in the excerpt). They are stated for finite MDPs with bounded rewards and hold for stochastic policies; no further restrictions on reward functions or dynamics are imposed beyond these. The abstract and main text do not claim the bounds for arbitrary continuous or infinite MDPs. In the revision we will (i) move the complete proof steps into the main body of the sensitivity-bounds section, (ii) open the section with an explicit list of assumptions, and (iii) add a clarifying remark that extensions to continuous settings require additional regularity conditions and are left for future work. This change will make the correspondence fully reproducible from the main text.
Revision: yes
Circularity Check
Reinterpretation of MISL objectives as skill-sensitivity bounds on GCRL goal-sensitivity without definitional reduction or fitted predictions
Full rationale
The paper derives a correspondence by defining goal-sensitivity for each of three inequivalent GCRL formulations and showing that MISL objectives are bounded by matching sensitivity measures. This establishes that more diverse skills improve downstream sensitivity for the corresponding formulation. The bounds follow from the paper's stated definitions of sensitivity in general MDPs rather than from any parameter fit, self-referential definition, or load-bearing self-citation. The central unification is therefore a reinterpretation of existing objectives through the sensitivity lens and remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard Markov decision process assumptions (states, actions, transition probabilities, reward functions, and policies); a minimal encoding is sketched below.
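For concreteness, here is a minimal and entirely hypothetical encoding of this domain assumption; the class and function names are illustrative, not drawn from the paper.

```python
# A finite MDP with states S, actions A, transition kernel P, bounded
# reward R, plus a one-step sampler under a stochastic policy.
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    P: np.ndarray   # shape (S, A, S): P[s, a, s'] = Pr(s' | s, a)
    R: np.ndarray   # shape (S, A): bounded per-step reward

    def __post_init__(self):
        S, A, S2 = self.P.shape
        assert S == S2, "kernel must map S x A back into S"
        assert np.allclose(self.P.sum(axis=2), 1.0), "rows must be distributions"
        assert self.R.shape == (S, A), "reward must be defined per (s, a)"

def step(mdp: FiniteMDP, s: int, pi: np.ndarray, rng: np.random.Generator):
    """One transition under a stochastic policy pi with shape (S, A)."""
    a = rng.choice(pi.shape[1], p=pi[s])
    s_next = rng.choice(mdp.P.shape[2], p=mdp.P[s, a])
    return s_next, mdp.R[s, a]
```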