pith. machine review for the scientific record.

arxiv: 2605.06145 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Recognition: unknown

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY
keywords goal-conditioned reinforcement learning · mutual information skill learning · control maximization · trajectory sensitivity · unsupervised pretraining · reinforcement learning theory · behavioral diversity

The pith

GCRL and MISL are unified through control maximization: each goal-reaching formulation matches a skill-learning objective under which greater skill diversity yields greater downstream goal sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a theoretical link between goal-conditioned reinforcement learning and unsupervised mutual information skill learning by framing both as forms of control maximization. It first distinguishes three inequivalent GCRL formulations that measure goal-reaching success differently and can lead to different optimal policies. It then shows that MISL objectives serve as lower bounds on the goal-sensitivity measures associated with these formulations. This correspondence means that for any chosen GCRL task there is an aligned MISL method under which increasing skill diversity directly improves the ability to reach goals. The result explains the empirical success of pretraining and indicates which unsupervised method to use for a particular downstream problem.

Core claim

We unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks.

What carries the argument

Control maximization, which frames goal-reaching and skill discovery as maximizing the sensitivity of future trajectories to a command (goal or skill identifier).
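
To make the sensitivity notion concrete, here is a minimal rendering of the goal-sensitivity measure C(s, π{·,·}) that appears in the figure captions, assuming the matched-versus-mismatched-command form suggested by the paper's appendix; the per-formulation definitions in the paper may differ in detail.

```latex
% Schematic goal sensitivity (reconstructed, not the paper's verbatim definition):
% J(s, g, \pi_{\{g',\cdot\}}) is the goal-reaching performance from state s, evaluated
% for goal g, when the policy is commanded with goal g' drawn from p_goal.
\[
  J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    = \mathbb{E}_{g \sim p_{\mathrm{goal}}}\!\left[ J\big(s, g, \pi_{\{g,\cdot\}}\big) \right],
  \qquad
  J\big(s, \pi_{\{g',\cdot\}}\big)
    = \mathbb{E}_{g \sim p_{\mathrm{goal}}}\!\left[ J\big(s, g, \pi_{\{g',\cdot\}}\big) \right],
\]
\[
  C\big(s, \pi_{\{\cdot,\cdot\}}\big)
    = J\big(s, \pi_{\{\cdot,\cdot\}}\big)
      - \mathbb{E}_{g' \sim p_{\mathrm{goal}}}\!\left[ J\big(s, \pi_{\{g',\cdot\}}\big) \right].
\]
```

Read this way, a policy that ignores its commanded goal has C = 0 (the two terms coincide), while a policy whose future trajectory tracks the command keeps the first term above the second, matching panels B and C of Figure 4.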

If this is right

  • The three canonical GCRL formulations can induce incompatible optimal policies even in the same environment.
  • For every GCRL formulation there exists a matching MISL objective.
  • More diverse skills afford greater downstream goal sensitivity under the matching MISL objective (a toy illustration follows this list).
  • Pretraining objectives should be selected to align with the sensitivity definition of the intended downstream GCRL tasks.
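
A toy numerical illustration of that diversity–sensitivity link, under explicit assumptions: a hypothetical two-state, one-step environment, a uniform goal distribution, and a stand-in success criterion (final state equals the evaluated goal). None of this is the paper's construction; only the ordering matters.

```python
# Toy check: a policy that follows its commanded goal has positive goal sensitivity C,
# while a goal-ignoring policy has C = 0 (cf. Figure 4, panels B vs C).
import itertools

states = goals = [0, 1]
p_goal = {g: 0.5 for g in goals}              # uniform goal distribution (assumption)

def J(s, g, commanded_goal, policy):
    """Stand-in success criterion: 1 if the one-step final state equals the evaluated goal g."""
    return 1.0 if policy(s, commanded_goal) == g else 0.0

def goal_sensitivity(s, policy):
    # Performance when the commanded goal matches the goal being evaluated...
    matched = sum(p_goal[g] * J(s, g, g, policy) for g in goals)
    # ...minus performance when the commanded goal is drawn independently of it.
    mismatched = sum(p_goal[g] * p_goal[gp] * J(s, g, gp, policy)
                     for g, gp in itertools.product(goals, goals))
    return matched - mismatched

goal_following = lambda s, g: g               # one-step dynamics: the policy picks the next state
goal_ignoring = lambda s, g: 0                # always moves to state 0, ignoring the command

print(goal_sensitivity(0, goal_following))    # 0.5 — the trajectory is sensitive to the command
print(goal_sensitivity(0, goal_ignoring))     # 0.0 — no sensitivity to the command
```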

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The correspondence could guide selection of pretraining methods in applied settings such as robotics based on expected goal types.
  • Hybrid objectives might be designed to cover multiple sensitivity measures for robustness across task classes.
  • The bounding technique may extend to other unsupervised RL methods that optimize behavioral diversity.

Load-bearing premise

The sensitivity measures defined for MISL are directly comparable to those in each GCRL formulation in a way that permits the stated bounds to hold across general MDPs, without additional restrictions on the reward functions or policy classes.
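
For context, the appendix fragments visible in the extraction already suggest how formulation-specific this comparability is; the relations below are reconstructed from those fragments (with p_goal^min and p_goal^max the smallest and largest goal probabilities) and should be checked against the full proofs before being relied on.

```latex
% Reconstructed from appendix fragments; the hypotheses (finite S, bounded rewards, etc.)
% must be confirmed against the manuscript.
% For the Pe(gamma) and ET(K) formulations:
\[
  p_{\mathrm{goal}}^{\min} + C\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\le\; J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\le\; p_{\mathrm{goal}}^{\max} + C\big(s, \pi_{\{\cdot,\cdot\}}\big),
  \qquad
  0 \;\le\; J^{*}(s) - J\big(s, \pi^{C^{*}}_{\{\cdot,\cdot\}}\big)
    \;\le\; p_{\mathrm{goal}}^{\max} - p_{\mathrm{goal}}^{\min},
\]
% with equality when p_goal is uniform.  For OW(K, gamma) and any formulation with
% non-negative rewards R_t(s; g) >= 0:
\[
  J\big(s, \pi_{\{\cdot,\cdot\}}\big)
    \;\ge\; \frac{1}{1 - p_{\mathrm{goal}}^{\min}}\, C\big(s, \pi_{\{\cdot,\cdot\}}\big).
\]
```

So even before the MISL bounds enter, the tightness of the sensitivity–performance link varies with the formulation and the goal distribution, which is exactly where unstated restrictions would bite.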

What would settle it

A counterexample MDP where, for a given GCRL formulation, increasing diversity under the matching MISL objective does not increase the corresponding goal-sensitivity or violates the claimed bound.
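
A minimal sketch of how such a search could be run, under stated stand-ins: random one-step goal-conditioned policies in a small tabular environment, the goal–final-state mutual information I(G; S_final) as a proxy for the matching MISL objective, and the matched-minus-mismatched sensitivity from above. The paper's actual objectives (empowerment and specific MI variants) and its formulation-specific sensitivities would replace these; a pairwise mismatch found this way flags where to look rather than refuting the stated bound.

```python
# Counterexample search sketch (toy stand-ins, not the paper's objectives): sample random
# goal-conditioned policies, score each by a MISL-style diversity proxy I(G; S_final) and by
# goal sensitivity C, then look for pairs where diversity rises while sensitivity falls.
import numpy as np

rng = np.random.default_rng(0)
n = 4                                        # states == goals; single start state
p_goal = np.full(n, 1.0 / n)                 # uniform goal distribution (assumption)

def random_policy():
    """policy[g, s'] = P(one-step final state s' | commanded goal g) from the start state."""
    p = np.exp(rng.normal(size=(n, n)))
    return p / p.sum(axis=1, keepdims=True)

def goal_sensitivity(policy):
    J = policy.T                             # J[g, g'] = P(final state == g | commanded g')
    matched = float(p_goal @ np.diag(J))     # E_g J(g, g): commanded goal matches evaluated goal
    mismatched = float(p_goal @ J @ p_goal)  # E_{g, g'} J(g, g'): commanded goal drawn independently
    return matched - mismatched

def diversity_proxy(policy):
    """Mutual information I(G; S_final) with G ~ p_goal, a stand-in for a MISL objective."""
    joint = p_goal[:, None] * policy
    marginal = joint.sum(axis=0)
    return float(np.sum(joint * np.log(joint / (p_goal[:, None] * marginal[None, :]))))

scores = sorted((diversity_proxy(p), goal_sensitivity(p))
                for p in (random_policy() for _ in range(2000)))
flags = sum(1 for (m1, c1), (m2, c2) in zip(scores, scores[1:]) if m2 > m1 and c2 < c1)
print(f"{flags} adjacent policy pairs where the diversity proxy rose but sensitivity fell")
```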

Figures

Figures reproduced from arXiv: 2605.06145 by Alireza Modirshanechi, Benjamin Eysenbach, Eric Schulz, Peter Dayan.

Figure 1: We unify GCRL and MISL as control-maximization problems and prove a correspondence …
Figure 2: Different GCRL formulations yield incompatible optimal policies …
Figure 3: Equivalence conditions. Black edges indicate identical policy orderings; blue edges indicate …
Figure 4: Goal sensitivity C(s, π{·,·}) reflects both objective controllability C*(s) and agent competence. In uncontrollable environments, it is zero (A); in fully controllable environments, it depends on whether the policy ignores goals (B) or reliably selects goal-reaching actions (C).
Figure 5: The precise correspondence of the MISL objectives to the downstream GCRL performance …
Figure 6: Theoretical bounds linking goal-sensitivity to empowerment and goal-behavior MIs.
Figure 7: Counterexample environment (A) showing that different GCRL formulations can induce …
Figure 8: Counterexample showing that maximizing goal sensitivity …
original abstract

Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are umbrella terms: different GCRL tasks use distinct criteria for measuring goal-reaching performance, while different MISL methods optimize distinct notions of behavioral diversity. We address this challenge and unify GCRL and MISL as instances of control maximization. We identify three canonical GCRL formulations and prove that they are fundamentally inequivalent: they can induce incompatible optimal policies even in the same environment. Nevertheless, they all share a common interpretation: a well-performing goal-conditioned policy is one whose future trajectory is highly sensitive to the commanded goal, with the precise notion of sensitivity determined by the GCRL formulation. Noting that MISL objectives can be understood as measures of skill-sensitivity akin to goal-sensitivity, we show that MISL objectives are bounded by formulation-specific downstream goal-sensitivities. These bounds establish a precise correspondence between MISL methods and downstream GCRL tasks: for every GCRL formulation, there exists a matching MISL objective for which more diverse skills afford greater downstream goal sensitivity. Our results thus lay a theoretical foundation for RL pretraining and have important practical implications, such as suggesting which pretraining objectives to use when a user cares about a specific class of downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper unifies goal-conditioned RL (GCRL) and mutual information skill learning (MISL) by framing both as instances of control maximization. It identifies three canonical GCRL formulations, proves they are inequivalent (inducing incompatible optimal policies in the same environment), and reinterprets each via a distinct notion of goal-sensitivity of future trajectories. It then shows that MISL objectives are bounded above by formulation-specific downstream goal-sensitivities, establishing a precise matching correspondence: for each GCRL formulation there exists an MISL objective such that greater skill diversity yields greater goal sensitivity.

Significance. If the sensitivity bounds hold with the stated generality, the work supplies a missing theoretical account for why unsupervised MISL pretraining aids downstream GCRL and supplies a principled way to select pretraining objectives for a given class of goal-reaching tasks. The inequivalence result among GCRL formulations is itself a useful clarification. The control-maximization perspective and the explicit bounds constitute a substantive contribution beyond reinterpretation.

major comments (1)
  1. [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.
minor comments (2)
  1. [Introduction / Section 3] The three GCRL formulations are introduced without an explicit table or side-by-side comparison of their objective functions and optimal-policy characterizations; adding such a table would make the inequivalence claim easier to verify at a glance.
  2. [Preliminaries] Notation for the various sensitivity measures (goal-sensitivity vs. skill-sensitivity) is introduced piecemeal; a single consolidated definition table would reduce ambiguity when the bounds are stated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive summary, for recognizing the value of the inequivalence result and the control-maximization framing, and for the constructive request to clarify the sensitivity bounds. We address the single major comment below and will revise the manuscript to improve transparency.

point-by-point responses
  1. Referee: [Section deriving MISL–GCRL sensitivity bounds] The central correspondence result (abstract and the section deriving the sensitivity bounds) asserts that MISL objectives are bounded by formulation-specific goal-sensitivities for arbitrary MDPs, yet the provided derivations are not reproduced in the excerpt and the skeptic note correctly flags that direct comparability of the sensitivity measures may require unstated restrictions on reward functions, policy classes, or dynamics. If the proofs rely on finite spaces, bounded rewards, or deterministic policies, the claimed matching for continuous or stochastic GCRL settings does not follow. Please supply the full proof of the bound (including all assumptions) or state the precise conditions under which the inequality direction holds.

    Authors: We appreciate the referee drawing attention to the need for explicit assumptions. The derivations appear in Appendix B of the full manuscript (not included in the excerpt). They are stated for finite MDPs with bounded rewards and hold for stochastic policies; no further restrictions on reward functions or dynamics are imposed beyond these. The abstract and main text do not claim the bounds for arbitrary continuous or infinite MDPs. In the revision we will (i) move the complete proof steps into the main body of the sensitivity-bounds section, (ii) open the section with an explicit list of assumptions, and (iii) add a clarifying remark that extensions to continuous settings require additional regularity conditions and are left for future work. This change will make the correspondence fully reproducible from the main text. revision: yes

Circularity Check

0 steps flagged

Reinterpretation of MISL objectives as skill-sensitivity bounds on GCRL goal-sensitivity without definitional reduction or fitted predictions

full rationale

The paper derives a correspondence by defining goal-sensitivity for each of three inequivalent GCRL formulations and showing that MISL objectives are bounded by matching sensitivity measures. This establishes that more diverse skills improve downstream sensitivity for the corresponding formulation. The bounds follow from the paper's stated definitions of sensitivity in general MDPs rather than from any parameter fit, self-referential definition, or load-bearing self-citation. The central unification is therefore a reinterpretation of existing objectives through the sensitivity lens and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest on standard MDP definitions and the new lens of control maximization; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Standard Markov Decision Process assumptions (states, actions, transition probabilities, reward functions, and policies).
    These underpin all definitions of trajectories, goal-conditioned policies, and sensitivity measures used in both GCRL and MISL; the standard objects are written out below.
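
For reference, the objects this axiom supplies, written out in the notation the review uses (a reminder of the standard setup, not the paper's exact definitions):

```latex
% Finite MDP tuple and goal-conditioned policies, as assumed by the domain axiom above.
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  P(s' \mid s, a) \in [0, 1], \qquad
  R_t(s; g) \in \mathbb{R}, \qquad
  \pi_{\{g,\cdot\}}(a \mid s) \ \text{for each commanded goal } g \in \mathcal{S}.
\]
```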

pith-pipeline@v0.9.0 · 5620 in / 1321 out tokens · 62728 ms · 2026-05-08T13:49:01.234178+00:00 · methodology

discussion (0)

