Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control

Ali Mesbah; Nathan P. Lawrence

arxiv: 2512.06471 · v2 · pith:SZRBQK25new · submitted 2025-12-06 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-17 00:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: goal-conditioned reinforcement learning · dual control · optimality gap · partially observed Markov decision processes · probabilistic reward · state estimation · optimal control

The pith

Goal-conditioned reinforcement learning succeeds because its reward represents the probability of reaching target states, yielding a smaller optimality gap than classical quadratic objectives and suiting it to dual control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to explain the empirical success of goal-conditioned reinforcement learning by analyzing it from an optimal control perspective. It derives an optimality gap between the goal-conditioned reward and more traditional dense rewards such as quadratic costs, showing why the former can be more effective. The work then extends the analysis to partially observed settings, connecting the probabilistic reward to state estimation and positioning goal-conditioned RL as a natural fit for dual control problems that require simultaneous estimation and control. If this unification holds, it offers a principled way to choose rewards in reinforcement learning for tasks involving uncertainty and incomplete observations.

Core claim

On the paper's own terms, the core discovery is that interpreting the goal-conditioned reward as the probability of reaching the target states creates an optimality gap relative to classical objectives, which helps explain the success of goal-conditioned RL. In the POMDP setting, this interpretation links the reward to state estimation, making the approach well suited to dual control. The advantages are shown through validation on nonlinear and uncertain environments with both RL and predictive control.

What carries the argument

The derived optimality gap between quadratic objectives and the goal-conditioned reward, along with the connection of the probabilistic reward to state estimation in dual control problems.

If this is right

Classical dense rewards can falter because they do not directly optimize the goal-reaching probability.
Goal-conditioned policies perform well in uncertain environments due to the smaller optimality gap.
In partially observed Markov decision processes, the goal-conditioned reward naturally incorporates state estimation for dual control.
Both reinforcement learning and predictive control techniques benefit from goal-conditioned objectives in nonlinear systems.
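
The first two consequences above can be illustrated with a toy one-step problem. This sketch is not taken from the paper: the dynamics x' = x + u + w, the skewed two-point disturbance, the action grid, and the goal radius are all assumptions chosen to keep the arithmetic easy to follow. It compares the action that minimizes expected quadratic cost with the action that maximizes the probability of landing in the goal set.

```python
import numpy as np

# Hypothetical skewed disturbance w: zero mean, but most mass at w = -1.
noise_values = np.array([-1.0, 4.0])
noise_probs = np.array([0.8, 0.2])

actions = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # assumed action grid
x0 = 0.0            # current state
goal_radius = 0.5   # assumed goal set: |x'| <= goal_radius


def expected_quadratic_cost(u):
    """E[(x')^2] for the assumed dynamics x' = x0 + u + w."""
    next_states = x0 + u + noise_values
    return float(np.sum(noise_probs * next_states**2))


def goal_probability(u):
    """P(|x'| <= goal_radius) for the same dynamics."""
    next_states = x0 + u + noise_values
    in_goal = np.abs(next_states) <= goal_radius
    return float(np.sum(noise_probs * in_goal))


# Pick the best action on the grid under each objective.
u_quadratic = float(actions[np.argmin([expected_quadratic_cost(u) for u in actions])])
u_goal = float(actions[np.argmax([goal_probability(u) for u in actions])])

print("quadratic-optimal action:", u_quadratic,
      "-> goal probability", goal_probability(u_quadratic))
print("goal-optimal action:", u_goal,
      "-> goal probability", goal_probability(u_goal))
```

Because the assumed disturbance has zero mean, the quadratic objective prefers u = 0, which never lands in the goal set, while u = 1 reaches it with probability 0.8. The example only shows that the two objectives can disagree; it says nothing about the size of the gap derived in the paper.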

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This suggests reward designs prioritizing goal probabilities could improve RL in uncertain settings.
The analysis points to potential benefits in combining goal-conditioned methods with explicit estimation techniques.

Load-bearing premise

The goal-conditioned reward can be interpreted directly as a probability of reaching target states, and this interpretation holds when extending the analysis to partially observed settings without additional restrictions.

What would settle it

An experiment in a partially observed nonlinear system where goal-conditioned RL fails to show advantages over classical rewards in reaching performance under uncertainty.

Figures

Figures reproduced from arXiv: 2512.06471 by Ali Mesbah, Nathan P. Lawrence.

**Figure 2.** (Top) Time profiles of the same agent, but using 100 ...

**Figure 3.** The distribution of time spent near goal across ...

Original abstract

Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical "dense" rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper derives a clean optimality gap between quadratic costs and goal-conditioned rewards, then maps the probabilistic view onto dual control in POMDPs without obvious contradictions.

The main thing to know is that the authors derive an optimality gap showing why goal-conditioned rewards, interpreted as probabilities of reaching targets, outperform classical dense quadratic penalties in uncertain systems. They then link this directly to dual control by treating state estimation as part of the value function over beliefs. That framing is the clearest part of the contribution and explains the empirical edge of goal-conditioned RL without hand-waving.

Referee Report

1 major / 2 minor

Summary. The paper analyzes goal-conditioned reinforcement learning from an optimal control viewpoint. It derives an optimality gap between classical quadratic-style objectives and the goal-conditioned probabilistic reward, explaining the success of goal-conditioned RL and limitations of dense rewards. It extends the analysis to partially observed MDPs by linking state estimation to the probabilistic reward, arguing that goal-conditioned rewards are particularly suited to dual control problems. The claims are supported by validation on nonlinear and uncertain environments using both RL and predictive control methods.

Significance. If the derivations are rigorous, the work supplies a theoretical account for the observed advantages of goal-conditioned RL, especially under uncertainty and partial observability, and clarifies its relation to dual control. The explicit optimality-gap derivation and the belief-state reformulation for POMDPs constitute clear strengths that could inform reward design and controller synthesis in uncertain systems.

major comments (1)

[Theoretical analysis and POMDP extension] The central optimality-gap derivation and its extension to the POMDP/dual-control setting rest on interpreting the goal-conditioned reward as the probability of reaching target states. The manuscript should explicitly delineate the dynamical and observational assumptions required for this interpretation to transfer without additional restrictions (see the weakest-assumption note in the reader's report); without this, the load-bearing claim that the same reward remains well-suited to dual control risks being under-supported.

minor comments (2)

[Abstract] The abstract states that advantages are validated on nonlinear and uncertain environments but provides no concrete environment names, metrics, or comparison baselines; a brief sentence with these details would improve readability.
[Notation and definitions] Notation for value functions, beliefs, and the probabilistic reward should be checked for consistency between the fully observed and partially observed sections to avoid reader confusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We have addressed the major comment on the theoretical analysis and POMDP extension by clarifying the required assumptions, which strengthens the support for our claims regarding the suitability of goal-conditioned rewards for dual control.

Point-by-point responses

Referee: [Theoretical analysis and POMDP extension] The central optimality-gap derivation and its extension to the POMDP/dual-control setting rest on interpreting the goal-conditioned reward as the probability of reaching target states. The manuscript should explicitly delineate the dynamical and observational assumptions required for this interpretation to transfer without additional restrictions (see the weakest-assumption note in the reader's report); without this, the load-bearing claim that the same reward remains well-suited to dual control risks being under-supported.

Authors: We thank the referee for this observation. The optimality-gap derivation interprets the goal-conditioned reward as the probability of reaching target states under the standard assumptions of a finite-state MDP with Markovian stochastic dynamics and terminal-time goal achievement. The POMDP extension further assumes a standard belief-state formulation where partial observability is modeled via an observation kernel, and the probabilistic reward is evaluated with respect to the posterior belief over latent states. These are the minimal dynamical and observational assumptions under which the interpretation transfers directly to the dual-control setting without further restrictions, consistent with the literature on goal-conditioned RL and POMDPs. To address the concern, we will add an explicit subsection (or paragraph) in Section 3 that delineates these assumptions, including a note on the weakest conditions referenced in the reader's report, thereby reinforcing the load-bearing claim that the same reward is well-suited to dual control.

Revision: yes
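
The belief-state formulation described in this response can be sketched for a small finite model. Everything numeric here is hypothetical: the three-state, two-action transition array `T`, the observation kernel `O`, and the initial belief are made up for illustration and do not come from the paper. The sketch shows a belief-averaged goal-reaching reward and a standard Bayes-filter belief update, which is one common way to realize "reward evaluated with respect to the belief over latent states".

```python
import numpy as np

GOAL = 0  # index of the goal state (assumed)

# Hypothetical model: T[u, x, x_next] = p(x_next | x, u).
T = np.array([
    [[0.9, 0.1, 0.0],
     [0.6, 0.3, 0.1],
     [0.1, 0.3, 0.6]],
    [[0.8, 0.2, 0.0],
     [0.2, 0.6, 0.2],
     [0.5, 0.2, 0.3]],
])

# Hypothetical observation kernel: O[x_next, y] = p(y | x_next).
O = np.array([
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.8],
])


def belief_reward(b, u):
    """Goal-reaching probability averaged over the belief b."""
    return float(b @ T[u, :, GOAL])


def belief_update(b, u, y):
    """Bayes filter: predict through T, weight by O, renormalize."""
    predicted = b @ T[u]                # p(x_next | b, u)
    unnormalized = O[:, y] * predicted  # times p(y | x_next)
    return unnormalized / unnormalized.sum()


b0 = np.array([0.2, 0.3, 0.5])  # assumed initial belief
rewards = [belief_reward(b0, u) for u in range(T.shape[0])]
b1 = belief_update(b0, u=1, y=0)

print("belief rewards per action:", rewards)
print("updated belief:", b1)
```

The reward depends on the state only through the belief, so an action that sharpens the belief can raise later rewards. That coupling between estimation and control is the dual-control flavor the review refers to, although the paper's own construction may differ in detail.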

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper claims to derive an optimality gap between classical (often quadratic) objectives and the goal-conditioned probabilistic reward directly from optimal control principles, then extends the same interpretation to POMDPs by re-expressing the value function over beliefs for dual control. No equations or steps in the abstract or described claims reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the derivation is presented as following from standard optimal-control relations under explicit assumptions on dynamics and rewards. The analysis remains internally consistent without renaming known results or smuggling ansatzes via prior work by the same authors. This is the expected outcome for a paper whose central claims rest on explicit derivations rather than tautological reparameterizations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters or invented entities are described. The central claim rests on the standard domain assumption that goal-conditioned RL is defined by maximizing reachability probability.

axioms (1)

Domain assumption: Goal-conditioned RL is the problem of training an agent to maximize the probability of reaching target goal states.
This is the opening definition in the abstract and underpins the subsequent optimality-gap derivation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Theorem 1 ... (1−γ)E[∑ γ^t log p(...)] ≤ log((1−γ)E[∑ γ^t p(...)]) ... by Jensen's inequality

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: r(b,u) = E[p(x'=0|x,u)] ... V^*(b) = max {E[p(x'=0|b,u,y')] + γ E[V^*(b')]}
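
The Jensen-type inequality quoted in the first passage can be checked numerically. The following is a minimal sketch under stated assumptions: the per-step probabilities `p` are synthetic uniform draws standing in for the elided p(...) terms, and the discount weights are renormalized over a finite horizon so that they sum to one, which is the role the (1−γ) factor plays over an infinite horizon. None of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
horizon = 50
num_rollouts = 1000

# Synthetic stand-in for per-step goal probabilities along sampled rollouts.
p = rng.uniform(0.05, 1.0, size=(num_rollouts, horizon))

# Discount weights, renormalized to sum to one over the finite horizon.
weights = gamma ** np.arange(horizon)
weights = weights / weights.sum()

# Left side: discounted expected log-probability.
lhs = float(np.mean(np.sum(weights * np.log(p), axis=1)))
# Right side: log of the discounted expected probability.
rhs = float(np.log(np.mean(np.sum(weights * p, axis=1))))

gap = rhs - lhs
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}, gap = {gap:.4f}")
```

Because the logarithm is concave and the combined rollout and discount weights form a probability distribution, the left side cannot exceed the right side for any choice of `p` in (0, 1]. The printed gap depends on how spread out the synthetic probabilities are.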

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
