Revised Progressive-Hedging-Algorithm Based Two-layer Solution Scheme for Bayesian Reinforcement Learning

Daniel Zhuoyu Long; Duan Li; Xin Huang

arXiv: 1906.09035 · v1 · pith:OQLCKNVRnew · submitted 2019-06-21 · eess.SY · cs.LG · cs.SY


Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3

classification eess.SY · cs.LG · cs.SY

keywords Bayesian reinforcement learning, progressive hedging algorithm, dynamic programming, linear quadratic Gaussian, dual control, uncertainty decomposition, two-layer scheme, stochastic control


The pith

A two-layer scheme approximates the optimal policy for Bayesian reinforcement learning by separating reducible and irreducible uncertainties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-layer solution scheme for Bayesian RL problems that involve both inherent noise and unknown parameters. The lower layer uses time-decomposition dynamic programming while the upper layer applies scenario-decomposition with a revised progressive hedging algorithm. This structure allows separating reducible system uncertainty from irreducible uncertainty. A reader would care because it offers a direct approximation of the optimal policy for challenging non-episodic cases like the linear-quadratic-Gaussian problem with unknown gain, a problem that has persisted for decades.
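To make the distinction between reducible and irreducible uncertainty concrete, here is a minimal sketch. It assumes a scalar integrator whose increment is `dx = b * u + w`, with a Gaussian prior on the unknown gain `b` and Gaussian noise `w`. The model, the prior values, and the function names are illustrative assumptions, not taken from the paper. The belief variance over `b` shrinks as data arrives (reducible), while the noise variance stays fixed no matter how much is observed (irreducible).

```python
import random


def update_gain_belief(mean, var, u, dx, noise_var):
    """Conjugate Gaussian update of the belief N(mean, var) over the gain b.

    Assumed observation model (illustrative): dx = b * u + w, w ~ N(0, noise_var).
    """
    innovation_var = noise_var + var * u * u
    gain = var * u / innovation_var
    new_mean = mean + gain * (dx - mean * u)
    new_var = var * noise_var / innovation_var
    return new_mean, new_var


def run_probing(true_b=0.7, noise_var=0.25, steps=50, u=1.0, seed=0):
    """Apply a constant probing input and track the belief over b."""
    rng = random.Random(seed)
    mean, var = 0.0, 4.0  # hypothetical prior, chosen only for illustration
    for _ in range(steps):
        dx = true_b * u + rng.gauss(0.0, noise_var ** 0.5)
        mean, var = update_gain_belief(mean, var, u, dx, noise_var)
    return mean, var


if __name__ == "__main__":
    mean, var = run_probing()
    print(f"posterior mean {mean:.3f}, posterior variance {var:.5f}")
```

Note that a zero input (`u = 0`) leaves the belief unchanged, which is the dual-control tension in miniature: the control has to excite the system to learn the gain, even though excitation costs something.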

Core claim

The central claim is that a two-layer scheme can approximate the optimal policy directly for a class of Bayesian RL problems. The lower layer applies time-decomposition-based dynamic programming, and the upper layer applies a scenario-decomposition-based revised progressive hedging algorithm. The key feature is that reducible system uncertainty is separated from irreducible uncertainty at the two layers, which the paper demonstrates on the linear-quadratic-Gaussian problem with unknown gain.

What carries the argument

The two-layer solution scheme, which uses dynamic programming for time decomposition and a revised progressive hedging algorithm for scenario decomposition, so that reducible uncertainty is separated from irreducible uncertainty.
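For readers unfamiliar with scenario decomposition, the following is a minimal sketch of the standard progressive hedging algorithm on a toy problem, not the revised variant the paper proposes (the abstract does not describe the revision). The problem is an assumption made for illustration: a single decision `x` must be the same in every scenario, and scenario `s` has cost `0.5 * (x - targets[s])**2` with fixed probability `probs[s]`. The function name and parameters are hypothetical.

```python
def progressive_hedging(targets, probs, rho=1.0, iters=60):
    """Standard PHA for min_x sum_s probs[s] * 0.5 * (x - targets[s])**2,
    where the single decision x must be identical across scenarios."""
    x = list(targets)  # start from the scenario-wise minimizers
    x_bar = sum(p * xs for p, xs in zip(probs, x))
    w = [0.0] * len(targets)  # multipliers on the non-anticipativity constraint
    for _ in range(iters):
        # 1) Solve each scenario subproblem (closed form for this quadratic):
        #    argmin_x 0.5*(x - c)^2 + w_s*x + (rho/2)*(x - x_bar)^2
        x = [(c - ws + rho * x_bar) / (1.0 + rho) for c, ws in zip(targets, w)]
        # 2) Aggregate: probability-weighted average gives the implementable decision.
        x_bar = sum(p * xs for p, xs in zip(probs, x))
        # 3) Penalize each scenario's deviation from the aggregate.
        w = [ws + rho * (xs - x_bar) for ws, xs in zip(w, x)]
    return x_bar, x, w


if __name__ == "__main__":
    x_bar, x, w = progressive_hedging([1.0, 2.0, 6.0], [0.5, 0.3, 0.2])
    print(x_bar, x, w)
```

In this toy case the scenario solutions converge to the probability-weighted mean of the targets. The scenario probabilities stay fixed throughout, which is exactly the assumption that becomes delicate once the policy influences what is learned about the unknown parameter.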

If this is right

The scheme enables direct policy approximation rather than value function approximation in Bayesian RL.
It addresses the dual control challenge in linear-quadratic-Gaussian systems with unknown parameters.
By decomposing and conquering uncertainties at different layers, it improves handling of non-episodic online learning.
Existing approaches like Thompson sampling can be compared against this decomposition method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of uncertainty types could be tested in other stochastic control problems beyond LQG.
If effective, this might reduce the computational burden in high-dimensional Bayesian RL by handling scenarios separately.
Future work could integrate this with online learning to update the layers dynamically.

Load-bearing premise

The revised progressive hedging algorithm applies effectively at the upper layer to decompose scenarios and separate reducible from irreducible uncertainty in the Bayesian RL problem.

What would settle it

Simulation results on the linear-quadratic-Gaussian problem with unknown gain where the two-layer scheme fails to produce a policy with lower cost than standard Bayesian RL approximations.

Original abstract

Stochastic control with both inherent random system noise and lack of knowledge on system parameters constitutes the core and fundamental topic in reinforcement learning (RL), especially under non-episodic situations where online learning is much more demanding. This challenge has been notably addressed in Bayesian RL recently where some approximation techniques have been developed to find suboptimal policies. While existing approaches mainly focus on approximating the value function, or on involving Thompson sampling, we propose a novel two-layer solution scheme in this paper to approximate the optimal policy directly, by combining the time-decomposition based dynamic programming (DP) at the lower layer and the scenario-decomposition based revised progressive hedging algorithm (PHA) at the upper layer, for a type of Bayesian RL problem. The key feature of our approach is to separate reducible system uncertainty from irreducible one at two different layers, thus decomposing and conquering. We demonstrate our solution framework more especially via the linear-quadratic-Gaussian problem with unknown gain, which, although seemingly simple, has been a notorious subject over more than half century in dual control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes a two-layer DP-plus-revised-PHA scheme for Bayesian LQG but supplies no derivations or results, so the key extension remains unverified.


The main thing to know is that this paper sketches a two-layer decomposition for Bayesian RL in the LQG case with unknown gain: dynamic programming handles time decomposition at the lower layer while a revised progressive hedging algorithm does scenario decomposition at the upper layer, with the stated aim of separating reducible parameter uncertainty from irreducible noise.

The writeup gives no derivations, no proofs, and no numerical checks, so it is impossible to tell whether the revision to PHA actually works when the policy affects the measure itself. The stress-test note is accurate on this point. Standard PHA applies to fixed stochastic programs; extending it to dual-control problems where the distribution depends on the chosen policy requires showing how non-anticipativity and convexity are preserved under the Bayesian update, and none of that appears here.

What is new is the specific layering chosen to attack direct policy approximation rather than value-function approximation or Thompson sampling. The motivation section does a clear job of recalling why the dual-control problem has stayed hard for decades. Beyond that framing, there is little substance. No code, data, or reproducible experiments are mentioned, and the citation pattern is not an issue because the paper is mostly a proposal.

This kind of outline might interest a narrow group of researchers already working on decomposition methods in stochastic control who want to see one more idea for separating uncertainty types. Most readers in RL or control will find it too thin to extract usable value or to build on. It does not show enough technical grounding to justify sending it to peer review; the central algorithmic claim is not demonstrated.

Referee Report

1 major / 0 minor

Summary. The paper proposes a novel two-layer solution scheme for a class of Bayesian RL problems (exemplified by the LQG problem with unknown gain). The lower layer applies time-decomposition dynamic programming while the upper layer applies scenario-decomposition via a revised progressive hedging algorithm; the central feature is the separation of reducible parameter uncertainty from irreducible noise at the two layers in order to approximate the optimal policy directly.
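As a point of reference for the time-decomposition layer, here is a minimal sketch of textbook dynamic programming for a scalar linear-quadratic problem with a known gain. How the paper's lower layer conditions on scenarios is not stated in the abstract, so treating the gain `b` as fixed is an assumption made only for illustration, as are the function name and the cost parameterization.

```python
def lq_backward_pass(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for x_{t+1} = a*x_t + b*u_t + w_t with
    stage cost q*x_t**2 + r*u_t**2 and terminal cost q_final*x_N**2.

    Returns feedback gains K_t (policy u_t = -K_t * x_t) and the quadratic
    cost-to-go coefficients P_t.
    """
    P = [0.0] * (horizon + 1)
    K = [0.0] * horizon
    P[horizon] = q_final
    for t in range(horizon - 1, -1, -1):
        K[t] = a * b * P[t + 1] / (r + b * b * P[t + 1])
        P[t] = q + a * a * P[t + 1] - a * b * P[t + 1] * K[t]
    return K, P


if __name__ == "__main__":
    gains, costs = lq_backward_pass(a=1.0, b=0.7, q=1.0, r=0.1, q_final=1.0, horizon=10)
    print(gains[0], costs[0])
```

With a known gain, zero-mean additive noise that is independent of the state does not change the feedback gains; it only adds a constant to the expected cost. The difficulty the review discusses arises once `b` is unknown and the control also determines how quickly it is learned.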

Significance. If the claimed decomposition and the extension of the revised PHA are valid, the work would supply a direct policy-approximation route for non-episodic dual-control problems that have resisted solution for decades, complementing existing value-function or Thompson-sampling approximations.

major comments (1)

[Abstract] The manuscript asserts that the revised PHA at the upper layer successfully separates reducible from irreducible uncertainty, yet it supplies neither a derivation showing that non-anticipativity constraints are preserved under the Bayesian update nor any verification that the convexity assumptions and penalty-update rules of standard PHA remain intact when the measure itself depends on the policy.
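For context on what would need to be preserved, the standard progressive hedging iteration can be written as follows. The notation is an assumption introduced here, not the paper's: $f_s$ is the cost in scenario $s$, $p_s$ its fixed probability, $\rho > 0$ the penalty parameter, and $P_{\mathcal{N}}$ the probability-weighted averaging that maps scenario solutions onto the subspace $\mathcal{N}$ of non-anticipative policies (decisions that agree across all scenarios sharing the same history).

```latex
% Standard PHA (Rockafellar and Wets, 1991); the scenario probabilities p_s are fixed.
\begin{align*}
x_s^{k+1} &\in \arg\min_{x_s}\; f_s(x_s) + \langle w_s^{k}, x_s \rangle
            + \tfrac{\rho}{2}\,\lVert x_s - \hat{x}_s^{k} \rVert^2, \\
\hat{x}^{k+1} &= P_{\mathcal{N}}\bigl(x^{k+1}\bigr), \\
w_s^{k+1} &= w_s^{k} + \rho\,\bigl(x_s^{k+1} - \hat{x}_s^{k+1}\bigr).
\end{align*}
```

The averaging step uses the weights $p_s$, and the classical convergence theory assumes each $f_s$ is convex. The referee's concern is that under Bayesian learning the effective weights depend on the policy being computed, so neither ingredient can be taken for granted.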

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will strengthen the manuscript with additional derivations as indicated.


Referee: [Abstract] The manuscript asserts that the revised PHA at the upper layer successfully separates reducible from irreducible uncertainty, yet it supplies neither a derivation showing that non-anticipativity constraints are preserved under the Bayesian update nor any verification that the convexity assumptions and penalty-update rules of standard PHA remain intact when the measure itself depends on the policy.

Authors: We agree that the current version contains neither an explicit derivation of non-anticipativity preservation nor a formal check that convexity and the penalty updates survive when the measure is policy-dependent. In the revision we will add a new subsection (or appendix) that (i) shows the Bayesian update occurs only at the upper layer after the lower-layer DP has produced a candidate policy, so that the scenario set remains non-anticipative with respect to the information available at each stage; (ii) verifies that the quadratic structure of the LQG cost preserves convexity of the augmented Lagrangian even under the posterior measure; and (iii) confirms that the standard PHA penalty-update rule continues to drive the iterates to a feasible non-anticipative solution because the measure update is independent of the intra-scenario decisions. These additions will be placed in Section 3 and will not alter the algorithmic claims.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in two-layer DP-PHA decomposition for Bayesian RL


The paper's core contribution is a methodological proposal: a two-layer scheme that applies time-decomposition DP at the lower layer and scenario-decomposition via a revised PHA at the upper layer to separate reducible parameter uncertainty from irreducible noise in Bayesian RL (exemplified on LQG with unknown gain). This decomposition is introduced as a novel construction rather than derived from prior fitted parameters or self-referential definitions. No equations or steps in the abstract reduce a claimed prediction back to an input by construction, and the revision/extension of PHA is framed as part of the new scheme without load-bearing reliance on unverified self-citations for the separation logic. The derivation remains self-contained as an algorithmic framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the proposal relies on standard DP and PHA but details are absent.


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] Aoki, M. (1967). Optimization of Stochastic Systems: Topics in Discrete-Time Systems, volume 32. Academic Press.
[2] Åström, K. J. and Helmersson, A. (1986). Dual control of an integrator with unknown gain. Computers & Mathematics with Applications, 12(6):653–662.
[3] Bar-Shalom, Y. (1981). Stochastic dynamic programming: Caution and probing. IEEE Transactions on Automatic Control, 26(5):1184–1195.
[4] Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control. Unpublished textbook manuscript, see https://web.mit.edu/dimitrib/www/RLbook.html
[5] Dallaire, P., Besse, C., Ross, S., and Chaib-draa, B. (2009). Bayesian reinforcement learning in continuous POMDPs with Gaussian processes. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2604–2609. IEEE.
[6] Deshpande, J., Upadhyay, T., and Lainiotis, D. (1973). Adaptive control of linear stochastic systems. Automatica, 9(1):107–115.
[7] Feldbaum, A. (1960–1961). Dual control theory I–IV. Avtomatika i Telemekhanika, 21(9), 21(11), 22(1), 22(2).
[8] Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al. (2015). Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483.
[9] Kirk, D. E. (1970). Optimal Control Theory: An Introduction. Springer.
[10] Klenske, E. D. and Hennig, P. (2016). Dual control for approximate Bayesian reinforcement learning. Journal of Machine Learning Research, 17:1–30.
[11] Li, D. and Ng, W.-L. (2000). Optimal dynamic portfolio selection: Multiperiod mean-variance formulation. Mathematical Finance, 10(3):387–406.
[12] Li, D., Qian, F., and Fu, P. (2008). Optimal nominal dual control for discrete-time linear-quadratic Gaussian problems with unknown parameters. Automatica, 44(1):119–127.
[13] Ouyang, Y., Gagrani, M., and Jain, R. (2017). Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047.
[14] Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704. ACM.
[15] Rockafellar, R. T. (2018). Progressive hedging in nonconvex stochastic optimization. In The Workshop on Variational Analysis and Stochastic Optimization, Hong Kong Polytechnic University.
[16] Rockafellar, R. T. and Wets, R. J.-B. (1991). Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research, 16(1):119–147.
[17] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
[18] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
[19] Tse, E. and Bar-Shalom, Y. (1973). An actively adaptive control for linear systems with random parameters via the dual control approach. IEEE Transactions on Automatic Control, 18(2):109–117.
