A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness and Stability

Gr\'egoire G. Macqueron; Leilei Cui; Petter N. Kolm; Zhong-Ping Jiang

arxiv: 2505.02970 · v4 · submitted 2025-05-05 · 🧮 math.OC

A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness and Stability

Leilei Cui , Zhong-Ping Jiang , Petter N. Kolm , Gr\'egoire G. Macqueron This is my paper

Pith reviewed 2026-05-22 16:33 UTC · model grok-4.3

classification 🧮 math.OC

keywords value iterationdata-driven controlstochastic LQRinput-to-state stabilityadaptive dynamic programmingreinforcement learningrobustnessconvergence

0 comments

The pith

Value iteration for data-driven stochastic LQR is globally exponentially stable from any positive semidefinite initial matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that value iteration learns optimal controllers directly from input-state data for stochastic linear quadratic systems without first identifying a model. It proves the iteration is globally exponentially stable for any nonnegative initial value matrix when noise is absent, removing the need for special starting conditions required in earlier work. When external disturbances are present, the process remains input-to-state stable and approaches the optimal solution within a small neighborhood if the disturbance level is low enough. A new robust adaptive dynamic programming algorithm is given that needs no initial admissible policy. These results support more reliable data-driven control in settings where models are unavailable and measurements contain noise.

Core claim

For discrete-time stochastic linear quadratic regulator problems with completely unknown dynamics and cost, value iteration is globally exponentially stable for any positive semidefinite initial value matrix in the noise-free case. In the presence of external disturbances the iteration exhibits small-disturbance input-to-state stability and converges inside a neighborhood of the optimal solution whose radius shrinks with the disturbance size. A new non-model-based robust adaptive dynamic programming algorithm is introduced that requires no prior knowledge of an initial admissible control policy.

What carries the argument

The direct data-driven value iteration step that updates the value function estimate solely from collected input-state trajectories without any intermediate system identification.

Load-bearing premise

The system must be exactly a discrete-time stochastic linear quadratic regulator and the collected input-state data must be rich enough to support direct policy learning without model identification.

What would settle it

Run the value iteration from a zero initial matrix on a noise-free stochastic LQR system whose input-state data satisfy persistent excitation; divergence or failure to reach the known optimal cost would falsify the global exponential stability claim.

Figures

Figures reproduced from arXiv: 2505.02970 by Gr\'egoire G. Macqueron, Leilei Cui, Petter N. Kolm, Zhong-Ping Jiang.

**Figure 1.** Figure 1: Convergence and stability of the nominal control (PI and VI), O-LSPI, LSPI, R-LSVI (with and without rescaling) and policy gradient algorithms on the data center cooling benchmark problem as the sample size increases. (Left) Cost relative error. Dashed lines represent the median relative error, with the shaded region covering the 25th to 75th percentiles, estimated from 100 trajectories. (Right) Frequency … view at source ↗

**Figure 2.** Figure 2: Convergence and stability of the nominal control (VI and PI), LSPI, O-LSPI, R-LSVI and policy gradient algorithms on the data center cooling benchmark problem with non-quadratic cost. (Left) Costs as the exponent κ varies. Dashed lines represent the median relative error, with the shaded region covering the 25th to 75th percentiles, estimated from 100 trajectories. (Right) Frequency of stabilizing control… view at source ↗

**Figure 3.** Figure 3: Convergence and stability of the nominal VI, O-LSPI and R-LSVI algorithms on the dynamic portfolio allocation problem as the sample size increases. (Left) Cost relative error. Dashed lines represent the median relative error, with the shaded region covering the 25th to 75th percentiles, estimated from 100 trajectories. (Right) Frequency of stabilizing controllers found by the algorithms. Only costs corre… view at source ↗

**Figure 4.** Figure 4: Convergence and stability of the nominal VI, O-LSPI and R-LSVI algorithms on the dynamic portfolio allocation problem with non-quadratic cost. (Left) Costs as the exponent κ varies. Dashed lines represent the median relative error, with the shaded region covering the 25th to 75th percentiles, estimated from 100 trajectories. (Right) Frequency of stabilizing controllers found by the algorithms. Only costs … view at source ↗

read the original abstract

Unlike traditional model-based reinforcement learning approaches that estimate system parameters from data, non-model-based data-driven control learns the optimal policy directly from input-state data without any intermediate model identification. Although this direct reinforcement learning approach offers increased adaptability and resilience to model misspecification, its reliance on raw data leaves it vulnerable to system noise and disturbances that may undermine convergence, robustness, and stability. In this article, we establish the convergence, robustness, and stability of value iteration (VI) for data-driven control of stochastic linear quadratic (LQ) systems in discrete-time with entirely unknown dynamics and cost. Our contributions are three-fold. First, we prove that VI is globally exponentially stable for any positive semidefinite initial value matrix in noise-free settings, thereby significantly relaxing restrictive assumptions on initial value functions in existing literature. Second, we extend our analysis to settings with external disturbances, proving that VI maintains small-disturbance input-to-state stability (ISS) and converges within a small neighborhood of the optimal solution when disturbances are sufficiently small. Third, we propose a new non-model-based robust adaptive dynamic programming (ADP) algorithm for adaptive optimal controller design, which, unlike existing procedures, requires no prior knowledge of an initial admissible control policy. Numerical experiments on a ``data center cooling'' problem demonstrate the convergence and stability of the algorithm compared to established methods, highlighting its robustness and adaptability for data-driven control in noisy environments. Finally, we apply the method to dynamic portfolio allocation, demonstrating its practical relevance outside traditional control tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows global exponential stability for data-driven value iteration on stochastic LQR from any positive-semidefinite start and gives a robust ADP version that skips the usual initial admissible policy.

read the letter

The main point is that value iteration stays globally exponentially stable for stochastic LQR even when you start from an arbitrary positive-semidefinite matrix instead of a stabilizing one, and the authors build a disturbance-robust version that converges near the optimum for small enough noise without needing any initial admissible policy. That relaxation is the concrete step forward relative to earlier data-driven ADP work on LQR. They also supply the small-disturbance ISS argument and test the algorithm on a data-center cooling example plus a portfolio allocation task, where it holds up against standard baselines. The numerical cases are useful for seeing practical behavior in noisy settings. The proofs rest on standard rank conditions for the collected data to identify the Q-function, and the paper claims these hold even from poor initial matrices because the iteration itself generates the necessary excitation. If that argument is tight, the result is clean; the only real question is whether the data matrix stays well-conditioned in the early transient for every possible PSD start, which would be worth checking in the derivations. Overall the work sits squarely in the data-driven stochastic control literature and should interest people who want to drop the admissible-policy precondition in LQR-type problems. It is worth a serious referee because the stability claims are stated precisely and the algorithm is reproducible from the description.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a fully data-driven value iteration (VI) algorithm for discrete-time stochastic linear quadratic regulators with unknown dynamics and quadratic costs. It claims three main results: (i) global exponential stability of the VI iterates for any positive semidefinite initial value matrix in the noise-free case, relaxing the usual requirement of an initial admissible policy; (ii) small-disturbance input-to-state stability (ISS) of the iteration under bounded external disturbances, with convergence to a neighborhood of the optimal solution; and (iii) a robust adaptive dynamic programming procedure that implements the method without prior knowledge of a stabilizing controller. The claims are supported by numerical experiments on a data-center cooling system and a dynamic portfolio allocation task.

Significance. If the stability proofs are free of gaps in the data-richness argument, the relaxation of the admissible-policy assumption would be a substantive contribution to data-driven LQR methods, as it removes a common practical bottleneck. The small-disturbance ISS extension and the non-model-based robust ADP algorithm add robustness considerations that are relevant for stochastic and noisy environments. The portfolio-allocation example illustrates applicability outside classical control.

major comments (2)

The global exponential stability result for arbitrary PSD initial value matrices (abstract and stability theorem) presupposes that the collected input-state data matrix satisfies the rank/excitation condition needed to uniquely recover the optimal Q-function from the Bellman residual. When the initial value matrix induces an unstable closed-loop, the generated trajectories may fail to provide persistent excitation. The proof must explicitly show how this rank condition is guaranteed without an auxiliary excitation signal or an initial stabilizing policy; otherwise the contraction-mapping argument does not hold for all PSD starts.
In the small-disturbance ISS analysis, the size of the ultimate convergence neighborhood is asserted to shrink with the disturbance bound. The manuscript should supply the explicit functional dependence of this neighborhood radius on the disturbance magnitude (e.g., via the constants appearing in the ISS-Lyapunov inequality) and verify that the contraction rate remains uniform over the considered disturbance class.

minor comments (2)

Clarify the precise assumptions on controllability, observability, and positive-definiteness of the cost matrices at the outset of the theoretical development.
In the numerical sections, report the condition number or rank of the data matrix for each initial-value choice to allow readers to assess whether the excitation condition was satisfied in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Below we provide point-by-point responses to the major comments. We believe these clarifications and additions will improve the manuscript.

read point-by-point responses

Referee: The global exponential stability result for arbitrary PSD initial value matrices (abstract and stability theorem) presupposes that the collected input-state data matrix satisfies the rank/excitation condition needed to uniquely recover the optimal Q-function from the Bellman residual. When the initial value matrix induces an unstable closed-loop, the generated trajectories may fail to provide persistent excitation. The proof must explicitly show how this rank condition is guaranteed without an auxiliary excitation signal or an initial stabilizing policy; otherwise the contraction-mapping argument does not hold for all PSD starts.

Authors: We thank the referee for highlighting this important point. In our approach, the input-state data is collected offline using a fixed exploratory input sequence that includes sufficient excitation (such as white noise added to a nominal input), which is independent of the initial value matrix and the subsequent value iteration process. This ensures the rank condition on the data matrix holds a priori, regardless of whether the initial value matrix corresponds to a stabilizing policy. The value iteration then proceeds using this fixed data set, and the global exponential stability result applies to the sequence of value functions under this condition. We will revise the manuscript to explicitly state this separation between data collection and the iteration in the relevant sections and add a remark in the stability theorem to address this concern. revision: partial
Referee: In the small-disturbance ISS analysis, the size of the ultimate convergence neighborhood is asserted to shrink with the disturbance bound. The manuscript should supply the explicit functional dependence of this neighborhood radius on the disturbance magnitude (e.g., via the constants appearing in the ISS-Lyapunov inequality) and verify that the contraction rate remains uniform over the considered disturbance class.

Authors: We agree that making the dependence explicit will enhance the clarity of the robustness result. In the proof of Theorem Y on small-disturbance ISS, the ultimate bound on the neighborhood is given by a term of the form C * delta / (1 - rho), where delta is the disturbance bound, rho < 1 is the contraction rate, and C is a constant depending on the system parameters and the ISS-Lyapunov function. We show that rho remains uniform (less than some value strictly below 1) for all disturbances satisfying delta < delta_max, where delta_max is derived from the Lipschitz constants and system bounds. We will include these explicit expressions and the uniformity argument in the revised version of the proof. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on standard LQR contraction and data excitation assumptions

full rationale

The paper derives global exponential stability of value iteration for arbitrary positive semidefinite initial value matrices from the contraction property of the Bellman operator under the stated rank/excitation condition on collected input-state trajectories. This is a standard first-principles argument in stochastic LQR and does not reduce the target stability result to a quantity defined or fitted by the paper's own equations. The extension to small-disturbance ISS follows similarly from perturbation analysis around the nominal contraction. No load-bearing self-citation, ansatz smuggling, or renaming of known empirical patterns is present; the data-richness assumption is explicitly stated rather than derived from the algorithm outputs themselves. The derivation chain is therefore self-contained against external LQR benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard discrete-time stochastic LQR formulation with unknown dynamics, the existence of sufficiently rich input-state data, and the smallness of external disturbances for the ISS property. No new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption The plant is exactly a discrete-time stochastic linear system with quadratic cost and additive noise.
Invoked throughout the abstract as the setting for which convergence and stability are proved.
domain assumption Collected input-state trajectories are sufficiently exciting to permit direct policy learning without model identification.
Required for the non-model-based approach to be well-defined.

pith-pipeline@v0.9.0 · 5820 in / 1511 out tokens · 46287 ms · 2026-05-22T16:33:06.484230+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Data-driven online control for real-time optimal economic dispatch and temperature regulation in district heating systems
eess.SY 2026-03 unverdicted novelty 5.0

A data-driven controller embeds steady-state economic optimality into district heating temperature dynamics for forecast-free convergence to optimal dispatch and temperature regulation.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Abbasi-Yadkori, N

Y. Abbasi-Yadkori, N. Lazic, and C. Szepesv´ ari. Model- free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics , pages 3108–3117. PMLR, 2019

work page 2019
[2]

Abbasi-Yadkori and C

Y. Abbasi-Yadkori and C. Szepesv´ ari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theor y, pages 1–26. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[3]

LQG for portfolio optimization

Marc Abeille, Alessandro Lazaric, Xavier Brokmann, et a l. Lqg for portfolio opti- mization. arXiv preprint arXiv:1611.00997 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[4]

K. J. ˚ Astr¨ om and B. Wittenmark.Adaptive Control. Addison-Wesley, MA, USA, 2nd edition, 1997. A FULLY DATA-DRIVEN V ALUE ITERATION FOR STOCHASTIC LQR 37

work page 1997
[5]

R. W. Beard, G. N. Saridis, and J. T. Wen. Galerkin approxi mations of the generalized Hamilton-Jacobi-Bellman equation. Automatica, 33(12):2159–2177, 1997

work page 1997
[6]

R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957

work page 1957
[7]

D. P. Bertsekas. Dynamic Programming and Optimal Control , volume 1. Athena Sci- entiﬁc, Belmont, MA, 3rd edition, 2011

work page 2011
[8]

T. Bian, Y. Jiang, and Z.-P. Jiang. Adaptive dynamic prog ramming and optimal control of nonlinear nonaﬃne systems. Automatica, 50(10):2624–2632, 2014

work page 2014
[9]

T. Bian, Y. Jiang, and Z.-P. Jiang. Adaptive dynamic prog ramming for stochastic systems with state and control dependent noise. IEEE Transactions on Automatic Control, 61(12):4170–4175, 2016

work page 2016
[10]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Value iteration and adaptive dy namic programming for data-driven adaptive optimal control design. Automatica, 71:348–360, 2016

work page 2016
[11]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Continuous-time robust dynami c programming. SIAM Jour- nal on Control and Optimization , 57(6):4150–4174, 2019

work page 2019
[12]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Reinforcement learning and ada ptive optimal control for continuous-time nonlinear systems: A value iteration appr oach. IEEE Transactions on Neural Networks and Learning Systems , 33(7):2781–2790, 2022

work page 2022
[13]

T. Bian, D. M. Wolpert, and Z.-P. Jiang. Model-free robu st optimal feedback mech- anisms of biological motor control. Neural Computation , 32(3):562–595, 2020

work page 2020
[14]

S. J. Bradtke, B. E. Ydstie, and A. G. Barto. Adaptive lin ear quadratic control using policy iteration. In Proceedings of 1994 American Control Conference , pages 3475–3479, 1994

work page 1994
[15]

Cui, Z.-P

L. Cui, Z.-P. Jiang, and E. D. Sontag. Small-disturbanc e input-to-state stability of perturbed gradient ﬂows: Applications to LQR problem. Systems & Control Letters , 188:105804, 2024

work page 2024
[16]

L. Cui, B. Pang, and Z.-P. Jiang. Reinforcement-learni ng-based risk-sensitive opti- mal feedback mechanisms of biological motor control. In 62nd IEEE Conference on Decision and Control (CDC) , pages 7944–7949, 2023

work page 2023
[17]

L. Cui, S. Wang, J. Zhang, D. Zhang, J. Lai, Y. Zheng, Z. Zh ang, and Z.-P. Jiang. Learning-based balance control of wheel-legged robots. IEEE Robotics and Automa- tion Letters, 6(4):7667–7674, 2021

work page 2021
[18]

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sam ple complexity of the linear quadratic regulator. Foundations of Computational Mathematics , 20(4):633– 679, 2020

work page 2020
[19]

Value iteration for LQR cont rol of unknown stochastic- parameter linear systems

Wenwu Fan and Junlin Xiong. Value iteration for LQR cont rol of unknown stochastic- parameter linear systems. Systems & Control Letters , 185:105731, 2024

work page 2024
[20]

Machine learning applications for data center optimization

Jim Gao. Machine learning applications for data center optimization. White Paper 21, Google, Mountain View, CA, USA, 2014

work page 2014
[21]

Gˆ arleanu and L

N. Gˆ arleanu and L. H. Pedersen. Dynamic trading with predictable returns and trans- action costs. The Journal of Finance , 68(6):2309–2340, 2013

work page 2013
[22]

W. Y. Ha, S. Chakraborty, Y. Yu, S. Ghasemi, and Z.-P. Jia ng. Automated lane changing through learning-based control: An experimental study. In IEEE 26th Inter- national Conference on Intelligent Transportation System s (ITSC) , pages 4215–4220, 2023. 38LEILEI CUI, ZHONG-PING JIANG, PETTER N. KOLM, GREGOIRE G. MACQUERON

work page 2023
[23]

G. Hewer. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control , 16(4):382–384, 1971

work page 1971
[24]

Matrix analysis

Roger A Horn and Charles R Johnson. Matrix analysis . Cambridge University Press, 2012

work page 2012
[25]

R. A. Howard. Dynamic Programming and Markov Processes . John Wiley & Sons, New York, 1960

work page 1960
[26]

Huang, Z.-P

M. Huang, Z.-P. Jiang, and K. Ozbay. Learning-based ada ptive optimal control for connected vehicles in mixed traﬃc: Robustness to driver rea ction time. IEEE Trans- actions on Cybernetics , 52(6):5267–5277, 2022

work page 2022
[27]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Computational adaptive optim al control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10):2699–2704, 2012

work page 2012
[28]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Adaptive dynamic programming as a theory of sensorimotor control. Biological Cybernetics, 108(4):459–473, 2014

work page 2014
[29]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Robust Adaptive Dynamic Programming. Wiley-IEEE Press, Hoboken, New Jersey, 2017

work page 2017
[30]

Adaptive optimal control of networked control systems with two-channel stoc hastic dropouts

Yi Jiang, Weinan Gao, Ci Chen, Tianyou Chai, and Frank L L ewis. Adaptive optimal control of networked control systems with two-channel stoc hastic dropouts. SIAM Journal on Control and Optimization , 61(5):3183–3208, October 2023

work page 2023
[31]

Adaptive linear quadrat ic control for stochastic discrete-time linear systems with unmeasurable multiplic ative and additive noises

Yi Jiang, Lu Liu, and Gang Feng. Adaptive linear quadrat ic control for stochastic discrete-time linear systems with unmeasurable multiplic ative and additive noises. IEEE Transactions on Automatic Control , 69(11):7808–7815, November 2024

work page 2024
[32]

Jiang, T

Z.-P. Jiang, T. Bian, and W. Gao. Learning-based contro l: A tutorial and some recent results. Foundations and Trends ® in Systems and Control , 8(3):176–284, 2020

work page 2020
[33]

Jiang, A

Z.-P. Jiang, A. R. Teel, and L. Praly. Small-gain theore m for ISS systems and appli- cations. Mathematics of Control, Signals and Systems , 7:95–120, 1994

work page 1994
[34]

R. Kalman. Contribution to the theory of optimal contro l. Bolet ´ ın de la Sociedad Matem´ atica Mexicana, 5(2):102–119, 1960

work page 1960
[35]

Kamalapurkar, P

R. Kamalapurkar, P. Walters, J. Rosenfeld, and W. Dixon . Reinforcement Learning for Optimal Feedback Control . Springer, Berlin, 2018

work page 2018
[36]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stocha stic optimization. arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[37]

Kiumarsi, K

B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lew is. Optimal and au- tonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems , 29(6):2042–2062, 2017

work page 2042
[38]

Kleinman

D. Kleinman. On an iterative technique for riccati equa tion computations. IEEE Transactions on Automatic Control , 13(1):114–115, 1968

work page 1968
[39]

Kleinman

D. Kleinman. Optimal stationary control of linear syst ems with control-dependent noise. IEEE Transactions on Automatic Control , 14(6):673–677, 1969

work page 1969
[40]

Kolm and Nicholas Westray

Petter N. Kolm and Nicholas Westray. Mean-variance opt imization for simulation of order ﬂow. Journal of Portfolio Management , 48(6), 2022. A FULLY DATA-DRIVEN V ALUE ITERATION FOR STOCHASTIC LQR 39

work page 2022
[41]

Krauth, S

K. Krauth, S. Tu, and B. Recht. Finite-time analysis of a pproximate policy itera- tion for the linear quadratic regulator. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[42]

Value iteration for stochastic LQR with con- vergence guarantees

Jing Lai, Junlin Xiong, and Yu Kang. Value iteration for stochastic LQR with con- vergence guarantees. IEEE Transactions on Neural Networks and Learning Systems , 2025

work page 2025
[43]

R. J. Leake and R.-W. Liu. Construction of suboptimal co ntrol sequences. SIAM Journal on Control , 5(1):54–63, 1967

work page 1967
[44]

D. Lee. Convergence of dynamic programming on the semid eﬁnite cone for discrete- time inﬁnite-horizon LQR. IEEE Transactions on Automatic Control , 67(10):5661– 5668, 2022

work page 2022
[45]

F. L. Lewis, D. Vrabie, and V. L. Syrmos. Optimal Control . John Wiley & Sons, Hoboken, New Jersey, 2012

work page 2012
[46]

J. R. Magnus and H. Neudecker. Matrix Diﬀerential Calculus with Applications in Statistics and Econometrics . John Wiley & Sons, Hoboken, New Jersey, 2019

work page 2019
[47]

Dynamic portfoli o choice with linear re- balancing rules

Ciamac C Moallemi and Mehmet Sa˘ glam. Dynamic portfoli o choice with linear re- balancing rules. Journal of Financial and Quantitative Analysis , 52(3):1247–1278, 2017

work page 2017
[48]

Pang and Z

B. Pang and Z. P. Jiang. Robust reinforcement learning: A case study in linear quadratic regulation. In Proceedings of the AAAI conference on artiﬁcial intelligen ce, volume 35, pages 9303–9311, 2021

work page 2021
[49]

Pang and Z.-P

B. Pang and Z.-P. Jiang. Reinforcement learning for ada ptive optimal station- ary control of linear stochastic systems. IEEE Transactions on Automatic Control , 68(4):2383–2390, 2022

work page 2022
[50]

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pr ogram- ming. John Wiley & Sons, Hoboken, New Jersey, 2014

work page 2014
[51]

B. Recht. A tour of reinforcement learning: The view fro m continuous control. Annual Review of Control, Robotics, and Autonomous Systems , 2:253–279, 2019

work page 2019
[52]

Data-driven near optimization for fast sampling singularly perturbed syste ms

Hao Shen, Chuanjun Peng, Huaicheng Yan, and Shengyuan X u. Data-driven near optimization for fast sampling singularly perturbed syste ms. IEEE Transactions on Automatic Control, 69(7):4689–4694, 2024

work page 2024
[53]

Converge nce and robustness of value and policy iteration for the linear quadratic regul ator

Bowen Song, Chenxuan Wu, and Andrea Iannelli. Converge nce and robustness of value and policy iteration for the linear quadratic regul ator. arXiv preprint arXiv:2411.04548, 2024

work page arXiv 2024
[54]

E. D. Sontag. Smooth stabilization implies coprime fac torization. IEEE Transactions on Automatic Control , 34(4):435–443, 1989

work page 1989
[55]

E. D. Sontag. Input-to-State Stability: Basic Concepts and Results , pages 163–220. Lecture Notes in Mathematics. Springer Verlag, Germany, 20 08

work page
[56]

B. L. Stevens, F. L. Lewis, and E. N. Johnson. Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems . John Wiley & Sons, Hoboken, New Jersey, 2015

work page 2015
[57]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 2nd edition, 2018. 40LEILEI CUI, ZHONG-PING JIANG, PETTER N. KOLM, GREGOIRE G. MACQUERON

work page 2018
[58]

G. Teschl. Ordinary Diﬀerential Equations and Dynamical Systems , volume 140. American Mathematical Society, Providence, Rhode Island, 2024

work page 2024
[59]

Tu and B

S. Tu and B. Recht. Least-squares temporal diﬀerence le arning for the linear qua- dratic regulator. In International Conference on Machine Learning , pages 5005–5014. PMLR, 2018

work page 2018
[60]

Tu and B

S. Tu and B. Recht. The gap between model-based and model -free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019

work page 2019
[61]

J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. M. De Moor. A note on persis- tency of excitation. Systems & Control Letters , 54(4):325–329, 2005

work page 2005
[62]

Zhang and T

X. Zhang and T. Ba¸ sar. Revisiting LQR control from the p erspective of receding- horizon policy gradient. IEEE Control Systems Letters , 7:1664–1669, 2023

work page 2023

[1] [1]

Abbasi-Yadkori, N

Y. Abbasi-Yadkori, N. Lazic, and C. Szepesv´ ari. Model- free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics , pages 3108–3117. PMLR, 2019

work page 2019

[2] [2]

Abbasi-Yadkori and C

Y. Abbasi-Yadkori and C. Szepesv´ ari. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theor y, pages 1–26. JMLR Workshop and Conference Proceedings, 2011

work page 2011

[3] [3]

LQG for portfolio optimization

Marc Abeille, Alessandro Lazaric, Xavier Brokmann, et a l. Lqg for portfolio opti- mization. arXiv preprint arXiv:1611.00997 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[4] [4]

K. J. ˚ Astr¨ om and B. Wittenmark.Adaptive Control. Addison-Wesley, MA, USA, 2nd edition, 1997. A FULLY DATA-DRIVEN V ALUE ITERATION FOR STOCHASTIC LQR 37

work page 1997

[5] [5]

R. W. Beard, G. N. Saridis, and J. T. Wen. Galerkin approxi mations of the generalized Hamilton-Jacobi-Bellman equation. Automatica, 33(12):2159–2177, 1997

work page 1997

[6] [6]

R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957

work page 1957

[7] [7]

D. P. Bertsekas. Dynamic Programming and Optimal Control , volume 1. Athena Sci- entiﬁc, Belmont, MA, 3rd edition, 2011

work page 2011

[8] [8]

T. Bian, Y. Jiang, and Z.-P. Jiang. Adaptive dynamic prog ramming and optimal control of nonlinear nonaﬃne systems. Automatica, 50(10):2624–2632, 2014

work page 2014

[9] [9]

T. Bian, Y. Jiang, and Z.-P. Jiang. Adaptive dynamic prog ramming for stochastic systems with state and control dependent noise. IEEE Transactions on Automatic Control, 61(12):4170–4175, 2016

work page 2016

[10] [10]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Value iteration and adaptive dy namic programming for data-driven adaptive optimal control design. Automatica, 71:348–360, 2016

work page 2016

[11] [11]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Continuous-time robust dynami c programming. SIAM Jour- nal on Control and Optimization , 57(6):4150–4174, 2019

work page 2019

[12] [12]

Bian and Z.-P

T. Bian and Z.-P. Jiang. Reinforcement learning and ada ptive optimal control for continuous-time nonlinear systems: A value iteration appr oach. IEEE Transactions on Neural Networks and Learning Systems , 33(7):2781–2790, 2022

work page 2022

[13] [13]

T. Bian, D. M. Wolpert, and Z.-P. Jiang. Model-free robu st optimal feedback mech- anisms of biological motor control. Neural Computation , 32(3):562–595, 2020

work page 2020

[14] [14]

S. J. Bradtke, B. E. Ydstie, and A. G. Barto. Adaptive lin ear quadratic control using policy iteration. In Proceedings of 1994 American Control Conference , pages 3475–3479, 1994

work page 1994

[15] [15]

Cui, Z.-P

L. Cui, Z.-P. Jiang, and E. D. Sontag. Small-disturbanc e input-to-state stability of perturbed gradient ﬂows: Applications to LQR problem. Systems & Control Letters , 188:105804, 2024

work page 2024

[16] [16]

L. Cui, B. Pang, and Z.-P. Jiang. Reinforcement-learni ng-based risk-sensitive opti- mal feedback mechanisms of biological motor control. In 62nd IEEE Conference on Decision and Control (CDC) , pages 7944–7949, 2023

work page 2023

[17] [17]

L. Cui, S. Wang, J. Zhang, D. Zhang, J. Lai, Y. Zheng, Z. Zh ang, and Z.-P. Jiang. Learning-based balance control of wheel-legged robots. IEEE Robotics and Automa- tion Letters, 6(4):7667–7674, 2021

work page 2021

[18] [18]

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sam ple complexity of the linear quadratic regulator. Foundations of Computational Mathematics , 20(4):633– 679, 2020

work page 2020

[19] [19]

Value iteration for LQR cont rol of unknown stochastic- parameter linear systems

Wenwu Fan and Junlin Xiong. Value iteration for LQR cont rol of unknown stochastic- parameter linear systems. Systems & Control Letters , 185:105731, 2024

work page 2024

[20] [20]

Machine learning applications for data center optimization

Jim Gao. Machine learning applications for data center optimization. White Paper 21, Google, Mountain View, CA, USA, 2014

work page 2014

[21] [21]

Gˆ arleanu and L

N. Gˆ arleanu and L. H. Pedersen. Dynamic trading with predictable returns and trans- action costs. The Journal of Finance , 68(6):2309–2340, 2013

work page 2013

[22] [22]

W. Y. Ha, S. Chakraborty, Y. Yu, S. Ghasemi, and Z.-P. Jia ng. Automated lane changing through learning-based control: An experimental study. In IEEE 26th Inter- national Conference on Intelligent Transportation System s (ITSC) , pages 4215–4220, 2023. 38LEILEI CUI, ZHONG-PING JIANG, PETTER N. KOLM, GREGOIRE G. MACQUERON

work page 2023

[23] [23]

G. Hewer. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Transactions on Automatic Control , 16(4):382–384, 1971

work page 1971

[24] [24]

Matrix analysis

Roger A Horn and Charles R Johnson. Matrix analysis . Cambridge University Press, 2012

work page 2012

[25] [25]

R. A. Howard. Dynamic Programming and Markov Processes . John Wiley & Sons, New York, 1960

work page 1960

[26] [26]

Huang, Z.-P

M. Huang, Z.-P. Jiang, and K. Ozbay. Learning-based ada ptive optimal control for connected vehicles in mixed traﬃc: Robustness to driver rea ction time. IEEE Trans- actions on Cybernetics , 52(6):5267–5277, 2022

work page 2022

[27] [27]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Computational adaptive optim al control for continuous-time linear systems with completely unknown dynamics. Automatica, 48(10):2699–2704, 2012

work page 2012

[28] [28]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Adaptive dynamic programming as a theory of sensorimotor control. Biological Cybernetics, 108(4):459–473, 2014

work page 2014

[29] [29]

Jiang and Z.-P

Y. Jiang and Z.-P. Jiang. Robust Adaptive Dynamic Programming. Wiley-IEEE Press, Hoboken, New Jersey, 2017

work page 2017

[30] [30]

Adaptive optimal control of networked control systems with two-channel stoc hastic dropouts

Yi Jiang, Weinan Gao, Ci Chen, Tianyou Chai, and Frank L L ewis. Adaptive optimal control of networked control systems with two-channel stoc hastic dropouts. SIAM Journal on Control and Optimization , 61(5):3183–3208, October 2023

work page 2023

[31] [31]

Adaptive linear quadrat ic control for stochastic discrete-time linear systems with unmeasurable multiplic ative and additive noises

Yi Jiang, Lu Liu, and Gang Feng. Adaptive linear quadrat ic control for stochastic discrete-time linear systems with unmeasurable multiplic ative and additive noises. IEEE Transactions on Automatic Control , 69(11):7808–7815, November 2024

work page 2024

[32] [32]

Jiang, T

Z.-P. Jiang, T. Bian, and W. Gao. Learning-based contro l: A tutorial and some recent results. Foundations and Trends ® in Systems and Control , 8(3):176–284, 2020

work page 2020

[33] [33]

Jiang, A

Z.-P. Jiang, A. R. Teel, and L. Praly. Small-gain theore m for ISS systems and appli- cations. Mathematics of Control, Signals and Systems , 7:95–120, 1994

work page 1994

[34] [34]

R. Kalman. Contribution to the theory of optimal contro l. Bolet ´ ın de la Sociedad Matem´ atica Mexicana, 5(2):102–119, 1960

work page 1960

[35] [35]

Kamalapurkar, P

R. Kamalapurkar, P. Walters, J. Rosenfeld, and W. Dixon . Reinforcement Learning for Optimal Feedback Control . Springer, Berlin, 2018

work page 2018

[36] [36]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stocha stic optimization. arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[37] [37]

Kiumarsi, K

B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lew is. Optimal and au- tonomous control using reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems , 29(6):2042–2062, 2017

work page 2042

[38] [38]

Kleinman

D. Kleinman. On an iterative technique for riccati equa tion computations. IEEE Transactions on Automatic Control , 13(1):114–115, 1968

work page 1968

[39] [39]

Kleinman

D. Kleinman. Optimal stationary control of linear syst ems with control-dependent noise. IEEE Transactions on Automatic Control , 14(6):673–677, 1969

work page 1969

[40] [40]

Kolm and Nicholas Westray

Petter N. Kolm and Nicholas Westray. Mean-variance opt imization for simulation of order ﬂow. Journal of Portfolio Management , 48(6), 2022. A FULLY DATA-DRIVEN V ALUE ITERATION FOR STOCHASTIC LQR 39

work page 2022

[41] [41]

Krauth, S

K. Krauth, S. Tu, and B. Recht. Finite-time analysis of a pproximate policy itera- tion for the linear quadratic regulator. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[42] [42]

Value iteration for stochastic LQR with con- vergence guarantees

Jing Lai, Junlin Xiong, and Yu Kang. Value iteration for stochastic LQR with con- vergence guarantees. IEEE Transactions on Neural Networks and Learning Systems , 2025

work page 2025

[43] [43]

R. J. Leake and R.-W. Liu. Construction of suboptimal co ntrol sequences. SIAM Journal on Control , 5(1):54–63, 1967

work page 1967

[44] [44]

D. Lee. Convergence of dynamic programming on the semid eﬁnite cone for discrete- time inﬁnite-horizon LQR. IEEE Transactions on Automatic Control , 67(10):5661– 5668, 2022

work page 2022

[45] [45]

F. L. Lewis, D. Vrabie, and V. L. Syrmos. Optimal Control . John Wiley & Sons, Hoboken, New Jersey, 2012

work page 2012

[46] [46]

J. R. Magnus and H. Neudecker. Matrix Diﬀerential Calculus with Applications in Statistics and Econometrics . John Wiley & Sons, Hoboken, New Jersey, 2019

work page 2019

[47] [47]

Dynamic portfoli o choice with linear re- balancing rules

Ciamac C Moallemi and Mehmet Sa˘ glam. Dynamic portfoli o choice with linear re- balancing rules. Journal of Financial and Quantitative Analysis , 52(3):1247–1278, 2017

work page 2017

[48] [48]

Pang and Z

B. Pang and Z. P. Jiang. Robust reinforcement learning: A case study in linear quadratic regulation. In Proceedings of the AAAI conference on artiﬁcial intelligen ce, volume 35, pages 9303–9311, 2021

work page 2021

[49] [49]

Pang and Z.-P

B. Pang and Z.-P. Jiang. Reinforcement learning for ada ptive optimal station- ary control of linear stochastic systems. IEEE Transactions on Automatic Control , 68(4):2383–2390, 2022

work page 2022

[50] [50]

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pr ogram- ming. John Wiley & Sons, Hoboken, New Jersey, 2014

work page 2014

[51] [51]

B. Recht. A tour of reinforcement learning: The view fro m continuous control. Annual Review of Control, Robotics, and Autonomous Systems , 2:253–279, 2019

work page 2019

[52] [52]

Data-driven near optimization for fast sampling singularly perturbed syste ms

Hao Shen, Chuanjun Peng, Huaicheng Yan, and Shengyuan X u. Data-driven near optimization for fast sampling singularly perturbed syste ms. IEEE Transactions on Automatic Control, 69(7):4689–4694, 2024

work page 2024

[53] [53]

Converge nce and robustness of value and policy iteration for the linear quadratic regul ator

Bowen Song, Chenxuan Wu, and Andrea Iannelli. Converge nce and robustness of value and policy iteration for the linear quadratic regul ator. arXiv preprint arXiv:2411.04548, 2024

work page arXiv 2024

[54] [54]

E. D. Sontag. Smooth stabilization implies coprime fac torization. IEEE Transactions on Automatic Control , 34(4):435–443, 1989

work page 1989

[55] [55]

E. D. Sontag. Input-to-State Stability: Basic Concepts and Results , pages 163–220. Lecture Notes in Mathematics. Springer Verlag, Germany, 20 08

work page

[56] [56]

B. L. Stevens, F. L. Lewis, and E. N. Johnson. Aircraft Control and Simulation: Dynamics, Controls Design, and Autonomous Systems . John Wiley & Sons, Hoboken, New Jersey, 2015

work page 2015

[57] [57]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction . MIT Press, Cambridge, MA, 2nd edition, 2018. 40LEILEI CUI, ZHONG-PING JIANG, PETTER N. KOLM, GREGOIRE G. MACQUERON

work page 2018

[58] [58]

G. Teschl. Ordinary Diﬀerential Equations and Dynamical Systems , volume 140. American Mathematical Society, Providence, Rhode Island, 2024

work page 2024

[59] [59]

Tu and B

S. Tu and B. Recht. Least-squares temporal diﬀerence le arning for the linear qua- dratic regulator. In International Conference on Machine Learning , pages 5005–5014. PMLR, 2018

work page 2018

[60] [60]

Tu and B

S. Tu and B. Recht. The gap between model-based and model -free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019

work page 2019

[61] [61]

J. C. Willems, P. Rapisarda, I. Markovsky, and B. L. M. De Moor. A note on persis- tency of excitation. Systems & Control Letters , 54(4):325–329, 2005

work page 2005

[62] [62]

Zhang and T

X. Zhang and T. Ba¸ sar. Revisiting LQR control from the p erspective of receding- horizon policy gradient. IEEE Control Systems Letters , 7:1664–1669, 2023

work page 2023