Continuous-time q-learning for mean-field control with common noise, part-II: q-learning algorithms
Pith reviewed 2026-05-07 08:33 UTC · model grok-4.3
The pith
Q-learning algorithms learn optimal policies for continuous-time mean-field control with common noise by substituting observable exploratory data for non-observable relaxed distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Based on the relaxed control formulation, the martingale condition of the value function and the Iq-function is established by evaluating along the conditional state distributions generated by all test policies. The error incurred when these non-observable distributions are replaced by observable data from the exploratory formulation under discretely sampled actions is quantified. Combined with the two-layer fixed-point characterization of an optimal policy, this error control permits several algorithms, including an Actor-Critic q-learning procedure in which the policy is updated in the Actor step by the iteration rule induced by the improved Iq-function and the value function together with
What carries the argument
The Actor-Critic q-learning algorithm that updates the policy in the Actor-step from the iteration rule of the improved Iq-function and updates the value function and Iq-function in the Critic-step from the martingale orthogonality condition applied to exploratory data.
If this is right
- The algorithms achieve satisfactory performance when implemented in numerical examples both within and outside the linear-quadratic framework.
- Inner iterations of the Actor step converge in the infinite-horizon linear-quadratic setting.
- The quantified error bound between relaxed and exploratory data justifies the use of practical observable trajectories in the learning updates.
- The two-layer fixed-point structure of an optimal policy permits separate Actor and Critic updates without simultaneous solution of the full optimality system.
Where Pith is reading between the lines
- If the error bound extends to continuous action spaces, the same substitution technique could support q-learning in mean-field control problems without action discretization.
- The approach may transfer to learning in mean-field games where agents face common shocks and only aggregate statistics are observable.
- Applying the algorithms to finite-population approximations of mean-field systems would test whether the convergence properties survive the passage to the infinite-agent limit.
Load-bearing premise
The error incurred when non-observable conditional state distributions from the relaxed control formulation are replaced by observable data from the exploratory formulation under discretely sampled actions can be quantified and controlled.
What would settle it
Running the Actor-Critic algorithm on the infinite-horizon linear-quadratic example and observing that the inner policy iterations diverge from the known optimum, or that the quantified replacement error exceeds the derived bound in the reported test cases, would falsify the claims.
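One way to make the second test concrete is to check whether measured replacement errors actually scale like the claimed O(h). The sketch below is a minimal illustration, not the paper's procedure: it fits the slope of log(error) against log(h) by least squares. The function name `estimate_order` and the input values are hypothetical; the errors are synthetic placeholders generated from the formula 0.5 * h, not results from the paper.

```python
import math


def estimate_order(hs, errors):
    """Fit log(error) = slope * log(h) + intercept by ordinary least squares.

    Returns (slope, constant) where error is approximately constant * h**slope.
    """
    xs = [math.log(h) for h in hs]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, math.exp(intercept)


# Synthetic placeholder data (assumption): errors built as exactly 0.5 * h.
# In practice these would be measured discrepancies at each sampling interval.
hs = [0.1, 0.05, 0.025, 0.0125]
errors = [0.5 * h for h in hs]

order, constant = estimate_order(hs, errors)
print(f"estimated order: {order:.3f}, estimated constant: {constant:.3f}")
```

A fitted slope well below 1, or a fitted constant that grows as h shrinks, would suggest that the measured errors do not follow a first-order bound on the tested range.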
Original abstract
This paper is a continuation work of Ren et al. (2026) aiming to further devise q-learning algorithms for mean-field control (MFC) with controlled common noise. Based on the relaxed control formulation, we first establish the martingale condition of the value function and the Iq-function by evaluating along the conditional state distributions generated by all test policies. As the data in the relaxed control formulation are not observable in practice, we quantify the error incurred when they are replaced by the observable ones in the exploratory formulation under discretely sampled actions. This, together with a two-layer fixed point characterization of an optimal policy in Ren et al. (2026), allows us to propose several algorithms including the Actor-Critic q-learning algorithm, in which the policy is updated in the Actor-step based on the iteration rule induced by the improved Iq-function, and the value function and Iq-function are updated in the Critic-step based on the martingale orthogonality condition using the data from the exploratory formulation. We also establish the convergence of the inner iterations in the Actor-step in an infinite-horizon linear quadratic (LQ) framework. In two examples, within and beyond LQ framework, our q-learning algorithms are implemented with satisfactory performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This continuation paper develops q-learning algorithms for mean-field control with controlled common noise. Starting from the relaxed-control formulation, it derives martingale orthogonality conditions for the value function and Iq-function evaluated along conditional state distributions induced by test policies. It then quantifies the approximation error incurred by replacing those non-observable distributions with observable data generated by an exploratory formulation under discretely sampled actions. Combining this error control with the two-layer fixed-point characterization of an optimal policy from the authors' prior Part-I work, the paper proposes several algorithms, including an Actor-Critic scheme in which the actor updates the policy via the improved Iq-function and the critic updates the value and Iq-functions via the (approximate) martingale conditions. Convergence of the inner actor iterations is proved in the infinite-horizon linear-quadratic case, and the algorithms are illustrated numerically on both LQ and non-LQ examples.
Significance. If the error bounds are rigorous and sufficiently tight, the work supplies the first implementable q-learning procedures for MFC with common noise that are grounded in a relaxed-control martingale characterization. The explicit LQ convergence result for the actor inner loop and the numerical validation constitute concrete strengths. The contribution is incremental on the Part-I fixed-point theory but addresses a practically relevant gap between theoretical relaxed formulations and observable data.
major comments (3)
- [§3.2] §3.2 (error quantification between relaxed and exploratory formulations): The bound on the discrepancy between the non-observable conditional distributions appearing in the martingale orthogonality conditions and the observable data generated under discrete action sampling must be shown to remain small enough that the orthogonality relation is preserved up to a controllable perturbation. The current derivation appears to yield an O(h) term (h = sampling interval), but its dependence on the mean-field interaction Lipschitz constant and the common-noise intensity is not made explicit; without this, the transfer of the two-layer fixed-point iteration from Part I to the implementable Actor-Critic scheme is not guaranteed.
- [§4.3] §4.3 (Actor-Critic algorithm and inner-loop convergence): The policy-update rule in the actor step is induced by the improved Iq-function, yet the convergence proof for the inner iterations (Theorem 5.1) is stated only for the exact LQ case. It is unclear whether the proof accounts for the residual error that propagates from the critic's use of exploratory data; an explicit perturbation analysis showing that the contraction mapping remains valid when the orthogonality condition holds only approximately is required.
- [§5] §5 (numerical examples): The reported performance in the non-LQ example relies on the same error-controlled substitution, but no diagnostic is provided that quantifies how large the realized approximation error actually is (e.g., distance between the empirical conditional distributions and the relaxed ones). This leaves open whether the satisfactory numerical results truly validate the error-control claim or merely reflect a favorable choice of discretization.
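The perturbation analysis requested in the second major comment rests on a standard fact: if a map T is a contraction with rate gamma < 1 and each iterate is perturbed by at most eps, the iterates end up within eps / (1 - gamma) of the fixed point, up to a transient that decays like gamma^k. The sketch below illustrates this on a scalar affine map, which is only a stand-in for the actor inner loop and is not taken from the paper. All names and parameter values are hypothetical.

```python
import random


def perturbed_iteration(gamma, c, eps, x0, steps, seed=0):
    """Iterate x <- gamma * x + c + noise, with |noise| <= eps at every step.

    The unperturbed map T(x) = gamma * x + c is a contraction with rate gamma
    and fixed point c / (1 - gamma).
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        noise = rng.uniform(-eps, eps)  # bounded perturbation (assumption)
        x = gamma * x + c + noise
    return x


# Hypothetical values chosen only for illustration.
gamma, c, eps = 0.8, 1.0, 0.01
x_star = c / (1.0 - gamma)          # exact fixed point of the unperturbed map
x_final = perturbed_iteration(gamma, c, eps, x0=0.0, steps=200)

gap = abs(x_final - x_star)
bound = eps / (1.0 - gamma)         # asymptotic error radius
print(f"gap = {gap:.6f}, bound = {bound:.6f}")
```

The bound follows from unrolling the recursion: the distance after k steps is at most gamma^k times the initial distance plus eps * (1 + gamma + ... + gamma^(k-1)), and the geometric sum is below 1 / (1 - gamma).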
minor comments (2)
- [§2] The notation for the Iq-function and the two-layer fixed-point map is introduced without a self-contained recap of the Part-I definitions; a short paragraph or table summarizing the key objects would improve readability for readers who have not yet consulted the companion paper.
- [§3.1] In the statement of the martingale orthogonality condition, the test-policy class is not explicitly delimited; clarifying whether the class includes only feedback policies or also open-loop controls would help readers verify that the derived conditions are sufficient for optimality.
Simulated Author's Rebuttal
We sincerely thank the referee for the thorough review and constructive comments on our manuscript. We address each major point below and describe the revisions that will be incorporated to strengthen the paper.
- Referee: [§3.2] The bound on the discrepancy between the non-observable conditional distributions appearing in the martingale orthogonality conditions and the observable data generated under discrete action sampling must be shown to remain small enough that the orthogonality relation is preserved up to a controllable perturbation. The current derivation appears to yield an O(h) term (h = sampling interval), but its dependence on the mean-field interaction Lipschitz constant and the common-noise intensity is not made explicit; without this, the transfer of the two-layer fixed-point iteration from Part I to the implementable Actor-Critic scheme is not guaranteed.
  Authors: We thank the referee for highlighting this issue. Proposition 3.2 currently establishes an O(h) error bound under standard Lipschitz assumptions on the dynamics and costs. We agree that the explicit dependence on the mean-field interaction Lipschitz constant L_mf and common-noise intensity σ should be displayed. In the revision we will expand the proof of Proposition 3.2 to obtain the sharper bound C(L_mf, σ) h, where the prefactor C grows at most linearly in L_mf and σ. With this explicit form, the perturbation to the martingale orthogonality condition remains controllable for sufficiently small h, thereby justifying the transfer of the two-layer fixed-point characterization from Part I (with an additional vanishing error term). The updated proposition and proof will appear in the revised Section 3.2.
  revision: yes
- Referee: [§4.3] The policy-update rule in the actor step is induced by the improved Iq-function, yet the convergence proof for the inner iterations (Theorem 5.1) is stated only for the exact LQ case. It is unclear whether the proof accounts for the residual error that propagates from the critic's use of exploratory data; an explicit perturbation analysis showing that the contraction mapping remains valid when the orthogonality condition holds only approximately is required.
  Authors: We appreciate this observation. Theorem 5.1 proves contraction for the exact martingale orthogonality condition in the infinite-horizon LQ setting. In the Actor-Critic algorithm the critic employs approximate data, introducing a residual error of size O(h). We will add a new perturbation lemma in Section 4.3 (immediately preceding Theorem 5.1) that shows: if the orthogonality condition holds up to an additive error ε, then the inner actor iterations converge to a policy whose value function lies within O(ε) of the optimum, while the contraction rate remains strictly less than one for small enough ε. The lemma will be proved by a standard perturbation argument on the Bellman operator and will be used to justify the practical algorithm. This material will be included in the revised manuscript.
  revision: yes
- Referee: [§5] The reported performance in the non-LQ example relies on the same error-controlled substitution, but no diagnostic is provided that quantifies how large the realized approximation error actually is (e.g., distance between the empirical conditional distributions and the relaxed ones). This leaves open whether the satisfactory numerical results truly validate the error-control claim or merely reflect a favorable choice of discretization.
  Authors: We agree that a quantitative diagnostic would strengthen the numerical validation. In the revised Section 5 we will add, for the non-LQ example, a table (or subplot) reporting the 2-Wasserstein distance between the empirical conditional distributions obtained from the exploratory data and the corresponding relaxed-control distributions at representative time instants. The table will also list the chosen sampling interval h and the resulting error magnitude (expected to be on the order of 10^{-2} or smaller). This diagnostic will confirm that the approximation error remains small for the discretization used, thereby supporting the error-control claim beyond the LQ case.
  revision: yes
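The 2-Wasserstein diagnostic proposed in the last response can be sketched for the simplest case. The example below assumes scalar states and two empirical samples of equal size, in which case the optimal coupling pairs the sorted samples and the distance reduces to a root-mean-square difference of order statistics. The function name `wasserstein2_1d` and the sample data are hypothetical, and the paper's state dimension and sampling scheme are not specified here; a multi-dimensional state would need a general optimal-transport solver instead.

```python
import math
import random


def wasserstein2_1d(xs, ys):
    """Empirical 2-Wasserstein distance between two equal-size 1-D samples.

    In one dimension the optimal coupling matches the i-th smallest point of
    xs with the i-th smallest point of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    mean_sq = sum((a - b) ** 2 for a, b in zip(xs_sorted, ys_sorted)) / len(xs)
    return math.sqrt(mean_sq)


# Synthetic placeholder samples (assumption): a Gaussian sample and a shifted,
# reshuffled copy of it. Shifting every point by 0.3 gives distance 0.3.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [x + 0.3 for x in reference]
rng.shuffle(shifted)

w2_shift = wasserstein2_1d(reference, shifted)
w2_self = wasserstein2_1d(reference, reference)
print(f"W2(reference, shifted) = {w2_shift:.6f}, W2(reference, reference) = {w2_self:.6f}")
```

Reporting such a distance at several time instants, alongside the sampling interval h, is the kind of table the response describes.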
Circularity Check
Self-cited two-layer fixed-point characterization load-bearing for algorithm proposal and convergence transfer
specific steps
- self-citation, load-bearing
[Abstract]
"This, together with a two-layer fixed point characterization of an optimal policy in Ren et al. (2026), allows us to propose several algorithms including the Actor-Critic q-learning algorithm, in which the policy is updated in the Actor-step based on the iteration rule induced by the improved Iq-function, and the value function and Iq-function are updated in the Critic-step based on the martingale orthogonality condition using the data from the exploratory formulation. We also establish the convergence of the inner iterations in the Actor-step in an infinite-horizon linear quadratic (LQ) framework."
The algorithms and their convergence rest on combining the paper's new error quantification with the two-layer fixed-point characterization of an optimal policy, which is imported wholesale from the authors' own prior work (part I). While the martingale conditions and error analysis are derived here, the fixed-point structure that justifies the Actor-step iteration rule and allows the convergence argument to transfer is not re-derived or independently verified in this manuscript, making the self-citation load-bearing for the central claims.
full rationale
The paper develops original martingale orthogonality conditions from the relaxed-control formulation and quantifies the replacement error when using observable exploratory data under discrete sampling. These steps are self-contained. However, the proposal of the Actor-Critic q-learning algorithm and the transfer of the fixed-point iteration to the implementable scheme explicitly combine the new error analysis with the two-layer fixed-point characterization of an optimal policy imported from the authors' prior paper (Ren et al. 2026). This creates a moderate load-bearing self-citation dependency for the central algorithmic framework and the claimed convergence of inner Actor iterations (even in the LQ case), though the error control and convergence arguments themselves contain independent content developed here. No reduction by construction or fitted-input renaming occurs within this paper's own derivations.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Martingale condition of the value function and Iq-function obtained by evaluating along conditional state distributions generated by all test policies under the relaxed control formulation.
- domain assumption: Two-layer fixed point characterization of an optimal policy.
Reference graph
Works this paper leans on
- [1] L. Ambrosio, N. Gigli and G. Savaré (2005): Gradient flows: In metric spaces and in the space of probability measures. Springer Science & Business Media
- [2] B. Anahtarci, C. D. Kariksiz and N. Saldi (2022): Q-learning in regularized mean-field games. Preprint, arXiv:2003.12151
- [3] A. Angiuli, J. P. Fouque and M. Laurière (2022): Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, 34(2), 217-271
- [4] A. Angiuli, J. P. Fouque, R. Hu and A. Raydan (2023a): Deep reinforcement learning for infinite horizon mean field problems in continuous spaces. Preprint, available at arXiv:2309.10953. To appear in Journal of Machine Learning
- [5] A. Angiuli, J. P. Fouque, M. Laurière and M. Zhang (2023b): Convergence of multi-scale reinforcement Q-learning algorithms for mean field game and control problems. Preprint, available at arXiv:2312.06659
- [6] L. Bo, Y. Huang and X. Yu (2025): On optimal tracking portfolio in incomplete markets: The reinforcement learning approach. SIAM Journal on Control and Optimization, 63(1), 321-348
- [7] R. Carmona and F. Delarue (2018a): Probabilistic Theory of Mean Field Games with Applications, Vol I. Springer
- [8] R. Carmona and F. Delarue (2018b): Probabilistic Theory of Mean Field Games with Applications, Vol II. Springer
- [9] R. Carmona, J. P. Fouque and L. H. Sun (2015): Mean field games and systemic risk. Communications in Mathematical Sciences, 13(4):911-933
- [10] R. Carmona, F. Delarue and D. Lacker (2016): Mean field games with common noise. Annals of Probability, 44(6), 3740-3803
- [11] R. Carmona and M. Laurière (2025): Reconciling Discrete-Time Mixed Policies and Continuous-Time Relaxed Controls in Reinforcement Learning and Stochastic Control. Preprint, available at arXiv:2504.21793
- [12] R. Carmona, M. Laurière and Z. Tan (2023): Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Annals of Applied Probability, 33(6B), 5334-5381
- [13] J.F. Chassagneux, D. Crisan, and F. Delarue (2022): A probabilistic approach to classical solutions
- [14]
- [15] G. Conforti, A. Kazeykina, Z. Ren (2023): Game on random environment, mean-field Langevin system, and neural networks. Mathematics of Operations Research, 48(1):78-99
- [16] A. Cosso, F. Gozzi, I. Kharroubi, H. Pham and M. Rosestolato (2020): Optimal control of path-dependent McKean-Vlasov SDEs in infinite dimension. Preprint, available at arXiv:2012.14772
- [17] D. Crisan and E. McMurray (2018): Smoothing properties of McKean–Vlasov SDEs. Probability Theory and Related Fields, 171:97–148
- [18] K. Cui, A. Tahir, M. Sinzger and H. Koeppl (2021): Discrete-time mean field control with environment states. In 2021 60th IEEE Conference on Decision and Control (CDC)
- [19] M. Dai, Y. Dong and Y. Jia (2023): Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4), 1166-1212
- [20]
- [21]
- [22] Y. Dong (2024): Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. SIAM Journal on Control and Optimization, 62(3), 1590-1614
- [23]
- [24]
- [25] Y. Duan, X. Chen, R. Houthooft, J. Schulman and P. Abbeel (2016): Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 1329-1338. PMLR
- [26] K. Doya (2000): Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245
- [27] D. Firoozi and S. Jaimungal (2022): Exploratory LQG mean field games with entropy regularization. Automatica, 139:110177
- [28]
- [29]
- [30] M. Giegrich, C. Reisinger and Y. Zhang (2024): Convergence of policy gradient methods for finite-horizon exploratory linear-quadratic control problems. SIAM Journal on Control and Optimization, 62(2):1060-92
- [31] H. Gu, X. Guo, X. Wei and R. Xu (2021): Mean-field controls with Q-learning for cooperative MARL: Convergence and complexity analysis. SIAM Journal on Mathematics of Data Science, 3(4), 1168-1196
- [32] H. Gu, X. Guo, X. Wei and R. Xu (2022): Mean-field multi-agent reinforcement learning: A decentralized network approach. Mathematics of Operations Research, 50(1), 506-536
- [33] X. Guo, R. Xu and T. Zariphopoulou (2022): Entropy regularization for mean field games with learning. Mathematics of Operations Research, 47(4), 3239-3260
- [34] X. Han, R. Wang and X. Y. Zhou (2023): Choquet regularization for continuous-time reinforcement learning. SIAM Journal on Control and Optimization, 61(5), 2777-2801
- [35] Y. Huang, M. Li, X. Yu and Z. Zhou (2025): Continuous-time reinforcement learning for optimal switching over multiple regimes. Preprint, available at arXiv:2512.04697
- [36]
- [37]
- [38]
- [39] Y. Jia (2026): Continuous-time risk-sensitive reinforcement learning via quadratic variation penalty. Applied Mathematics & Optimization, forthcoming
- [40] Y. Jia, D. Ouyang and Y. Zhang. Accuracy of discretely sampled stochastic policies in
- [41] O. Kallenberg (2002): Foundations of Modern Probability. Probability and its Applications (New York). Springer Verlag, New York, second edition
- [42] V. N. Kolokoltsov and M. Troeva (2019): On mean field games with common noise and McKean-Vlasov SPDEs. Stochastic Analysis and Applications, 37(4), 522-549
- [43] D. Lacker and T. Zariphopoulou (2018): Mean field and n-agent games for optimal investment under relative performance criteria. Mathematical Finance, 29: 1003-1038
- [44] J.M. Lasry and P.L. Lions (2007): Mean field games. Japanese Journal of Mathematics, 2(1), 229-260
- [45] M. Laurière, S. Perrin, J. Pérolat, S. Girgin, P. Muller, R. Élie, M. Geist and O. Pietquin (2022): Learning in mean field games: A survey. Preprint, available at arXiv:2205.12944
- [46]
- [47] P.L. Lions (2006): Cours au Collège de France: Théorie des jeux à champ moyens. Audio Conference
- [48] R. J. McCann (1997): A convexity principle for interacting gases. Advances in Mathematics, 128(1): 153-179
- [49] M. Motte and H. Pham (2022): Mean-field Markov decision processes with common noise and open-loop controls. Annals of Applied Probability, 32(2):1421-1458
- [50]
- [51]
- [52] B. Pasztor, I. Bogunovic and A. Krause (2021): Efficient model-based multi-agent mean-field reinforcement learning. Preprint, available at arXiv:2107.04050
- [53] H. Pham and X. Wei (2017): Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM Journal on Control and Optimization, 55(2), 1069-1101
- [54] H. Pham and X. Warin (2024): Mean-field neural networks-based algorithms for McKean-Vlasov control problems. Journal of Machine Learning, 3:176-214
- [55] H. Pham and X. Warin (2023): Mean-field neural networks: learning mappings on Wasserstein space. Neural Networks, 168:380-93
- [56] Z. Ren, X. Wei, X. Yu and X. Y. Zhou (2026): Continuous-time q-learning for mean-field control with common noise, part-I: Theoretical foundations. Working paper
- [57] H. Wang, T. Zariphopoulou and X. Y. Zhou (2020): Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(1):8145-8178
- [58]
- [59] C. J. Watkins (1989): Learning from delayed rewards. Ph.D. thesis, Cambridge University
- [60] C. Watkins and P. Dayan (1992): Q-learning. Machine Learning, 8(3):279-292
- [61]
- [62]