Continuous-time q-learning for mean-field control with common noise, part-I: Theoretical foundations
Pith reviewed 2026-05-07 09:11 UTC · model grok-4.3
The pith
Under a concavity condition, the optimal policy for entropy-regularized mean-field control with common noise is identified as a two-layer fixed point of the argmax operator on the integrated q-function.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We investigate the continuous-time counterpart of the q-function for entropy-regularized mean-field control with controlled common noise. We show that, under discretely sampled actions, the value function in the exploratory formulation converges to the one in the relaxed control formulation as the time grid refines. Leveraging the relaxed control formulation, we derive the exploratory Hamilton-Jacobi-Bellman equation, in which the controlled common noise gives rise to an additional nonlinear functional of policy. Under a concavity condition, we establish the existence and uniqueness of the optimal one-step policy iteration via a first-order condition using the partial linear functional derivative with respect to policy.
What carries the argument
The integrated q-function (Iq-function) defined on the state distribution and the policy, which identifies an optimal policy as a two-layer fixed point of its argmax operator.
If this is right
- The value function under discrete action sampling converges to the relaxed-control value as the discretization refines.
- The exploratory HJB equation incorporates an extra nonlinear functional of policy induced by controlled common noise.
- Policy improvement at each iteration corresponds to an entropy-regularized optimization problem over the space of policies.
- In the general linear-quadratic setting the optimal policy is explicitly a Gaussian distribution.
- Optimal policies satisfy the two-layer fixed-point relation with respect to the argmax of the Iq-function.
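As a simplified illustration of the Gaussian claim in the list above: in the single-agent case without the common-noise term, maximizing the expected q-value plus `gamma` times the entropy of the policy gives a Gibbs density proportional to exp(q / gamma), which is Gaussian whenever q is a concave quadratic in the action. The sketch below checks this numerically on a grid. The quadratic `q_values`, the temperature `gamma`, and all parameter values are assumptions chosen for illustration; the paper's LQ setting also carries mean-field and common-noise terms that are not modeled here.

```python
import numpy as np

def gibbs_policy(q_values, actions, gamma):
    """Density proportional to exp(q / gamma) on a uniform action grid."""
    logits = q_values / gamma
    logits = logits - logits.max()          # stabilise the exponential
    weights = np.exp(logits)
    da = actions[1] - actions[0]
    return weights / (weights.sum() * da)   # normalise so the density integrates to 1

# Hypothetical concave quadratic q(a) = -0.5 * c * (a - m)^2 (illustrative values)
c, m, gamma = 2.0, 0.7, 0.5
actions = np.linspace(-10.0, 10.0, 20001)
q_values = -0.5 * c * (actions - m) ** 2

density = gibbs_policy(q_values, actions, gamma)
da = actions[1] - actions[0]
mean = np.sum(actions * density) * da
var = np.sum((actions - mean) ** 2 * density) * da

# For a Gaussian Gibbs density the mean should match m and the variance gamma / c.
print(mean, var, gamma / c)
```

In this toy case the variance scales linearly with the temperature, so a larger entropy weight produces a wider exploration distribution around the same mean.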
Where Pith is reading between the lines
- The two-layer fixed-point structure suggests an alternating numerical scheme that iterates between updating the policy and the state distribution.
- The explicit Gaussian solution in the LQ case supplies an exact benchmark against which to test approximation methods for non-LQ problems.
- The convergence result between discrete sampling and relaxed controls opens a route to simulation-based algorithms that discretize time while preserving the continuous-time limit.
- Verification of the concavity condition in concrete applications such as portfolio optimization or traffic networks would determine the range of practical use.
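The discretization route mentioned in the list above can be illustrated with a deliberately small toy model, which is an assumption of this sketch and not the paper's mean-field system. Take dX = a dt + sigma dW with X_0 = 0, where the action a is redrawn from N(0, policy_std^2) at each grid point and held constant over the interval. The relaxed dynamics replace a by its policy mean (zero here), so the limiting terminal variance is sigma^2 T, while sampling on a grid of size `mesh` adds an extra mesh * T * policy_std^2. The function name and all parameter values are hypothetical.

```python
import numpy as np

def terminal_variance(mesh, horizon=1.0, sigma=0.3, policy_std=1.0,
                      n_paths=20_000, seed=0):
    """Monte Carlo variance of X_T when actions are resampled every `mesh`."""
    rng = np.random.default_rng(seed)
    n_steps = int(round(horizon / mesh))
    # One action per grid interval, drawn from the policy N(0, policy_std^2)
    actions = rng.normal(0.0, policy_std, size=(n_paths, n_steps))
    # Brownian increments over each interval
    dw = rng.normal(0.0, np.sqrt(mesh), size=(n_paths, n_steps))
    x_terminal = (actions * mesh + sigma * dw).sum(axis=1)
    return x_terminal.var()

relaxed_limit = 0.3 ** 2 * 1.0  # sigma^2 * T for the relaxed dynamics
for mesh in (0.5, 0.1, 0.02):
    exact = mesh * 1.0 * 1.0 ** 2 + relaxed_limit  # closed form for this toy model
    print(mesh, terminal_variance(mesh), exact, relaxed_limit)
```

The gap to the relaxed limit shrinks linearly in the mesh in this toy model; the rate in the paper's setting may differ.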
Load-bearing premise
The concavity condition invoked to guarantee existence and uniqueness of the optimal one-step policy iteration.
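To make the role of concavity concrete, here is a worked toy example. The functional $F$, the weight $\kappa$, and the quadratic $q$ are assumptions chosen for illustration; they are not the nonlinear functional that appears in the paper's exploratory HJB equation.

```latex
% Toy objective over densities \pi on the action space, with m(\pi) := \int a\,\pi(a)\,da:
F(\pi) = \int q(a)\,\pi(a)\,da \;-\; \gamma \int \pi(a)\log\pi(a)\,da \;-\; \frac{\kappa}{2}\, m(\pi)^2,
\qquad \gamma > 0,\ \kappa \ge 0.

% Linear functional derivative with respect to \pi (defined up to an additive constant):
\frac{\delta F}{\delta \pi}(\pi)(a) = q(a) - \gamma \log \pi(a) - \kappa\, m(\pi)\, a .

% First-order condition: the derivative is constant in a at the maximizer \pi^*, hence
\pi^*(a) \propto \exp\!\Big( \tfrac{1}{\gamma}\big( q(a) - \kappa\, m(\pi^*)\, a \big) \Big).

% With q(a) = -\tfrac{c}{2}(a-b)^2,\ c > 0, the maximizer is Gaussian:
\pi^* = \mathcal{N}\!\Big( m^*, \tfrac{\gamma}{c} \Big),
\qquad m^* = b - \frac{\kappa}{c}\, m^* \;\Longrightarrow\; m^* = \frac{c\,b}{c + \kappa}.
```

The first term of $F$ is linear in $\pi$, the entropy term is strictly concave, and the last term is concave because it is minus the square of a linear functional of $\pi$. Strict concavity is what makes the first-order condition pick out a single maximizer, even though that condition is itself a fixed-point equation in $\pi^*$ through $m(\pi^*)$.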
What would settle it
A concrete mean-field control instance in which the concavity condition is violated yet the one-step policy iteration still admits a unique solution, or conversely admits multiple solutions.
Original abstract
This paper investigates the continuous-time counterpart of the Q-function for entropy-regularized mean-field control (MFC) with controlled common noise, termed the q-function by Jia and Zhou (2023) in the single-agent setting. We first show that, under discretely sampled actions, the value function in the exploratory formulation converges to the one in the relaxed control formulation as the time grid refines. Leveraging the relaxed control formulation, we derive the exploratory Hamilton-Jacobi-Bellman (HJB) equation, in which the controlled common noise gives rise to an additional nonlinear functional of policy, rendering the policy iteration intricate. Under a certain concavity condition, we establish the existence and uniqueness of the optimal one-step policy iteration via a first-order condition using the partial linear functional derivative with respect to policy. The policy improvement at each iteration is verified by relating it to an entropy-regularized optimization problem over the space of policies. In the mean-field setting, we introduce the integrated q-function (Iq-function) defined on the state distribution and the policy, and it is shown that an optimal policy is identified as a two-layer fixed point of the argmax operator of the Iq-function. Finally, we provide the explicit characterization of an optimal policy as a Gaussian distribution in the general linear-quadratic (LQ) setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops theoretical foundations for continuous-time q-learning in entropy-regularized mean-field control with controlled common noise. It establishes convergence of the exploratory value function (under discretely sampled actions) to the relaxed-control value as the time grid refines, derives the exploratory HJB equation (with an additional nonlinear functional of policy induced by controlled common noise), proves existence and uniqueness of the optimal one-step policy iteration under a concavity condition via a first-order condition using the partial linear functional derivative with respect to policy, introduces the integrated q-function (Iq-function) on state distributions and policies to obtain a two-layer fixed-point characterization of optimal policies, and derives an explicit Gaussian characterization of the optimal policy in the general linear-quadratic setting.
Significance. If the derivations hold, the work supplies a rigorous basis for policy-iteration algorithms in MFC with common noise, extending the single-agent q-function of Jia and Zhou (2023) while handling mean-field interactions. The explicit LQ Gaussian form provides a concrete, falsifiable prediction that can be checked numerically. The introduction of the Iq-function is a natural but non-trivial extension that organizes the fixed-point argument.
major comments (2)
- [Abstract / policy-iteration derivation] Abstract and the section deriving the one-step policy iteration: the 'certain concavity condition' required for existence and uniqueness of the optimal policy via the first-order condition (partial linear functional derivative w.r.t. policy) is never stated explicitly. It is unclear whether concavity is imposed on the exploratory value functional, on the Iq-function, or on the entropy-regularized objective, and in which topology on the space of relaxed controls. Because the subsequent two-layer fixed-point characterization and the verification of policy improvement both rest on this step, the claim cannot be assessed without the precise functional and topology.
- [Exploratory HJB derivation] Section on the exploratory HJB and the nonlinear functional of policy: the additional nonlinear term arising from controlled common noise is asserted to render policy iteration 'intricate,' yet no explicit verification is given that the concavity condition (whatever it is) is preserved under the mean-field interaction. If concavity fails to hold globally, the uniqueness of the argmax and the identification of the optimal policy as a two-layer fixed point of the Iq-function both collapse.
minor comments (2)
- [Iq-function introduction] The definition of the Iq-function (state distribution and policy arguments) should be written with explicit functional dependence on the mean-field measure to avoid ambiguity when the common-noise term is present.
- [Notation] Notation for relaxed controls versus ordinary controls should be unified across the convergence result and the HJB derivation.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which help clarify the presentation of our results on continuous-time q-learning for entropy-regularized mean-field control with common noise. We address each major comment below and will revise the manuscript to improve explicitness and rigor.
Point-by-point responses
- Referee: [Abstract / policy-iteration derivation] Abstract and the section deriving the one-step policy iteration: the 'certain concavity condition' required for existence and uniqueness of the optimal policy via the first-order condition (partial linear functional derivative w.r.t. policy) is never stated explicitly. It is unclear whether concavity is imposed on the exploratory value functional, on the Iq-function, or on the entropy-regularized objective, and in which topology on the space of relaxed controls. Because the subsequent two-layer fixed-point characterization and the verification of policy improvement both rest on this step, the claim cannot be assessed without the precise functional and topology.
  Authors: We agree that the concavity condition requires explicit statement. In the revised version, we will define it precisely as the strict concavity of the entropy-regularized objective (equivalently, the exploratory value functional) with respect to the relaxed control in the topology of weak convergence of measures on the space of admissible policies. This concavity is with respect to the policy variable for fixed state distribution, and it is used to guarantee a unique maximizer via the first-order condition involving the partial linear functional derivative. We will also note that the Iq-function inherits this property through its definition on state distributions and policies, ensuring the two-layer fixed-point characterization holds.
  Revision: yes
- Referee: [Exploratory HJB derivation] Section on the exploratory HJB and the nonlinear functional of policy: the additional nonlinear term arising from controlled common noise is asserted to render policy iteration 'intricate,' yet no explicit verification is given that the concavity condition (whatever it is) is preserved under the mean-field interaction. If concavity fails to hold globally, the uniqueness of the argmax and the identification of the optimal policy as a two-layer fixed point of the Iq-function both collapse.
  Authors: The referee correctly identifies that preservation of concavity under the mean-field interaction with controlled common noise is not explicitly verified in the current draft. The nonlinear functional arises from the common noise term in the exploratory HJB, but the entropy regularization and the linear-quadratic structure in the general LQ case ensure concavity is retained (as confirmed by the explicit Gaussian characterization). For the general case, we will add a dedicated remark or short appendix showing that the concavity condition is preserved because the mean-field interaction enters linearly in the dynamics and the objective remains jointly concave in the policy for fixed distributions. If this verification requires additional assumptions, we will state them explicitly rather than claiming global validity.
  Revision: partial
Circularity Check
Minor self-citation to prior single-agent q-function; central MFC derivations remain independent
full rationale
The paper cites Jia and Zhou (2023) only to name the base q-function concept from the single-agent case and then extends it to mean-field control with common noise via standard HJB and relaxed-control machinery. The existence/uniqueness result under the concavity assumption, the two-layer fixed-point identification of the optimal policy via the newly introduced Iq-function, and the Gaussian characterization in the LQ case are all developed from first principles within this manuscript without reducing to the citation by construction or to any fitted input. No self-definitional loops, ansatz smuggling, or renaming of known results occur.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: a certain concavity condition
invented entities (1)
- Integrated q-function (Iq-function): no independent evidence
Reference graph
Works this paper leans on
- [1] C. Bender and N. T. Thuan (2024): On the grid-sampling limit SDE. Preprint, available at arXiv:2410.07778
- [2] L. Bo, Y. Huang and X. Yu (2025): On optimal tracking portfolio in incomplete markets: The reinforcement learning approach. SIAM Journal on Control and Optimization. 63(1), 321-348
- [3] R. Buckdahn, J. Li, S. Peng and C. Rainer: Mean-field stochastic differential equations and associated PDEs. Annals of Probability. 45(2):824-878
- [4] R. Buckdahn, Y. Chen and J. Li (2021): Partial derivative with respect to the measure and its application to general controlled mean-field systems. Stochastic Processes and their Applications. 134: 265-307
- [5] R. Carmona, F. Delarue and A. Lachapelle (2013): Control of McKean-Vlasov dynamics versus mean field games. Mathematics and Financial Economics. 7, 131-166
- [6] R. Carmona and F. Delarue (2018a): Probabilistic Theory of Mean Field Games with Applications, Vol I. Springer
- [7] R. Carmona and F. Delarue (2018b): Probabilistic Theory of Mean Field Games with Applications, Vol II. Springer
- [8] R. Carmona and M. Laurière (2025): Reconciling Discrete-Time Mixed Policies and Continuous-Time Relaxed Controls in Reinforcement Learning and Stochastic Control. Preprint, available at arXiv:2504.21793
- [9] R. Carmona, M. Laurière and Z. Tan (2023): Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Annals of Applied Probability. 33(6B), 5334-5381
- [10] J.F. Chassagneux, D. Crisan and F. Delarue (2022): A probabilistic approach to classical solutions of the master equation for large population equilibria. Memoirs of the AMS, volume 280
- [12] G. Conforti, A. Kazeykina and Z. Ren (2023): Game on random environment, mean-field Langevin system, and neural networks. Mathematics of Operations Research. 48(1):78-99
- [13] D. Crisan and E. McMurray (2018): Smoothing properties of McKean–Vlasov SDEs. Probability Theory and Related Fields, 171:97–148
- [14] M. Dai, Y. Dong and Y. Jia (2023): Learning equilibrium mean-variance strategy. Mathematical Finance. 33(4), 1166-1212
- [16] M. F. Djete, D. Possamaï and X. Tan (2022): McKean–Vlasov optimal control: the dynamic programming principle. The Annals of Probability. 50(2):791-833
- [17] Y. Dong (2024): Randomized optimal stopping problem in continuous time and reinforcement learning algorithm. SIAM Journal on Control and Optimization. 62(3), 1590-1614
- [19] O. Kallenberg (2002): Foundations of Modern Probability. Probability and its Applications (New York). Springer Verlag, New York, second edition
- [20] D. Lacker (2015): Mean field games via controlled martingale problems: existence of Markovian equilibria. Stochastic Processes and their Applications. 125(7):2856-2894
- [22] P. Graber (2016): Linear quadratic mean field type control and mean field games with common noise, with applications to production of an exhaustible resource. Applied Mathematics & Optimization. 74, 459-486
- [23] H. Gu, X. Guo, X. Wei and R. Xu (2021): Mean-field controls with Q-learning for cooperative MARL: Convergence and complexity analysis. SIAM Journal on Mathematics of Data Science. 3(4), 1168-1196
- [24] H. Gu, X. Guo, X. Wei and R. Xu (2023): Dynamic programming principles for mean-field controls with learning. Operations Research. 71(4), 1040-1054
- [25] X. Guo, R. Xu and T. Zariphopoulou (2022): Entropy regularization for mean field games with learning. Mathematics of Operations Research. 47(4), 3239-3260
- [26] X. Han, R. Wang and X. Y. Zhou (2023): Choquet regularization for continuous-time reinforcement learning. SIAM Journal on Control and Optimization. 61(5), 2777-2801
- [27] M. Huang, R. P. Malhamé and P. E. Caines (2006): Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information and Systems. 6(3), 221–252
- [28] Y. Huang, M. Li, X. Yu and Z. Zhou (2025): Continuous-time reinforcement learning for optimal switching over multiple regimes. Preprint, available at arXiv:2512.04697
- [32] Y. Jia, D. Ouyang and Y. Zhang (2025): Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. SIAM Journal on Control and Optimization, forthcoming
- [33] Y. Jia (2026): Continuous-time risk-sensitive reinforcement learning via quadratic variation penalty. Applied Mathematics & Optimization, forthcoming
- [34] J. M. Lasry and P. L. Lions (2007): Mean field games. Japanese Journal of Mathematics. 2(1), 229-260
- [36] P. L. Lions (2006): Cours au Collège de France: Théorie des jeux à champ moyens. Audio Conference
- [37] R. J. McCann (1997): A convexity principle for interacting gases. Advances in Mathematics. 128(1): 153-179
- [38] M. Motte and H. Pham (2022): Mean-field Markov decision processes with common noise and open-loop controls. Annals of Applied Probability. 32(2):1421-1458
- [39] H. Pham and X. Wei (2017): Dynamic programming for optimal control of stochastic McKean-Vlasov dynamics. SIAM Journal on Control and Optimization. 55(2), 1069-1101
- [40] Z. Ren, X. Wei, X. Yu and X. Y. Zhou (2026): Continuous-time q-learning for mean-field control with common noise, part-II: q-learning algorithms. Working paper
- [41] D. Stroock and S. Varadhan (1997): Multidimensional diffusion processes, volume 233 of Grundlehren der mathematischen Wissenschaften. Springer–Verlag Berlin Heidelberg
- [42] L. Szpruch, T. Treetanthiploet and Y. Zhang (2024): Optimal scheduling of entropy regularizer for continuous-time linear-quadratic reinforcement learning. SIAM Journal on Control and Optimization. 62(1):135–166
- [43] C. Villani (2009): Optimal transport: old and new. Berlin: Springer
- [44] H. Wang, T. Zariphopoulou and X. Y. Zhou (2020): Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research. 21(1):8145-8178
- [46] C. Watkins and P. Dayan (1992): Q-learning. Machine Learning. 8(3):279-292
- [49] W. Wonham (1968): On a matrix Riccati equation of stochastic control. SIAM Journal on Control and Optimization, 6(4):681-697
- [50] J. Yong (2013): Linear-quadratic optimal control problems for mean-field stochastic differential equations. SIAM Journal on Control and Optimization. 51(4):2809-38